

Hadoop: The Definitive Guide
Tom White
Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo
Hadoop: The Definitive Guide, Third Edition
by Tom White
Revision History for the Third Edition:
2012-01-27 Early release revision 1
See http://oreilly.com/catalog/errata.csp?isbn=9781449311520 for release details.
ISBN: 978-1-449-31152-0
For Eliane, Emilia, and Lottie
Table of Contents
Foreword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
1. Meet Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Data! 1
Data Storage and Analysis 3
Comparison with Other Systems 4
Grid Computing 6
Volunteer Computing 8
A Brief History of Hadoop 9
Apache Hadoop and the Hadoop Ecosystem 12
Hadoop Releases 13
What's Covered in this Book 14
Compatibility 15
2. MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
A Weather Dataset 17
Data Format 17
Analyzing the Data with Unix Tools 19
Analyzing the Data with Hadoop 20
Map and Reduce 20
Java MapReduce 22
Scaling Out 30
Data Flow 31
Combiner Functions 34
Running a Distributed MapReduce Job 37
Hadoop Streaming 37
Ruby 37
Python 40
Hadoop Pipes 41
Compiling and Running 42
3. The Hadoop Distributed Filesystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
The Design of HDFS 45
HDFS Concepts 47
Blocks 47
Namenodes and Datanodes 48
HDFS Federation 49
HDFS High-Availability 50
The Command-Line Interface 51
Basic Filesystem Operations 52
Hadoop Filesystems 54
Interfaces 55
The Java Interface 57
Reading Data from a Hadoop URL 57
Reading Data Using the FileSystem API 59
Writing Data 62
Directories 64
Querying the Filesystem 64
Deleting Data 69
Data Flow 69
Anatomy of a File Read 69
Anatomy of a File Write 72
Coherency Model 75
Parallel Copying with distcp 76
Keeping an HDFS Cluster Balanced 78
Hadoop Archives 78
Using Hadoop Archives 79
Limitations 80
4. Hadoop I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Data Integrity 83
Data Integrity in HDFS 83
LocalFileSystem 84
ChecksumFileSystem 85
Compression 85
Codecs 87
Compression and Input Splits 91
Using Compression in MapReduce 92
Serialization 94
The Writable Interface 95
Writable Classes 98
iv | Table of Contents
Implementing a Custom Writable 105
Serialization Frameworks 110
Avro 112
File-Based Data Structures 132
SequenceFile 132
MapFile 139
5. Developing a MapReduce Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
The Configuration API 146
Combining Resources 147
Variable Expansion 148
Configuring the Development Environment 148
Managing Configuration 148
GenericOptionsParser, Tool, and ToolRunner 151
Writing a Unit Test 154
Mapper 154
Reducer 156
Running Locally on Test Data 157
Running a Job in a Local Job Runner 157
Testing the Driver 161
Running on a Cluster 162
Packaging 162
Launching a Job 162
The MapReduce Web UI 164
Retrieving the Results 167
Debugging a Job 169
Hadoop Logs 173
Remote Debugging 175
Tuning a Job 176
Profiling Tasks 177
MapReduce Workflows 180
Decomposing a Problem into MapReduce Jobs 180
JobControl 182
Apache Oozie 182
6. How MapReduce Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
Anatomy of a MapReduce Job Run 187
Classic MapReduce (MapReduce 1) 188
YARN (MapReduce 2) 194
Failures 200
Failures in Classic MapReduce 200
Failures in YARN 202
Job Scheduling 204
The Fair Scheduler 205
The Capacity Scheduler 205
Shuffle and Sort 205
The Map Side 206
The Reduce Side 207
Configuration Tuning 209
Task Execution 212
The Task Execution Environment 212
Speculative Execution 213
Output Committers 215
Task JVM Reuse 216
Skipping Bad Records 217
7. MapReduce Types and Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
MapReduce Types 221
The Default MapReduce Job 225
Input Formats 232
Input Splits and Records 232
Text Input 243
Binary Input 247
Multiple Inputs 248
Database Input (and Output) 249
Output Formats 249
Text Output 250
Binary Output 251
Multiple Outputs 251
Lazy Output 255
Database Output 256
8. MapReduce Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
Counters 257
Built-in Counters 257
User-Defined Java Counters 262
User-Defined Streaming Counters 266
Sorting 266
Preparation 266
Partial Sort 268
Total Sort 272
Secondary Sort 276
Joins 281
Map-Side Joins 282
Reduce-Side Joins 284
Side Data Distribution 287
Using the Job Configuration 287
Distributed Cache 288
MapReduce Library Classes 294
9. Setting Up a Hadoop Cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
Cluster Specification 295
Network Topology 297
Cluster Setup and Installation 299
Installing Java 300
Creating a Hadoop User 300
Installing Hadoop 300
Testing the Installation 301
SSH Configuration 301
Hadoop Configuration 302
Configuration Management 303
Environment Settings 305
Important Hadoop Daemon Properties 309
Hadoop Daemon Addresses and Ports 314
Other Hadoop Properties 315
User Account Creation 318
YARN Configuration 318
Important YARN Daemon Properties 319
YARN Daemon Addresses and Ports 322
Security 323
Kerberos and Hadoop 324
Delegation Tokens 326
Other Security Enhancements 327
Benchmarking a Hadoop Cluster 329
Hadoop Benchmarks 329
User Jobs 331
Hadoop in the Cloud 332
Hadoop on Amazon EC2 332
10. Administering Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
HDFS 337
Persistent Data Structures 337
Safe Mode 342
Audit Logging 344
Tools 344
Monitoring 349
Logging 349
Metrics 350
Java Management Extensions 353
Maintenance 355
Routine Administration Procedures 355
Commissioning and Decommissioning Nodes 357
Upgrades 360
11. Pig . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
Installing and Running Pig 366
Execution Types 366
Running Pig Programs 368
Grunt 368
Pig Latin Editors 369
An Example 369
Generating Examples 371
Comparison with Databases 372
Pig Latin 373
Structure 373
Statements 375
Expressions 379
Types 380
Schemas 382
Functions 386
Macros 388
User-Defined Functions 389
A Filter UDF 389
An Eval UDF 392
A Load UDF 394
Data Processing Operators 397
Loading and Storing Data 397
Filtering Data 397
Grouping and Joining Data 400
Sorting Data 405
Combining and Splitting Data 406
Pig in Practice 407
Parallelism 407
Parameter Substitution 408
12. Hive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 411
Installing Hive 412
The Hive Shell 413
An Example 414
Running Hive 415
Configuring Hive 415
Hive Services 417
The Metastore 419
Comparison with Traditional Databases 421
Schema on Read Versus Schema on Write 421
Updates, Transactions, and Indexes 422
HiveQL 422
Data Types 424
Operators and Functions 426
Tables 427
Managed Tables and External Tables 427
Partitions and Buckets 429
Storage Formats 433
Importing Data 438
Altering Tables 440
Dropping Tables 441
Querying Data 441
Sorting and Aggregating 441
MapReduce Scripts 442
Joins 443
Subqueries 446
Views 447
User-Defined Functions 448
Writing a UDF 449
Writing a UDAF 451
13. HBase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 457
HBasics 457
Backdrop 458
Concepts 458
Whirlwind Tour of the Data Model 458
Implementation 459
Installation 462
Test Drive 463
Clients 465
Java 465
Avro, REST, and Thrift 468
Example 469
Schemas 470
Loading Data 471
Web Queries 474
HBase Versus RDBMS 477
Successful Service 478
HBase 479
Use Case: HBase at Streamy.com 479
Praxis 481
Versions 481
UI 483
Metrics 483
Schema Design 483
Counters 484
Bulk Load 484
14. ZooKeeper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487
Installing and Running ZooKeeper 488
An Example 490
Group Membership in ZooKeeper 490
Creating the Group 491
Joining a Group 493
Listing Members in a Group 494
Deleting a Group 496
The ZooKeeper Service 497
Data Model 497
Operations 499
Implementation 503
Consistency 505
Sessions 507
States 509
Building Applications with ZooKeeper 510
A Configuration Service 510
The Resilient ZooKeeper Application 513
A Lock Service 517
More Distributed Data Structures and Protocols 519
ZooKeeper in Production 520
Resilience and Performance 521
Configuration 522
15. Sqoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 525
Getting Sqoop 525
A Sample Import 527
Generated Code 530
Additional Serialization Systems 531
Database Imports: A Deeper Look 531
Controlling the Import 534
Imports and Consistency 534
Direct-mode Imports 534
Working with Imported Data 535
Imported Data and Hive 536
Importing Large Objects 538
Performing an Export 540
Exports: A Deeper Look 541
Exports and Transactionality 543
Exports and SequenceFiles 543
16. Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 545
Hadoop Usage at Last.fm 545
Last.fm: The Social Music Revolution 545
Hadoop at Last.fm 545
Generating Charts with Hadoop 546
The Track Statistics Program 547
Summary 554
Hadoop and Hive at Facebook 554
Introduction 554
Hadoop at Facebook 554
Hypothetical Use Case Studies 557
Hive 560
Problems and Future Work 564
Nutch Search Engine 565
Background 565
Data Structures 566
Selected Examples of Hadoop Data Processing in Nutch 569
Summary 578
Log Processing at Rackspace 579
Requirements/The Problem 579
Brief History 580
Choosing Hadoop 580
Collection and Storage 580
MapReduce for Logs 581
Cascading 587
Fields, Tuples, and Pipes 588
Operations 590
Taps, Schemes, and Flows 592
Cascading in Practice 593
Flexibility 596
Hadoop and Cascading at ShareThis 597
Summary 600
TeraByte Sort on Apache Hadoop 601
Using Pig and Wukong to Explore Billion-edge Network Graphs 604
Measuring Community 606
Everybody's Talkin' at Me: The Twitter Reply Graph 606
Symmetric Links 609
Community Extraction 610
A. Installing Apache Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613
B. Cloudera’s Distribution for Hadoop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 619
C. Preparing the NCDC Weather Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 621
Foreword
Hadoop got its start in Nutch. A few of us were attempting to build an open source web search engine and having trouble managing computations running on even a handful of computers. Once Google published its GFS and MapReduce papers, the route became clear. They'd devised systems to solve precisely the problems we were having with Nutch. So we started, two of us, half-time, to try to re-create these systems as a part of Nutch.
We managed to get Nutch limping along on 20 machines, but it soon became clear that to handle the Web's massive scale, we'd need to run it on thousands of machines and, moreover, that the job was bigger than two half-time developers could handle.
Around that time, Yahoo! got interested, and quickly put together a team that I joined. We split off the distributed computing part of Nutch, naming it Hadoop. With the help of Yahoo!, Hadoop soon grew into a technology that could truly scale to the Web.
In 2006, Tom White started contributing to Hadoop. I already knew Tom through an excellent article he'd written about Nutch, so I knew he could present complex ideas in clear prose. I soon learned that he could also develop software that was as pleasant to read as his prose.
From the beginning, Tom's contributions to Hadoop showed his concern for users and for the project. Unlike most open source contributors, Tom is not primarily interested in tweaking the system to better meet his own needs, but rather in making it easier for anyone to use.
Initially, Tom specialized in making Hadoop run well on Amazon's EC2 and S3 services. Then he moved on to tackle a wide variety of problems, including improving the MapReduce APIs, enhancing the website, and devising an object serialization framework. In all cases, Tom presented his ideas precisely. In short order, Tom earned the role of Hadoop committer and soon thereafter became a member of the Hadoop Project Management Committee.
Tom is now a respected senior member of the Hadoop developer community. Though he's an expert in many technical corners of the project, his specialty is making Hadoop easier to use and understand.
Given this, I was very pleased when I learned that Tom intended to write a book about Hadoop. Who could be better qualified? Now you have the opportunity to learn about Hadoop from a master, not only of the technology but also of common sense and plain talk.
-Doug Cutting
Shed in the Yard, California
Preface
Martin Gardner, the mathematics and science writer, once said in an interview:
Beyond calculus, I am lost. That was the secret of my column's success. It took me so long to understand what I was writing about that I knew how to write in a way most readers would understand.
In many ways, this is how I feel about Hadoop. Its inner workings are complex, resting as they do on a mixture of distributed systems theory, practical engineering, and common sense. And to the uninitiated, Hadoop can appear alien.
But it doesn't need to be like this. Stripped to its core, the tools that Hadoop provides for building distributed systems (for data storage, data analysis, and coordination) are simple. If there's a common theme, it is about raising the level of abstraction: to create building blocks for programmers who just happen to have lots of data to store, or lots of data to analyze, or lots of machines to coordinate, and who don't have the time, the skill, or the inclination to become distributed systems experts to build the infrastructure to handle it.
With such a simple and generally applicable feature set, it seemed obvious to me when I started using it that Hadoop deserved to be widely used. However, at the time (in early 2006), setting up, configuring, and writing programs to use Hadoop was an art. Things have certainly improved since then: there is more documentation, there are more examples, and there are thriving mailing lists to go to when you have questions. And yet the biggest hurdle for newcomers is understanding what this technology is capable of, where it excels, and how to use it. That is why I wrote this book.
The Apache Hadoop community has come a long way. Over the course of three years, the Hadoop project has blossomed and spun off half a dozen subprojects. In this time, the software has made great leaps in performance, reliability, scalability, and manageability. To gain even wider adoption, however, I believe we need to make Hadoop even easier to use. This will involve writing more tools; integrating with more systems; and writing new, improved APIs. I'm looking forward to being a part of this, and I hope this book will encourage and enable others to do so, too.
1. "The science of fun," Alex Bellos, The Guardian, May 31, 2008, http://www.guardian.co.uk/science/
Administrative Notes
During discussion of a particular Java class in the text, I often omit its package name, to reduce clutter. If you need to know which package a class is in, you can easily look it up in Hadoop's Java API documentation for the relevant subproject, linked to from the Apache Hadoop home page at http://hadoop.apache.org/. Or if you're using an IDE, it can help using its auto-complete mechanism.
Similarly, although it deviates from usual style guidelines, program listings that import multiple classes from the same package may use the asterisk wildcard character to save space (for example: import org.apache.hadoop.io.*).
The sample programs in this book are available for download from the website that accompanies this book: http://www.hadoopbook.com/. You will also find instructions there for obtaining the datasets that are used in examples throughout the book, as well as further notes for running the programs in the book, and links to updates, additional resources, and my blog.
What’s in This Book?
The rest of this book is organized as follows. Chapter 1 emphasizes the need for Hadoop and sketches the history of the project. Chapter 2 provides an introduction to MapReduce. Chapter 3 looks at Hadoop filesystems, and in particular HDFS, in depth. Chapter 4 covers the fundamentals of I/O in Hadoop: data integrity, compression, serialization, and file-based data structures.
The next four chapters cover MapReduce in depth. Chapter 5 goes through the practical steps needed to develop a MapReduce application. Chapter 6 looks at how MapReduce is implemented in Hadoop, from the point of view of a user. Chapter 7 is about the MapReduce programming model, and the various data formats that MapReduce can work with. Chapter 8 is on advanced MapReduce topics, including sorting and joining data.
Chapters 9 and 10 are for Hadoop administrators, and describe how to set up and maintain a Hadoop cluster running HDFS and MapReduce.
Later chapters are dedicated to projects that build on Hadoop or are related to it. Chapters 11 and 12 present Pig and Hive, which are analytics platforms built on HDFS and MapReduce, whereas Chapters 13, 14, and 15 cover HBase, ZooKeeper, and Sqoop, respectively.
Finally, Chapter 16 is a collection of case studies contributed by members of the Apache Hadoop community.
What’s New in the Second Edition?
The second edition has two new chapters on Hive and Sqoop (Chapters 12 and 15), a new section covering Avro (in Chapter 4), an introduction to the new security features in Hadoop (in Chapter 9), and a new case study on analyzing massive network graphs using Hadoop (in Chapter 16).
This edition continues to describe the 0.20 release series of Apache Hadoop, since this was the latest stable release at the time of writing. New features from later releases are occasionally mentioned in the text, however, with reference to the version that they were introduced in.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
This icon signifies a tip, suggestion, or general note.
This icon indicates a warning or caution.
Using Code Examples
This book is here to help you get your job done. In general, you may use the code in this book in your programs and documentation. You do not need to contact us for permission unless you're reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O'Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product's documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: "Hadoop: The Definitive Guide, Second Edition, by Tom White. Copyright 2011 Tom White, 978-1-449-38973-4."
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
Safari® Books Online
Safari Books Online is an on-demand digital library that lets you easily search over 7,500 technology and creative reference books and videos to find the answers you need quickly.
With a subscription, you can read any page and watch any video from our library online. Read books on your cell phone and mobile devices. Access new titles before they are available for print, and get exclusive access to manuscripts in development and post feedback for the authors. Copy and paste code samples, organize your favorites, download chapters, bookmark key sections, create notes, print out pages, and benefit from tons of other time-saving features.
O'Reilly Media has uploaded this book to the Safari Books Online service. To have full digital access to this book and others on similar topics from O'Reilly and other publishers, sign up for free at http://my.safaribooksonline.com.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at:
The author also has a site for this book at:
To comment or ask technical questions about this book, send email to:
For more information about our books, conferences, Resource Centers, and the O'Reilly Network, see our website at:
Acknowledgments
I have relied on many people, both directly and indirectly, in writing this book. I would like to thank the Hadoop community, from whom I have learned, and continue to learn, a great deal.
In particular, I would like to thank Michael Stack and Jonathan Gray for writing the chapter on HBase. Also thanks go to Adrian Woodhead, Marc de Palol, Joydeep Sen Sarma, Ashish Thusoo, Andrzej Bialecki, Stu Hood, Chris K. Wensel, and Owen O'Malley for contributing case studies for Chapter 16.
I would like to thank the following reviewers who contributed many helpful suggestions and improvements to my drafts: Raghu Angadi, Matt Biddulph, Christophe Bisciglia, Ryan Cox, Devaraj Das, Alex Dorman, Chris Douglas, Alan Gates, Lars George, Patrick Hunt, Aaron Kimball, Peter Krey, Hairong Kuang, Simon Maxen, Olga Natkovich, Benjamin Reed, Konstantin Shvachko, Allen Wittenauer, Matei Zaharia, and Philip Zeyliger. Ajay Anand kept the review process flowing smoothly. Philip ("flip") Kromer kindly helped me with the NCDC weather dataset featured in the examples in this book.
Special thanks to Owen O'Malley and Arun C. Murthy for explaining the intricacies of the MapReduce shuffle to me. Any errors that remain are, of course, to be laid at my door.
For the second edition, I owe a debt of gratitude for the detailed review and feedback from Jeff Bean, Doug Cutting, Glynn Durham, Alan Gates, Jeff Hammerbacher, Alex Kozlov, Ken Krugler, Jimmy Lin, Todd Lipcon, Sarah Sproehnle, Vinithra Varadharajan, and Ian Wrigley, as well as all the readers who submitted errata for the first edition. I would also like to thank Aaron Kimball for contributing the chapter on Sqoop, and Philip ("flip") Kromer for the case study on graph processing.
I am particularly grateful to Doug Cutting for his encouragement, support, and friendship, and for contributing the foreword.
Thanks also go to the many others with whom I have had conversations or email discussions over the course of writing the book.
Halfway through writing this book, I joined Cloudera, and I want to thank my colleagues for being incredibly supportive in allowing me the time to write, and to get it finished promptly.
I am grateful to my editor, Mike Loukides, and his colleagues at O'Reilly for their help in the preparation of this book. Mike has been there throughout to answer my questions, to read my first drafts, and to keep me on schedule.
Finally, the writing of this book has been a great deal of work, and I couldn't have done it without the constant support of my family. My wife, Eliane, not only kept the home going, but also stepped in to help review, edit, and chase case studies. My daughters, Emilia and Lottie, have been very understanding, and I'm looking forward to spending lots more time with all of them.
Meet Hadoop
In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers.
-Grace Hopper
We live in the data age. It's not easy to measure the total volume of data stored electronically, but an IDC estimate put the size of the "digital universe" at 0.18 zettabytes in 2006, and is forecasting a tenfold growth by 2011 to 1.8 zettabytes. A zettabyte is 10²¹ bytes, or equivalently one thousand exabytes, one million petabytes, or one billion terabytes. That's roughly the same order of magnitude as one disk drive for every person in the world.
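That last comparison can be sanity-checked with a couple of lines of arithmetic. The 2006 world population figure below is an illustrative assumption, not a number given in the text:

```python
# Rough check of the "one disk drive per person" comparison above.
# Assumes a 2006 world population of about 6.5 billion people -- an
# illustrative figure, not one taken from the text.
digital_universe_2006 = 0.18e21   # 0.18 zettabytes, in bytes
population = 6.5e9
bytes_per_person = digital_universe_2006 / population
print(bytes_per_person / 1e9)     # roughly 28 gigabytes per person
```

A few tens of gigabytes per person is indeed the same order of magnitude as a consumer disk drive of that era.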
This flood of data is coming from many sources. Consider the following:
• The New York Stock Exchange generates about one terabyte of new trade data per day.
• Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
• Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
• The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20 terabytes per month.
• The Large Hadron Collider near Geneva, Switzerland, will produce about 15 petabytes of data per year.
1. From Gantz et al., "The Diverse and Exploding Digital Universe," March 2008 (http://www.emc.com/
2. http://www.intelligententerprise.com/showArticle.jhtml?articleID=207800705, http://mashable.com/2008/10/15/facebook-10-billion-photos/, http://blog.familytreemagazine.com/insider/Inside-Ancestrycoms-TopSecret-Data-Center.aspx, and http://www.archive.org/about/faqs.php, http://www
So there's a lot of data out there. But you are probably wondering how it affects you. Most of the data is locked up in the largest web properties (like search engines), or scientific or financial institutions, isn't it? Does the advent of "Big Data," as it is being called, affect smaller organizations or individuals?
I argue that it does. Take photos, for example. My wife's grandfather was an avid photographer, and took photographs throughout his adult life. His entire corpus of medium format, slide, and 35mm film, when scanned in at high resolution, occupies around 10 gigabytes. Compare this to the digital photos that my family took in 2008, which take up about 5 gigabytes of space. My family is producing photographic data at 35 times the rate my wife's grandfather did, and the rate is increasing every year as it becomes easier to take more and more photos.
More generally, the digital streams that individuals are producing are growing apace. Microsoft Research's MyLifeBits project gives a glimpse of the archiving of personal information that may become commonplace in the near future. MyLifeBits was an experiment where an individual's interactions (phone calls, emails, documents) were captured electronically and stored for later access. The data gathered included a photo taken every minute, which resulted in an overall data volume of one gigabyte a month. When storage costs come down enough to make it feasible to store continuous audio and video, the data volume for a future MyLifeBits service will be many times that.
The trend is for every individual's data footprint to grow, but perhaps more important, the amount of data generated by machines will be even greater than that generated by people. Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions: all of these contribute to the growing mountain of data.
The volume of data being made publicly available increases every year, too. Organizations no longer have to merely manage their own data: success in the future will be dictated to a large extent by their ability to extract value from other organizations' data. Initiatives such as Public Data Sets on Amazon Web Services, Infochimps.org, and theinfo.org exist to foster the "information commons," where data can be freely (or in the case of AWS, for a modest price) shared for anyone to download and analyze. Mashups between different information sources make for unexpected and hitherto unimaginable applications.
Take, for example, the Astrometry.net project, which watches the Astrometry group on Flickr for new photos of the night sky. It analyzes each image and identifies which part of the sky it is from, as well as any interesting celestial bodies, such as stars or galaxies. This project shows the kind of things that are possible when data (in this case, tagged photographic images) is made available and used for something (image analysis) that was not anticipated by the creator.
It has been said that "More data usually beats better algorithms," which is to say that for some problems (such as recommending movies or music based on past preferences), however fiendish your algorithms are, they can often be beaten simply by having more data (and a less sophisticated algorithm).
The good news is that Big Data is here. The bad news is that we are struggling to store and analyze it.
Data Storage and Analysis
The problem is simple: while the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. One typical drive from 1990 could store 1,370 MB of data and had a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes. Over 20 years later, one terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.
This is a long time to read all data on a single drive, and writing is even slower. The obvious way to reduce the time is to read from multiple disks at once. Imagine if we had 100 drives, each holding one hundredth of the data. Working in parallel, we could read the data in under two minutes.
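The figures in the last two paragraphs follow from back-of-the-envelope arithmetic; the capacities and transfer speeds below are the approximate values quoted above, not exact drive specifications:

```python
# Time to read a full drive sequentially, in minutes.
def read_time_minutes(capacity_mb, speed_mb_per_s):
    return capacity_mb / speed_mb_per_s / 60

# A 1990 drive: 1,370 MB at 4.4 MB/s -- around five minutes.
print(read_time_minutes(1370, 4.4))             # about 5.2 minutes

# A modern 1 TB drive at 100 MB/s -- more than two and a half hours.
print(read_time_minutes(1_000_000, 100) / 60)   # about 2.8 hours

# The same terabyte split across 100 drives read in parallel.
print(read_time_minutes(1_000_000 / 100, 100))  # under 2 minutes
```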
Only using one hundredth of a disk may seem wasteful. But we can store one hundred datasets, each of which is one terabyte, and provide shared access to them. We can imagine that the users of such a system would be happy to share access in return for shorter analysis times, and, statistically, that their analysis jobs would be likely to be spread over time, so they wouldn't interfere with each other too much.
There's more to being able to read and write data in parallel to or from multiple disks, though. The first problem to solve is hardware failure: as soon as you start using many pieces of hardware, the chance that one will fail is fairly high. A common way of avoiding data loss is through replication: redundant copies of the data are kept by the system so that in the event of failure, there is another copy available. This is how RAID works, for instance, although Hadoop's filesystem, the Hadoop Distributed Filesystem (HDFS), takes a slightly different approach, as you shall see later.
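The replication idea can be shown with a toy placement function. This is a sketch of the general technique only: the replication factor of 3 matches HDFS's default, but the placement logic is invented for illustration and is not how HDFS actually chooses where copies live:

```python
# Toy replication: write each block to several distinct disks so that
# losing any single disk still leaves another copy of the data.
def place_replicas(block_id, num_disks, replication=3):
    # Start at a block-dependent disk, then take the next disks in order,
    # which guarantees the replicas land on distinct disks.
    start = hash(block_id) % num_disks
    return [(start + i) % num_disks for i in range(replication)]

def survives_failure(replica_disks, failed_disk):
    # The block survives a single-disk failure if at least one replica
    # lives on a different disk.
    return any(d != failed_disk for d in replica_disks)

disks = place_replicas("block-0001", num_disks=100)
print(len(set(disks)))                    # 3 distinct disks hold a copy
print(survives_failure(disks, disks[0]))  # True: two other copies remain
```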
The second problem is that most analysis tasks need to be able to combine the data in some way; data read from one disk may need to be combined with the data from any of the other 99 disks. Various distributed systems allow data to be combined from multiple sources, but doing this correctly is notoriously challenging. MapReduce provides a programming model that abstracts the problem from disk reads and writes, transforming it into a computation over sets of keys and values. We will look at the details of this model in later chapters, but the important point for the present discussion is that there are two parts to the computation, the map and the reduce, and it's the interface between the two where the "mixing" occurs. Like HDFS, MapReduce has built-in reliability.

3. The quote is from Anand Rajaraman writing about the Netflix Challenge (http://anand.typepad.com/datawocky/2008/03/more-data-usual.html). Alon Halevy, Peter Norvig, and Fernando Pereira make the same point in "The Unreasonable Effectiveness of Data," IEEE Intelligent Systems, March/April 2009.
4. These specifications are for the Seagate ST-41600n.
This, in a nutshell, is what Hadoop provides: a reliable shared storage and analysis system. The storage is provided by HDFS and analysis by MapReduce. There are other parts to Hadoop, but these capabilities are its kernel.
Comparison with Other Systems
The approach taken by MapReduce may seem like a brute-force approach. The premise is that the entire dataset, or at least a good portion of it, is processed for each query. But this is its power. MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results in a reasonable time is transformative. It changes the way you think about data, and unlocks data that was previously archived on tape or disk. It gives people the opportunity to innovate with data. Questions that took too long to get answered before can now be answered, which in turn leads to new questions and new insights.
For example, Mailtrust, Rackspace's mail division, used Hadoop for processing email logs. One ad hoc query they wrote was to find the geographic distribution of their users. In their words:

    This data was so useful that we've scheduled the MapReduce job to run monthly and we will be using this data to help us decide which Rackspace data centers to place new mail servers in as we grow.

By bringing several hundred gigabytes of data together and having the tools to analyze it, the Rackspace engineers were able to gain an understanding of the data that they otherwise would never have had, and, furthermore, they were able to use what they had learned to improve the service for their customers. You can read more about how Rackspace uses Hadoop in Chapter 16.
Why can't we use databases with lots of disks to do large-scale batch analysis? Why is MapReduce needed?
The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate. Seeking is the process of moving the disk's head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth.

If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate. On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate it can perform seeks) works well. For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
In many ways, MapReduce can be seen as a complement to an RDBMS. (The differences between the two systems are shown in Table 1-1.) MapReduce is a good fit for problems that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times of a relatively small amount of data. MapReduce suits applications where the data is written once, and read many times, whereas a relational database is good for datasets that are continually updated.
Table 1-1. RDBMS compared to MapReduce

             Traditional RDBMS            MapReduce
Data size    Gigabytes                    Petabytes
Access       Interactive and batch        Batch
Updates      Read and write many times    Write once, read many times
Structure    Static schema                Dynamic schema
Integrity    High                         Low
Scaling      Nonlinear                    Linear
Another difference between MapReduce and an RDBMS is the amount of structure in the datasets that they operate on. Structured data is data that is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema. This is the realm of the RDBMS. Semi-structured data, on the other hand, is looser, and though there may be a schema, it is often ignored, so it may be used only as a guide to the structure of the data: for example, a spreadsheet, in which the structure is the grid of cells, although the cells themselves may hold any form of data. Unstructured data does not have any particular internal structure: for example, plain text or image data. MapReduce works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not an intrinsic property of the data, but they are chosen by the person analyzing the data.
Relational data is often normalized to retain its integrity and remove redundancy. Normalization poses problems for MapReduce, since it makes reading a record a non-local operation, and one of the central assumptions that MapReduce makes is that it is possible to perform (high-speed) streaming reads and writes.

A web server log is a good example of a set of records that is not normalized (for example, the client hostnames are specified in full each time, even though the same client may appear many times), and this is one reason that logfiles of all kinds are particularly well-suited to analysis with MapReduce.
MapReduce is a linearly scalable programming model. The programmer writes two functions, a map function and a reduce function, each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster that they are operating on, so they can be used unchanged for a small dataset and for a massive one. More important, if you double the size of the input data, a job will run twice as slow. But if you also double the size of the cluster, a job will run as fast as the original one. This is not generally true of SQL queries.
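As a tiny concrete illustration of the model, here is an in-memory sketch of the map and reduce contract in Python, using word counting as the example. This is not Hadoop's API, just the shape of the two functions: each defines a mapping from one set of key-value pairs to another, and neither knows how big the input is.

```python
# An in-memory sketch of the map/reduce contract (not Hadoop itself).
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    # key: line offset (ignored here); value: a line of text.
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Receives all values for one key, emits the aggregated pair.
    yield (key, sum(values))

def run(records):
    # The framework's job: run map, sort by key, group, run reduce.
    intermediate = sorted(
        (kv for k, v in records for kv in map_fn(k, v)),
        key=itemgetter(0))
    return [out
            for key, group in groupby(intermediate, key=itemgetter(0))
            for out in reduce_fn(key, (v for _, v in group))]

print(run([(0, "the quick brown fox"), (20, "the lazy dog")]))
# The functions see only key-value pairs, so the same map_fn and
# reduce_fn work unchanged whether records holds two lines or two billion.
```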
Over time, however, the differences between relational databases and MapReduce systems are likely to blur, both as relational databases start incorporating some of the ideas from MapReduce (such as Aster Data's and Greenplum's databases) and, from the other direction, as higher-level query languages built on MapReduce (such as Pig and Hive) make MapReduce systems more approachable to traditional database programmers.
Grid Computing
The High Performance Computing (HPC) and Grid Computing communities have been doing large-scale data processing for years, using such APIs as Message Passing Interface (MPI). Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared filesystem, hosted by a SAN. This works well for predominantly compute-intensive jobs, but becomes a problem when nodes need to access larger data volumes (hundreds of gigabytes, the point at which MapReduce really starts to shine), since the network bandwidth is the bottleneck and compute nodes become idle.
5. In January 2007, David J. DeWitt and Michael Stonebraker caused a stir by publishing "MapReduce: A major step backwards" (http://databasecolumn.vertica.com/database-innovation/mapreduce-a-major-step-backwards), in which they criticized MapReduce for being a poor substitute for relational databases. Many commentators argued that it was a false comparison (see, for example, Mark C. Chu-Carroll's "Databases are hammers; MapReduce is a screwdriver," http://scienceblogs.com/goodmath/2008/01/databases_are_hammers_mapreduc.php), and DeWitt and Stonebraker followed up with "MapReduce II" (http://databasecolumn.vertica.com/database-innovation/mapreduce-ii), where they addressed the main topics brought up by others.
MapReduce tries to collocate the data with the compute node, so data access is fast since it is local. This feature, known as data locality, is at the heart of MapReduce and is the reason for its good performance. Recognizing that network bandwidth is the most precious resource in a data center environment (it is easy to saturate network links by copying data around), MapReduce implementations go to great lengths to conserve it by explicitly modelling network topology. Notice that this arrangement does not preclude high-CPU analyses in MapReduce.
MPI gives great control to the programmer, but requires that he or she explicitly handle the mechanics of the data flow, exposed via low-level C routines and constructs, such as sockets, as well as the higher-level algorithm for the analysis. MapReduce operates only at the higher level: the programmer thinks in terms of functions of key and value pairs, and the data flow is implicit.
Coordinating the processes in a large-scale distributed computation is a challenge. The hardest aspect is gracefully handling partial failure (when you don't know if a remote process has failed or not) and still making progress with the overall computation. MapReduce spares the programmer from having to think about failure, since the implementation detects failed map or reduce tasks and reschedules replacements on machines that are healthy. MapReduce is able to do this since it is a shared-nothing architecture, meaning that tasks have no dependence on one another. (This is a slight oversimplification, since the output from mappers is fed to the reducers, but this is under the control of the MapReduce system; in this case, it needs to take more care rerunning a failed reducer than rerunning a failed map, since it has to make sure it can retrieve the necessary map outputs, and if not, regenerate them by running the relevant maps again.) So from the programmer's point of view, the order in which the tasks run doesn't matter. By contrast, MPI programs have to explicitly manage their own checkpointing and recovery, which gives more control to the programmer, but makes them more difficult to write.
MapReduce might sound like quite a restrictive programming model, and in a sense it is: you are limited to key and value types that are related in specified ways, and mappers and reducers run with very limited coordination between one another (the mappers pass keys and values to reducers). A natural question to ask is: can you do anything useful or nontrivial with it?
The answer is yes. MapReduce was invented by engineers at Google as a system for building production search indexes because they found themselves solving the same problem over and over again (and MapReduce was inspired by older ideas from the functional programming, distributed computing, and database communities), but it has since been used for many other applications in many other industries. It is pleasantly surprising to see the range of algorithms that can be expressed in MapReduce, from image analysis, to graph-based problems, to machine learning algorithms. It can't solve every problem, of course, but it is a general data-processing tool.

You can see a sample of some of the applications that Hadoop has been used for in Chapter 16.

6. Jim Gray was an early advocate of putting the computation near the data. See "Distributed Computing Economics," March 2003, http://research.microsoft.com/apps/pubs/default.aspx?id=70001.
Volunteer Computing
When people first hear about Hadoop and MapReduce, they often ask, "How is it different from SETI@home?" SETI, the Search for Extra-Terrestrial Intelligence, runs a project called SETI@home in which volunteers donate CPU time from their otherwise idle computers to analyze radio telescope data for signs of intelligent life outside earth. SETI@home is the most well-known of many volunteer computing projects; others include the Great Internet Mersenne Prime Search (to search for large prime numbers) and Folding@home (to understand protein folding and how it relates to disease).
Volunteer computing projects work by breaking the problem they are trying to solve into chunks called work units, which are sent to computers around the world to be analyzed. For example, a SETI@home work unit is about 0.35 MB of radio telescope data, and takes hours or days to analyze on a typical home computer. When the analysis is completed, the results are sent back to the server, and the client gets another work unit. As a precaution to combat cheating, each work unit is sent to three different machines and needs at least two results to agree to be accepted.
Although SETI@home may be superficially similar to MapReduce (breaking a problem into independent pieces to be worked on in parallel), there are some significant differences. The SETI@home problem is very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world, since the time to transfer the work unit is dwarfed by the time to run the computation on it. Volunteers are donating CPU cycles, not bandwidth.
MapReduce is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects. By contrast, SETI@home runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality.
7. Apache Mahout (http://mahout.apache.org/) is a project to build machine learning libraries (such as classification and clustering algorithms) that run on Hadoop.
8. In January 2008, SETI@home was reported at http://www.planetary.org/programs/projects/setiathome/setiathome_20080115.html to be processing 300 gigabytes a day, using 320,000 computers (most of which are not dedicated to SETI@home; they are used for other things, too).
A Brief History of Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project.
The Origin of the Name “Hadoop”
The name Hadoop is not an acronym; it's a made-up name. The project's creator, Doug Cutting, explains how the name came about:

    The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term.

Subprojects and "contrib" modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme ("Pig," for example). Smaller components are given more descriptive (and therefore more mundane) names. This is a good principle, as it means you can generally work out what something does from its name. For example, the jobtracker keeps track of MapReduce jobs.
Building a web search engine from scratch was an ambitious goal, for not only is the software required to crawl and index websites complex to write, but it is also a challenge to run without a dedicated operations team, since there are so many moving parts. It's expensive, too: Mike Cafarella and Doug Cutting estimated a system supporting a 1-billion-page index would cost around half a million dollars in hardware, with a monthly running cost of $30,000. Nevertheless, they believed it was a worthy goal, as it would open up and ultimately democratize search engine algorithms.
Nutch was started in 2002, and a working crawler and search system quickly emerged. However, they realized that their architecture wouldn't scale to the billions of pages on the Web. Help was at hand with the publication of a paper in 2003 that described the architecture of Google's distributed filesystem, called GFS, which was being used in production at Google. GFS, or something like it, would solve their storage needs for the very large files generated as a part of the web crawl and indexing process. In particular, GFS would free up time being spent on administrative tasks such as managing storage nodes. In 2004, they set about writing an open source implementation, the Nutch Distributed Filesystem (NDFS).
9. In this book, we use the lowercase form, "jobtracker," to denote the entity when it's being referred to generally, and the CamelCase form JobTracker to denote the Java class that implements it.
10. Mike Cafarella and Doug Cutting, "Building Nutch: Open Source Search," ACM Queue, April 2004, http:
11. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, "The Google File System," October 2003,
In 2004, Google published the paper that introduced MapReduce to the world. Early in 2005, the Nutch developers had a working MapReduce implementation in Nutch, and by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS.
NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of search, and in February 2006 they moved out of Nutch to form an independent subproject of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale (see sidebar). This was demonstrated in February 2008 when Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.

In January 2008, Hadoop was made its own top-level project at Apache, confirming its success and its diverse, active community. By this time, Hadoop was being used by many other companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times. Some applications are covered in the case studies in Chapter 16 and on the Hadoop wiki.
In one well-publicized feat, the New York Times used Amazon's EC2 compute cloud to crunch through four terabytes of scanned archives from the paper, converting them to PDFs for the Web. The processing took less than 24 hours to run using 100 machines, and the project probably wouldn't have been embarked on without the combination of Amazon's pay-by-the-hour model (which allowed the NYT to access a large number of machines for a short period) and Hadoop's easy-to-use parallel programming model.
In April 2008, Hadoop broke a world record to become the fastest system to sort a terabyte of data. Running on a 910-node cluster, Hadoop sorted one terabyte in 209 seconds (just under 3½ minutes), beating the previous year's winner of 297 seconds (described in detail in "TeraByte Sort on Apache Hadoop" on page 601). In November of the same year, Google reported that its MapReduce implementation sorted one terabyte in 68 seconds. As the first edition of this book was going to press (May 2009), it was announced that a team at Yahoo! used Hadoop to sort one terabyte in 62 seconds.
12. Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," December 2004, http://labs.google.com/papers/mapreduce.html.
13. "Yahoo! Launches World's Largest Hadoop Production Application," 19 February 2008, http://developer
14. Derek Gottfrid, "Self-service, Prorated Super Computing Fun!" 1 November 2007, http://open.blogs
15. "Sorting 1PB with MapReduce," 21 November 2008, http://googleblog.blogspot.com/2008/11/sorting-1pb
Hadoop at Yahoo!
Building Internet-scale search engines requires huge amounts of data and therefore large numbers of machines to process it. Yahoo! Search consists of four primary components: the Crawler, which downloads pages from web servers; the WebMap, which builds a graph of the known Web; the Indexer, which builds a reverse index to the best pages; and the Runtime, which answers users' queries. The WebMap is a graph that consists of roughly 1 trillion (10^12) edges, each representing a web link, and 100 billion (10^11) nodes, each representing distinct URLs. Creating and analyzing such a large graph requires a large number of computers running for many days. In early 2005, the infrastructure for the WebMap, named Dreadnaught, needed to be redesigned to scale up to more nodes. Dreadnaught had successfully scaled from 20 to 600 nodes, but required a complete redesign to scale out further. Dreadnaught is similar to MapReduce in many ways, but provides more flexibility and less structure. In particular, each fragment in a Dreadnaught job can send output to each of the fragments in the next stage of the job, but the sort was all done in library code. In practice, most of the WebMap phases were pairs that corresponded to MapReduce. Therefore, the WebMap applications would not require extensive refactoring to fit into MapReduce.
Eric Baldeschwieler (Eric14) created a small team and we started designing and prototyping a new framework written in C++ modeled after GFS and MapReduce to replace Dreadnaught. Although the immediate need was for a new framework for WebMap, it was clear that standardization of the batch platform across Yahoo! Search was critical and by making the framework general enough to support other users, we could better leverage investment in the new platform.
At the same time, we were watching Hadoop, which was part of Nutch, and its progress. In January 2006, Yahoo! hired Doug Cutting, and a month later we decided to abandon our prototype and adopt Hadoop. The advantage of Hadoop over our prototype and design was that it was already working with a real application (Nutch) on 20 nodes. That allowed us to bring up a research cluster two months later and start helping real customers use the new framework much sooner than we could have otherwise. Another advantage, of course, was that since Hadoop was already open source, it was easier (although far from easy!) to get permission from Yahoo!'s legal department to work in open source. So we set up a 200-node cluster for the researchers in early 2006 and put the WebMap conversion plans on hold while we supported and improved Hadoop for the research users.
Here's a quick timeline of how things have progressed:

• 2004: Initial versions of what is now Hadoop Distributed Filesystem and MapReduce implemented by Doug Cutting and Mike Cafarella.
• December 2005: Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
• January 2006: Doug Cutting joins Yahoo!.
• February 2006: Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.
• February 2006: Adoption of Hadoop by Yahoo! Grid team.
• April 2006: Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
• May 2006: Yahoo! set up a Hadoop research cluster of 300 nodes.
• May 2006: Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark).
• October 2006: Research cluster reaches 600 nodes.
• December 2006: Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
• January 2007: Research cluster reaches 900 nodes.
• April 2007: Research clusters: 2 clusters of 1,000 nodes.
• April 2008: Won the 1 terabyte sort benchmark in 209 seconds on 900 nodes.
• October 2008: Loading 10 terabytes of data per day onto research clusters.
• March 2009: 17 clusters with a total of 24,000 nodes.
• April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100 terabyte sort in 173 minutes (on 3,400 nodes).

–Owen O'Malley
Apache Hadoop and the Hadoop Ecosystem
Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS, renamed from NDFS), the term is also used for a family of related projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing.

All of the core projects covered in this book are hosted by the Apache Software Foundation, which provides support for a community of open source software projects, including the original HTTP Server from which it gets its name. As the Hadoop ecosystem grows, more projects are appearing, not necessarily hosted at Apache, which provide complementary services to Hadoop, or build on the core to add higher-level abstractions.
The Hadoop projects that are covered in this book are described briefly here:

Common
    A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
Avro
    A serialization system for efficient, cross-language RPC, and persistent data storage.
MapReduce
    A distributed data processing model and execution environment that runs on large clusters of commodity machines.
HDFS
    A distributed filesystem that runs on large clusters of commodity machines.
Pig
    A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
Hive
    A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (and which is translated by the runtime engine to MapReduce jobs) for querying the data.
HBase
    A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
ZooKeeper
    A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
Sqoop
    A tool for efficiently moving data between relational databases and HDFS.
Hadoop Releases
Which version of Hadoop should you use? The answer to this question changes over time, of course, and also depends on the features that you need. "Hadoop Releases" on page 13 summarizes the high-level features in recent Hadoop release series.

There are a few active release series. The 1.x release series is a continuation of the 0.20 release series, and contains the most stable versions of Hadoop currently available. This series includes secure Kerberos authentication, which prevents unauthorized access to Hadoop data (see "Security" on page 323). Almost all production clusters use these releases, or derived versions (such as commercial distributions).
The 0.22 and 0.23 release series are currently marked as alpha releases (as of early 2012), but this is likely to change by the time you read this as they get more real-world testing and become more stable (consult the Apache Hadoop releases page for the latest status). 0.23 includes several major new features:
• A new MapReduce runtime, called MapReduce 2, implemented on a new system called YARN (Yet Another Resource Negotiator), which is a general resource management system for running distributed applications. MapReduce 2 replaces the "classic" runtime in previous releases. It is described in more depth in "YARN (MapReduce 2)" on page 194.
• HDFS federation, which partitions the HDFS namespace across multiple namenodes to support clusters with very large numbers of files. See "HDFS Federation" on page 49.
• HDFS high-availability, which removes the namenode as a single point of failure by supporting standby namenodes for failover. See "HDFS High-Availability" on page 50.

16. The numbering will be updated to reflect the fact that they are later versions than 1.x further into their release cycles.
Table 1-2. Features Supported by Hadoop Release Series
Feature 1.x 0.22 0.23
Secure authentication Yes No Yes
Old configuration names Yes Deprecated Deprecated
New configuration names No Yes Yes
Old MapReduce API Yes Deprecated Deprecated
New MapReduce API Partial Yes Yes
MapReduce 1 runtime (Classic) Yes Yes No
MapReduce 2 runtime (YARN) No No Yes
HDFS federation No No Yes
HDFS high-availability No No Planned
Table 1-2 only covers features in HDFS and MapReduce. Other projects in the Hadoop ecosystem are continually evolving too, and picking a combination of components that work well together can be a challenge. Thankfully, you don't have to do this work yourself. The Apache Bigtop project (http://incubator.apache.org/bigtop/) runs interoperability tests on stacks of Hadoop components, and provides binary packages (RPMs and Debian packages) for easy installation. There are also commercial vendors offering Hadoop distributions containing suites of compatible components.
What’s Covered in this Book
This book covers all the releases in Table 1-2. In the cases where a feature is only available in a particular release, it is noted in the text.

The code in this book is written to work against all these release series, except in a small number of cases, which are explicitly called out. The example code available on the website has a list of the versions that it was tested against.
Configuration Names
Configuration property names have been changed in the releases after 1.x, in order to give them a more regular naming structure. For example, the HDFS properties pertaining to the namenode have been changed to have a dfs.namenode prefix, so dfs.name.dir has changed to dfs.namenode.name.dir. Similarly, MapReduce properties have the mapreduce prefix, rather than the older mapred prefix, so mapred.job.name has changed to mapreduce.job.name.
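For instance, a minimal hdfs-site.xml fragment using the new-style property name might look like this (the directory path shown is purely illustrative, not a default):

```xml
<?xml version="1.0"?>
<!-- hdfs-site.xml fragment. On a 1.x release you would set the
     deprecated dfs.name.dir property instead; the path below is
     an illustrative example, not a shipped default. -->
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/var/hadoop/namenode</value>
  </property>
</configuration>
```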
For properties that exist in version 1.x, the old (deprecated) names are used in this book, since they will work in all the versions of Hadoop listed here. If you are using a release after 1.x, you may wish to use the new property names in your configuration files and code to remove deprecation warnings. A table listing the deprecated property names and their replacements can be found on the Hadoop website at http://hadoop
MapReduce APIs
Hadoop provides two Java MapReduce APIs, described in more detail in "The old and the new Java MapReduce APIs" on page 27. This edition of the book uses the new API, which will work with all versions listed here, except in a few cases where that part of the new API is not available in the 1.x releases. In these cases the equivalent code using the old API is available on the book's website.
Compatibility

When moving from one release to another you need to consider the upgrade steps that are needed. There are several aspects to consider: API compatibility, data compatibility, and wire compatibility.

API compatibility concerns the contract between user code and the published Hadoop APIs, such as the Java MapReduce APIs. Major releases (e.g. from 1.x.y to 2.0.0) are allowed to break API compatibility, so user programs may need to be modified and recompiled. Minor releases (e.g. from 1.0.x to 1.1.0) and point releases (e.g. from 1.0.1 to 1.0.2) should not break compatibility.
Hadoop uses a classification scheme for API elements to denote their stability. The above rules for API compatibility cover those elements that are marked InterfaceStability.Stable. Some elements of the public Hadoop APIs, however, are marked with the InterfaceStability.Evolving or InterfaceStability.Unstable annotations (all these annotations are in the org.apache.hadoop.classification package), which means they are allowed to break compatibility on minor and point releases, respectively.
17. Pre-1.0 releases follow the rules for major releases, so a change in version from 0.1.0 to 0.2.0 (say) constitutes a major release, and therefore may break API compatibility.
Data compatibility concerns persistent data and metadata formats, such as the format in which the HDFS namenode stores its persistent data. The formats can change across minor or major releases, but the change is transparent to users since the upgrade will automatically migrate the data. There may be some restrictions about upgrade paths, and these are covered in the release notes; for example, it may be necessary to upgrade via an intermediate release rather than upgrading directly to the later final release in one step. Hadoop upgrades are discussed in more detail in "Upgrades" on page 360.
Wire compatibility concerns the interoperability between clients and servers via wire protocols like RPC and HTTP. There are two types of client: external clients (run by users) and internal clients (run on the cluster as a part of the system, e.g. datanode and tasktracker daemons). In general, internal clients have to be upgraded in lockstep; an older version of a tasktracker will not work with a newer jobtracker, for example. In the future rolling upgrades may be supported, which would allow cluster daemons to be upgraded in phases, so that the cluster would still be available to external clients during the upgrade.

For external clients that are run by the user, like a program that reads or writes from HDFS, or the MapReduce job submission client, the client must have the same major release number as the server, but is allowed to have a lower minor or point release number (e.g. client version 1.0.1 will work with server 1.0.2 or 1.1.0, but not with server 2.0.0). Any exception to this rule should be called out in the release notes.
MapReduce is a programming model for data processing. The model is simple, yet not too simple to express useful programs in. Hadoop can run MapReduce programs written in various languages; in this chapter, we shall look at the same program expressed in Java, Ruby, Python, and C++. Most important, MapReduce programs are inherently parallel, thus putting very large-scale data analysis into the hands of anyone with enough machines at their disposal. MapReduce comes into its own for large datasets, so let's start by looking at one.
A Weather Dataset
For our example, we will write a program that mines weather data. Weather sensors collecting data every hour at many locations across the globe gather a large volume of log data, which is a good candidate for analysis with MapReduce, since it is semi-structured and record-oriented.
Data Format
The data we will use is from the National Climatic Data Center (NCDC, http://www.ncdc.noaa.gov/). The data is stored using a line-oriented ASCII format, in which each line is a record. The format supports a rich set of meteorological elements, many of which are optional or with variable data lengths. For simplicity, we shall focus on the basic elements, such as temperature, which are always present and are of fixed width.

Example 2-1 shows a sample line with some of the salient fields highlighted. The line has been split into multiple lines to show each field; in the real file, fields are packed into one line with no delimiters.
Example 2-1. Format of a National Climatic Data Center record
332130 # USAF weather station identifier
99999 # WBAN weather station identifier
19500101 # observation date
0300 # observation time
+51317 # latitude (degrees x 1000)
+028783 # longitude (degrees x 1000)
+0171 # elevation (meters)
320 # wind direction (degrees)
1 # quality code
00450 # sky ceiling height (meters)
1 # quality code
010000 # visibility distance (meters)
1 # quality code
-0128 # air temperature (degrees Celsius x 10)
1 # quality code
-0139 # dew point temperature (degrees Celsius x 10)
1 # quality code
10268 # atmospheric pressure (hectopascals x 10)
1 # quality code
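To make the fixed-width positions concrete, the following sketch (ours, not from the book's example code) extracts the year, air temperature, and quality code by character offset, matching the substring positions used later in this chapter. The sample line is synthetic filler built to the right width, not a real NCDC record:

```java
// Illustrative sketch of fixed-width field extraction: the year lives at
// offsets 15-18, the signed air temperature at 87-91, and the quality
// code at 92 (zero-based, matching the Java mapper in this chapter).
public class NcdcLine {

    // Build a synthetic line of the right width: filler zeros with the
    // year, temperature, and quality code placed at their offsets.
    static String sampleLine() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 93; i++) sb.append('0');
        sb.replace(15, 19, "1950");  // observation year
        sb.replace(87, 92, "-0128"); // signed temperature, tenths of a degree
        sb.replace(92, 93, "1");     // quality code
        return sb.toString();
    }

    static String year(String line) {
        return line.substring(15, 19);
    }

    static int airTemperature(String line) {
        if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
            return Integer.parseInt(line.substring(88, 92));
        }
        return Integer.parseInt(line.substring(87, 92));
    }

    static String quality(String line) {
        return line.substring(92, 93);
    }

    public static void main(String[] args) {
        String line = sampleLine();
        System.out.println(year(line));           // 1950
        System.out.println(airTemperature(line)); // -128
        System.out.println(quality(line));        // 1
    }
}
```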
Data files are organized by date and weather station. There is a directory for each year from 1901 to 2001, each containing a gzipped file for each weather station with its readings for that year. For example, here are the first entries for 1990:

% ls raw/1990 | head

Since there are tens of thousands of weather stations, the whole dataset is made up of a large number of relatively small files. It's generally easier and more efficient to process a smaller number of relatively large files, so the data was preprocessed so that each year's readings were concatenated into a single file. (The means by which this was carried out is described in Appendix C.)
Analyzing the Data with Unix Tools
What's the highest recorded global temperature for each year in the dataset? We will answer this first without using Hadoop, as this information will provide a performance baseline, as well as a useful means to check our results.

The classic tool for processing line-oriented data is awk. Example 2-2 is a small script to calculate the maximum temperature for each year.
Example 2-2. A program for finding the maximum recorded temperature by year from NCDC weather records

#!/usr/bin/env bash
for year in all/*
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp !=9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done
The script loops through the compressed year files, first printing the year, and then processing each file using awk. The awk script extracts two fields from the data: the air temperature and the quality code. The air temperature value is turned into an integer by adding 0. Next, a test is applied to see if the temperature is valid (the value 9999 signifies a missing value in the NCDC dataset) and if the quality code indicates that the reading is not suspect or erroneous. If the reading is OK, the value is compared with the maximum value seen so far, which is updated if a new maximum is found. The END block is executed after all the lines in the file have been processed, and it prints the maximum value.
Here is the beginning of a run:
% ./max_temperature.sh
1901 317
1902 244
1903 289
1904 256
1905 283
The temperature values in the source file are scaled by a factor of 10, so this works out as a maximum temperature of 31.7°C for 1901 (there were very few readings at the beginning of the century, so this is plausible). The complete run for the century took 42 minutes in one run on a single EC2 High-CPU Extra Large Instance.
To speed up the processing, we need to run parts of the program in parallel. In theory, this is straightforward: we could process different years in different processes, using all the available hardware threads on a machine. There are a few problems with this, however.

First, dividing the work into equal-size pieces isn't always easy or obvious. In this case, the file size for different years varies widely, so some processes will finish much earlier than others. Even if they pick up further work, the whole run is dominated by the longest file. A better approach, although one that requires more work, is to split the input into fixed-size chunks and assign each chunk to a process.

Second, combining the results from independent processes may need further processing. In this case, the result for each year is independent of other years and may be combined by concatenating all the results, and sorting by year. If using the fixed-size chunk approach, the combination is more delicate. For this example, data for a particular year will typically be split into several chunks, each processed independently. We'll end up with the maximum temperature for each chunk, so the final step is to look for the highest of these maximums for each year.
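The per-chunk combination step can be sketched in plain Java (this illustrates the idea and is not Hadoop code): each chunk contributes a map of year to maximum, and the final step keeps the larger value for each year:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: merge per-chunk (year -> max temperature) results
// by taking the highest value seen for each year across all chunks.
public class ChunkMerge {

    public static Map<String, Integer> merge(List<Map<String, Integer>> chunks) {
        Map<String, Integer> result = new HashMap<>();
        for (Map<String, Integer> chunk : chunks) {
            for (Map.Entry<String, Integer> e : chunk.entrySet()) {
                result.merge(e.getKey(), e.getValue(), Math::max);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> chunk1 = new HashMap<>();
        chunk1.put("1950", 22);
        Map<String, Integer> chunk2 = new HashMap<>();
        chunk2.put("1950", 0);
        chunk2.put("1949", 111);
        // merged maxima: 1949 -> 111, 1950 -> 22
        System.out.println(merge(List.of(chunk1, chunk2)));
    }
}
```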
Third, you are still limited by the processing capacity of a single machine. If the best time you can achieve is 20 minutes with the number of processors you have, then that's it. You can't make it go faster. Also, some datasets grow beyond the capacity of a single machine. When we start using multiple machines, a whole host of other factors come into play, mainly falling in the category of coordination and reliability. Who runs the overall job? How do we deal with failed processes?

So, though it's feasible to parallelize the processing, in practice it's messy. Using a framework like Hadoop to take care of these issues is a great help.
Analyzing the Data with Hadoop
To take advantage of the parallel processing that Hadoop provides, we need to express our query as a MapReduce job. After some local, small-scale testing, we will be able to run it on a cluster of machines.
Map and Reduce
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase. Each phase has key-value pairs as input and output, the types of which may be chosen by the programmer. The programmer also specifies two functions: the map function and the reduce function.

The input to our map phase is the raw NCDC data. We choose a text input format that gives us each line in the dataset as a text value. The key is the offset of the beginning of the line from the beginning of the file, but as we have no need for this, we ignore it.
Our map function is simple. We pull out the year and the air temperature, since these are the only fields we are interested in. In this case, the map function is just a data preparation phase, setting up the data in such a way that the reducer function can do its work on it: finding the maximum temperature for each year. The map function is also a good place to drop bad records: here we filter out temperatures that are missing, suspect, or erroneous.

To visualize the way the map works, consider the following sample lines of input data (some unused columns have been dropped to fit the page, indicated by ellipses):

0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...

These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
The keys are the line offsets within the file, which we ignore in our map function. The map function merely extracts the year and the air temperature (indicated in bold text), and emits them as its output (the temperature values have been interpreted as integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key. So, continuing the example, our reduce function sees the following input:

(1949, [111, 78])
(1950, [0, 22, −11])

Each year appears with a list of all its air temperature readings. All the reduce function has to do now is iterate through the list and pick up the maximum reading:

(1949, 111)
(1950, 22)

This is the final output: the maximum global temperature recorded in each year.

The whole data flow is illustrated in Figure 2-1. At the bottom of the diagram is a Unix pipeline, which mimics the whole MapReduce flow, and which we will see again later in the chapter when we look at Hadoop Streaming.
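The map → sort/group → reduce flow just described can be mimicked in a few lines of plain Java. This toy model (our own, using no Hadoop classes) processes the five sample pairs exactly as sketched above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A toy, in-memory model of the flow above: map emits (year, temperature)
// pairs, the "framework" groups them by key, and reduce picks the maximum.
public class ToyMapReduce {

    public static Map<String, Integer> run(List<String[]> mapOutput) {
        // "Shuffle": group the emitted values by key, sorted by key
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] pair : mapOutput) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(pair[1]));
        }
        // "Reduce": the maximum of each year's readings
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int max = Integer.MIN_VALUE;
            for (int v : e.getValue()) max = Math.max(max, v);
            result.put(e.getKey(), max);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String[]> emitted = new ArrayList<>();
        emitted.add(new String[] {"1950", "0"});
        emitted.add(new String[] {"1950", "22"});
        emitted.add(new String[] {"1950", "-11"});
        emitted.add(new String[] {"1949", "111"});
        emitted.add(new String[] {"1949", "78"});
        System.out.println(run(emitted)); // {1949=111, 1950=22}
    }
}
```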
Java MapReduce
Having run through how the MapReduce program works, the next step is to express it in code. We need three things: a map function, a reduce function, and some code to run the job. The map function is represented by the Mapper class, which declares an abstract map() method. Example 2-3 shows the implementation of our map method.
Example 2-3. Mapper for maximum temperature example
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
  extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}
The Mapper class is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function. For the present example, the input key is a long integer offset, the input value is a line of text, the output key is a year, and the output value is an air temperature (an integer). Rather than use built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization. These are found in the org.apache.hadoop.io package. Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and IntWritable (like Java Integer).

Figure 2-1. MapReduce logical data flow
The map() method is passed a key and a value. We convert the Text value containing the line of input into a Java String, then use its substring() method to extract the columns we are interested in.

The map() method also provides an instance of Context to write the output to. In this case, we write the year as a Text object (since we are just using it as a key), and the temperature is wrapped in an IntWritable. We write an output record only if the temperature is present and the quality code indicates the temperature reading is OK.

The reduce function is similarly defined using a Reducer, as illustrated in Example 2-4.
Example 2-4. Reducer for maximum temperature example
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
  extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values,
      Context context)
      throws IOException, InterruptedException {

    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}
Again, four formal type parameters are used to specify the input and output types, this time for the reduce function. The input types of the reduce function must match the output types of the map function: Text and IntWritable. And in this case, the output types of the reduce function are Text and IntWritable, for a year and its maximum temperature, which we find by iterating through the temperatures and comparing each with a record of the highest found so far.

The third piece of code runs the MapReduce job (see Example 2-5).
Example 2-5. Application to find the maximum temperature in the weather dataset
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
A Job object forms the specification of the job. It gives you control over how the job is run. When we run this job on a Hadoop cluster, we will package the code into a JAR file (which Hadoop will distribute around the cluster). Rather than explicitly specify the name of the JAR file, we can pass a class in the Job's setJarByClass() method, which Hadoop will use to locate the relevant JAR file by looking for the JAR file containing this class.

Having constructed a Job object, we specify the input and output paths. An input path is specified by calling the static addInputPath() method on FileInputFormat, and it can be a single file, a directory (in which case, the input forms all the files in that directory), or a file pattern. As the name suggests, addInputPath() can be called more than once to use input from multiple paths.
The output path (of which there is only one) is specified by the static setOutputPath() method on FileOutputFormat. It specifies a directory where the output files from the reducer functions are written. The directory shouldn't exist before running the job, as Hadoop will complain and not run the job. This precaution is to prevent data loss (it can be very annoying to accidentally overwrite the output of a long job with the output of another).

Next, we specify the map and reduce types to use via the setMapperClass() and setReducerClass() methods.

The setOutputKeyClass() and setOutputValueClass() methods control the output types for the map and the reduce functions, which are often the same, as they are in our case. If they are different, then the map output types can be set using the methods setMapOutputKeyClass() and setMapOutputValueClass().
The input types are controlled via the input format, which we have not explicitly set since we are using the default TextInputFormat.

After setting the classes that define the map and reduce functions, we are ready to run the job. The waitForCompletion() method on Job submits the job and waits for it to finish. The method's boolean argument is a verbose flag, so in this case the job writes information about its progress to the console.

The return value of the waitForCompletion() method is a boolean indicating success (true) or failure (false), which we translate into the program's exit code of 0 or 1.
A test run
After writing a MapReduce job, it's normal to try it out on a small dataset to flush out any immediate problems with the code. First install Hadoop in standalone mode; there are instructions for how to do this in Appendix A. This is the mode in which Hadoop runs using the local filesystem with a local job runner. Then install and compile the examples using the instructions on the book's website.

Let's test it on the five-line sample discussed earlier (the output has been slightly reformatted to fit the page):

% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop MaxTemperature input/ncdc/sample.txt output
11/09/15 21:35:14 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobT
racker, sessionId=
11/09/15 21:35:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library fo
r your platform... using builtin-java classes where applicable
11/09/15 21:35:14 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for parsing t
he arguments. Applications should implement Tool for the same.
11/09/15 21:35:14 INFO input.FileInputFormat: Total input paths to process : 1
11/09/15 21:35:14 WARN snappy.LoadSnappy: Snappy native library not loaded
11/09/15 21:35:14 INFO mapreduce.JobSubmitter: number of splits:1
11/09/15 21:35:15 INFO mapreduce.Job: Running job: job_local_0001
11/09/15 21:35:15 INFO mapred.LocalJobRunner: Waiting for map tasks
11/09/15 21:35:15 INFO mapred.LocalJobRunner: Starting task: attempt_local_0001_m_000
11/09/15 21:35:15 INFO mapred.Task: Using ResourceCalculatorPlugin : null
11/09/15 21:35:15 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
11/09/15 21:35:15 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
11/09/15 21:35:15 INFO mapred.MapTask: soft limit at 83886080
11/09/15 21:35:15 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
11/09/15 21:35:15 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
11/09/15 21:35:15 INFO mapred.LocalJobRunner:
11/09/15 21:35:15 INFO mapred.MapTask: Starting flush of map output
11/09/15 21:35:15 INFO mapred.MapTask: Spilling map output
11/09/15 21:35:15 INFO mapred.MapTask: bufstart = 0; bufend = 45; bufvoid = 104857600
11/09/15 21:35:15 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 2621438
0(104857520); length = 17/6553600
11/09/15 21:35:15 INFO mapred.MapTask: Finished spill 0
11/09/15 21:35:15 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And i
s in the process of commiting
11/09/15 21:35:15 INFO mapred.LocalJobRunner: map
11/09/15 21:35:15 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
11/09/15 21:35:15 INFO mapred.LocalJobRunner: Finishing task: attempt_local_0001_m_00
11/09/15 21:35:15 INFO mapred.LocalJobRunner: Map task executor complete.
11/09/15 21:35:15 INFO mapred.Task: Using ResourceCalculatorPlugin : null
11/09/15 21:35:15 INFO mapred.Merger: Merging 1 sorted segments
11/09/15 21:35:15 INFO mapred.Merger: Down to the last merge-pass, with 1 segments le
ft of total size: 50 bytes
11/09/15 21:35:15 INFO mapred.LocalJobRunner:
11/09/15 21:35:15 WARN conf.Configuration: mapred.skip.on is deprecated. Instead, use
11/09/15 21:35:15 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And i
s in the process of commiting
11/09/15 21:35:15 INFO mapred.LocalJobRunner:
11/09/15 21:35:15 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to
commit now
11/09/15 21:35:15 INFO output.FileOutputCommitter: Saved output of task 'attempt_loca
l_0001_r_000000_0' to file:/Users/tom/workspace/hadoop-book/output
11/09/15 21:35:15 INFO mapred.LocalJobRunner: reduce > reduce
11/09/15 21:35:15 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
11/09/15 21:35:16 INFO mapreduce.Job: map 100% reduce 100%
11/09/15 21:35:16 INFO mapreduce.Job: Job job_local_0001 completed successfully
11/09/15 21:35:16 INFO mapreduce.Job: Counters: 24
File System Counters
FILE: Number of bytes read=255967
FILE: Number of bytes written=397273
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=5
Map output records=5
Map output bytes=45
Map output materialized bytes=61
Input split bytes=124
Combine input records=0
Combine output records=0
Reduce input groups=2
Reduce shuffle bytes=0
Reduce input records=5
Reduce output records=2
Spilled Records=10
Shuffled Maps =0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=10
Total committed heap usage (bytes)=379723776
File Input Format Counters
Bytes Read=529
File Output Format Counters
Bytes Written=29
When the hadoop command is invoked with a classname as the first argument, it launches a JVM to run the class. It is more convenient to use hadoop than straight java since the former adds the Hadoop libraries (and their dependencies) to the classpath and picks up the Hadoop configuration, too. To add the application classes to the classpath, we've defined an environment variable called HADOOP_CLASSPATH, which the hadoop script picks up.

When running in local (standalone) mode, the programs in this book all assume that you have set the HADOOP_CLASSPATH in this way. The commands should be run from the directory that the example code is installed in.

The output from running the job provides some useful information. For example, we can see that the job was given an ID of job_local_0001, and it ran one map task and one reduce task (with the IDs attempt_local_0001_m_000000_0 and attempt_local_0001_r_000000_0). Knowing the job and task IDs can be very useful when debugging MapReduce jobs.
The last section of the output, titled "Counters," shows the statistics that Hadoop generates for each job it runs. These are very useful for checking whether the amount of data processed is what you expected. For example, we can follow the number of records that went through the system: five map inputs produced five map outputs, then five reduce inputs in two groups produced two reduce outputs.

The output was written to the output directory, which contains one output file per reducer. The job had a single reducer, so we find a single file, named part-r-00000:

% cat output/part-r-00000
1949	111
1950	22

This result is the same as when we went through it by hand earlier. We interpret this as saying that the maximum temperature recorded in 1949 was 11.1°C, and in 1950 it was 2.2°C.
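The division by ten can be expressed directly. This one-liner (ours, not from the book's examples) converts a stored reading in tenths of a degree back to degrees Celsius:

```java
// Illustrative only: NCDC temperatures are stored in tenths of a degree
// Celsius, so converting a job's output back to degrees is one division.
public class TenthsToCelsius {

    public static double toCelsius(int tenths) {
        return tenths / 10.0;
    }

    public static void main(String[] args) {
        System.out.println(toCelsius(111)); // 11.1
        System.out.println(toCelsius(22));  // 2.2
    }
}
```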
The old and the new Java MapReduce APIs
The Java MapReduce API used in the previous section was first released in Hadoop 0.20.0. This new API, sometimes referred to as "Context Objects," was designed to make the API easier to evolve in the future. It is type-incompatible with the old, however, so applications need to be rewritten to take advantage of it.

The new API is not complete in the 1.x (formerly 0.20) release series, so the old API is recommended for these releases, despite having been marked as deprecated in the early 0.20 releases. (Understandably, this recommendation caused a lot of confusion, so the deprecation warning was removed from later releases in that series.)

Previous editions of this book were based on 0.20 releases, and used the old API throughout (although the new API was covered, the code invariably used the old API). In this edition the new API is used as the primary API, except where mentioned. However, should you wish to use the old API, you can, since the code for all the examples in this book is available for the old API on the book's website.
There are several notable differences between the two APIs:

• The new API favors abstract classes over interfaces, since these are easier to evolve. For example, you can add a method (with a default implementation) to an abstract class without breaking old implementations of the class. For example, the Mapper and Reducer interfaces in the old API are abstract classes in the new API.
• The new API is in the org.apache.hadoop.mapreduce package (and subpackages). The old API can still be found in org.apache.hadoop.mapred.
• The new API makes extensive use of context objects that allow the user code to communicate with the MapReduce system. The new Context, for example, essentially unifies the role of the JobConf, the OutputCollector, and the Reporter from the old API.
• In both APIs, key-value record pairs are pushed to the mapper and reducer, but in addition, the new API allows both mappers and reducers to control the execution flow by overriding the run() method. For example, records can be processed in batches, or the execution can be terminated before all the records have been processed. In the old API this is possible for mappers by writing a MapRunnable, but no equivalent exists for reducers.
• Configuration has been unified. The old API has a special JobConf object for job configuration, which is an extension of Hadoop's vanilla Configuration object (used for configuring daemons; see "The Configuration API" on page 146). In the new API, this distinction is dropped, so job configuration is done through a Configuration.
• Job control is performed through the Job class in the new API, rather than the old JobClient, which no longer exists in the new API.
1. See Table 1-2 for levels of support of the old and new APIs by Hadoop release.
2. Technically, such a change would almost certainly break implementations that already define a method with the same signature as the new one, but as the article at http://wiki.eclipse.org/Evolving_Java-based_APIs#Example_1_-_Adding_an_API_method explains, for all practical purposes this is treated as a compatible change.
• Output files are named slightly differently: in the old API both map and reduce outputs are named part-nnnnn, while in the new API map outputs are named part-m-nnnnn, and reduce outputs are named part-r-nnnnn (where nnnnn is an integer designating the part number, starting from zero).
• User-overridable methods in the new API are declared to throw java.lang.InterruptedException. What this means is that you can write your code to be responsive to interrupts so that the framework can gracefully cancel long-running operations if it needs to.
• In the new API the reduce() method passes values as a java.lang.Iterable, rather than a java.lang.Iterator (as the old API does). This change makes it easier to iterate over the values using Java's for-each loop construct:

for (VALUEIN value : values) { ... }
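The last difference can be seen side by side in plain Java, away from Hadoop: the same maximum-finding loop written against an Iterator (old style) and an Iterable (new style, with for-each):

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Sketch of the difference noted above: the old API hands reduce() an
// Iterator, while the new API hands it an Iterable, permitting for-each.
public class IterationStyles {

    // Old-API style: explicit Iterator
    public static int maxOld(Iterator<Integer> values) {
        int maxValue = Integer.MIN_VALUE;
        while (values.hasNext()) {
            maxValue = Math.max(maxValue, values.next());
        }
        return maxValue;
    }

    // New-API style: Iterable allows the for-each loop
    public static int maxNew(Iterable<Integer> values) {
        int maxValue = Integer.MIN_VALUE;
        for (int value : values) {
            maxValue = Math.max(maxValue, value);
        }
        return maxValue;
    }

    public static void main(String[] args) {
        List<Integer> temps = Arrays.asList(111, 78);
        System.out.println(maxOld(temps.iterator())); // 111
        System.out.println(maxNew(temps));            // 111
    }
}
```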
Example 2-6 shows the MaxTemperature application rewritten to use the old API. The differences are highlighted in bold.

When converting your Mapper and Reducer classes to the new API, don't forget to change the signature of the map() and reduce() methods to the new form. Just changing your class to extend the new Mapper or Reducer classes will not produce a compilation error or warning, since these classes provide an identity form of the map() or reduce() method (respectively). Your mapper or reducer code, however, will not be invoked, which can lead to some hard-to-diagnose errors.
Example 2-6. Application to find the maximum temperature, using the old MapReduce API
public class OldMaxTemperature {

  static class OldMaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {

      String line = value.toString();
      String year = line.substring(15, 19);
      int airTemperature;
      if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
        airTemperature = Integer.parseInt(line.substring(88, 92));
      } else {
        airTemperature = Integer.parseInt(line.substring(87, 92));
      }
      String quality = line.substring(92, 93);
      if (airTemperature != MISSING && quality.matches("[01459]")) {
        output.collect(new Text(year), new IntWritable(airTemperature));
      }
    }
  }

  static class OldMaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {

      int maxValue = Integer.MIN_VALUE;
      while (values.hasNext()) {
        maxValue = Math.max(maxValue, values.next().get());
      }
      output.collect(key, new IntWritable(maxValue));
    }
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: OldMaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    JobConf conf = new JobConf(OldMaxTemperature.class);
    conf.setJobName("Max temperature");

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(OldMaxTemperatureMapper.class);
    conf.setReducerClass(OldMaxTemperatureReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}

3. "Dealing with InterruptedException" by Brian Goetz explains this technique in detail.
Scaling Out
You've seen how MapReduce works for small inputs; now it's time to take a bird's-eye view of the system and look at the data flow for large inputs. For simplicity, the examples so far have used files on the local filesystem. However, to scale out, we need to store the data in a distributed filesystem, typically HDFS (which you'll learn about in the next chapter), to allow Hadoop to move the MapReduce computation to each machine hosting a part of the data. Let's see how this works.
Data Flow
First, some terminology. A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into tasks, of which there are two types: map tasks and reduce tasks.

There are two types of nodes that control the job execution process: a jobtracker and a number of tasktrackers. The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers. Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker.

Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
Having many splits means the time taken to piocess each split is small compaieu to the
time to piocess the whole input. So il we aie piocessing the splits in paiallel, the pio-
cessing is Lettei loau-Lalanceu il the splits aie small, since a lastei machine will Le aLle
to piocess piopoitionally moie splits ovei the couise ol the joL than a slowei machine.
Even il the machines aie iuentical, laileu piocesses oi othei joLs iunning concuiiently
make loau Lalancing uesiiaLle, anu the guality ol the loau Lalancing incieases as the
splits Lecome moie line-giaineu.
On the othei hanu, il splits aie too small, then the oveiheau ol managing the splits anu
ol map task cieation Legins to uominate the total joL execution time. Foi most joLs, a
goou split size tenus to Le the size ol an HDFS Llock, 6+ MB Ly uelault, although this
can Le changeu loi the clustei (loi all newly cieateu liles), oi specilieu when each lile
is cieateu.
Hadoop does its best to run the map task on a node where the input data resides in
HDFS. This is called the data locality optimization because it doesn't use valuable cluster
bandwidth. Sometimes, however, all three nodes hosting the HDFS block replicas for
a map task's input split are running other map tasks, so the job scheduler will look for
a free map slot on a node in the same rack as one of the blocks. Very occasionally even
this is not possible, so an off-rack node is used, which results in an inter-rack network
transfer. The three possibilities are illustrated in Figure 2-2.
Figure 2-2. Data-local (a), rack-local (b), and off-rack (c) map tasks
It should now be clear why the optimal split size is the same as the block size: it is the
largest size of input that can be guaranteed to be stored on a single node. If the split
spanned two blocks, it would be unlikely that any HDFS node stored both blocks, so
some of the split would have to be transferred across the network to the node running
the map task, which is clearly less efficient than running the whole map task using local
data.
Map tasks write their output to the local disk, not to HDFS. Why is this? Map output
is intermediate output: it's processed by reduce tasks to produce the final output, and
once the job is complete the map output can be thrown away. So storing it in HDFS,
with replication, would be overkill. If the node running the map task fails before the
map output has been consumed by the reduce task, then Hadoop will automatically
rerun the map task on another node to re-create the map output.
Reduce tasks don't have the advantage of data locality; the input to a single reduce
task is normally the output from all mappers. In the present example, we have a single
reduce task that is fed by all of the map tasks. Therefore, the sorted map outputs have
to be transferred across the network to the node where the reduce task is running, where
they are merged and then passed to the user-defined reduce function. The output of
the reduce is normally stored in HDFS for reliability. As explained in Chapter 3, for
each HDFS block of the reduce output, the first replica is stored on the local node, with
other replicas being stored on off-rack nodes. Thus, writing the reduce output does
consume network bandwidth, but only as much as a normal HDFS write pipeline
consumes.
The whole data flow with a single reduce task is illustrated in Figure 2-3. The dotted
boxes indicate nodes, the light arrows show data transfers on a node, and the heavy
arrows show data transfers between nodes.
Figure 2-3. MapReduce data flow with a single reduce task
The number of reduce tasks is not governed by the size of the input, but is specified
independently. In "The Default MapReduce Job" on page 225, you will see how to
choose the number of reduce tasks for a given job.
When there are multiple reducers, the map tasks partition their output, each creating
one partition for each reduce task. There can be many keys (and their associated values)
in each partition, but the records for any given key are all in a single partition. The
partitioning can be controlled by a user-defined partitioning function, but normally the
default partitioner, which buckets keys using a hash function, works very well.
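The behavior of the default partitioner can be sketched in a few lines of Python. This is an illustrative sketch only: Python's hash() stands in for Java's key.hashCode(), and the bit mask mirrors the way Hadoop's default HashPartitioner keeps the hash non-negative before taking it modulo the number of reducers.

```python
def partition(key, num_reduce_tasks):
    # Bucket a key by hash, modulo the number of reducers; the mask
    # keeps the hash non-negative, as in Hadoop's HashPartitioner.
    return (hash(key) & 0x7fffffff) % num_reduce_tasks

# Every record with a given key lands in the same partition,
# whichever map task produced it.
assert partition("1950", 4) == partition("1950", 4)
assert all(0 <= partition(k, 4) < 4 for k in ["1949", "1950", "1951"])
```

Because the partition depends only on the key, all values for one key end up at one reducer, which is exactly the guarantee the reduce function relies on.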
The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4.
This diagram makes it clear why the data flow between map and reduce tasks is collo-
quially known as "the shuffle," as each reduce task is fed by many map tasks. The
shuffle is more complicated than this diagram suggests, and tuning it can have a big
impact on job execution time, as you will see in "Shuffle and Sort" on page 205.
Figure 2-4. MapReduce data flow with multiple reduce tasks
Finally, it's also possible to have zero reduce tasks. This can be appropriate when you
don't need the shuffle because the processing can be carried out entirely in parallel (a few
examples are discussed in "NLineInputFormat" on page 245). In this case, the only
off-node data transfer is when the map tasks write to HDFS (see Figure 2-5).
Combiner Functions
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays
to minimize the data transferred between map and reduce tasks. Hadoop allows the
user to specify a combiner function to be run on the map output; the combiner func-
tion's output forms the input to the reduce function. Since the combiner function is an
optimization, Hadoop does not provide a guarantee of how many times it will call it
for a particular map output record, if at all. In other words, calling the combiner func-
tion zero, one, or many times should produce the same output from the reducer.
Figure 2-5. MapReduce data flow with no reduce tasks
The contract for the combiner function constrains the type of function that may be
used. This is best illustrated with an example. Suppose that for the maximum temper-
ature example, readings for the year 1950 were processed by two maps (because they
were in different splits). Imagine the first map produced the output:
(1950, 0)
(1950, 20)
(1950, 10)
And the second produced:
(1950, 25)
(1950, 15)
The reduce function would be called with a list of all the values:
(1950, [0, 20, 10, 25, 15])
with output:
(1950, 25)
since 25 is the maximum value in the list. We could use a combiner function that, just
like the reduce function, finds the maximum temperature for each map output. The
reduce would then be called with:
(1950, [20, 25])
and the reduce would produce the same output as before. More succinctly, we may
express the function calls on the temperature values in this case as follows:
max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
Not all functions possess this property. For example, if we were calculating mean
temperatures, we couldn't use the mean as our combiner function, since:
mean(0, 20, 10, 25, 15) = 14
but:
mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
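The difference is easy to check directly. The following Python sketch applies a function the way a combiner would (once per map's output, then once more over the partial results) and compares that with applying the function to all the values at once:

```python
def with_combiner(func, split1, split2):
    # Apply func to each map's values, then once more over the
    # partial results, as a combiner followed by the reducer would.
    return func([func(split1), func(split2)])

def mean(values):
    return sum(values) / len(values)

map1 = [0, 20, 10]  # the first map's values for 1950
map2 = [25, 15]     # the second map's values for 1950

# max distributes over partial results...
assert with_combiner(max, map1, map2) == max(map1 + map2) == 25

# ...but mean does not: 15 from the per-map means versus the true 14.
assert with_combiner(mean, map1, map2) == 15
assert mean(map1 + map2) == 14
```

Any candidate combiner can be sanity-checked this way against the plain reduce over the full value list.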
The combiner function doesn't replace the reduce function. (How could it? The reduce
function is still needed to process records with the same key from different maps.) But
it can help cut down the amount of data shuffled between the maps and the reduces,
and for this reason alone it is always worth considering whether you can use a combiner
function in your MapReduce job.
Specifying a combiner function
Going back to the Java MapReduce program, the combiner function is defined using
the Reducer class, and for this application, it is the same implementation as the reducer
function in MaxTemperatureReducer. The only change we need to make is to set the
combiner class on the Job (see Example 2-7).
Example 2-7. Application to find the maximum temperature, using a combiner function for efficiency
public class MaxTemperatureWithCombiner {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperatureWithCombiner <input path> " +
          "<output path>");
      System.exit(-1);
    }
    
    Job job = new Job();
    job.setJarByClass(MaxTemperatureWithCombiner.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
4. Functions with this property are called commutative and associative. They are also sometimes referred to
as distributive, such as in the paper "Data Cube: A Relational Aggregation Operator Generalizing Group-
By, Cross-Tab, and Sub-Totals," Gray et al. (1995).
Running a Distributed MapReduce Job
The same program will run, without alteration, on a full dataset. This is the point of
MapReduce: it scales to the size of your data and the size of your hardware. Here's one
data point: on a 10-node EC2 cluster running High-CPU Extra Large Instances, the
program took six minutes to run.
We'll go through the mechanics of running programs on a cluster in Chapter 5.
Hadoop Streaming
Hadoop provides an API to MapReduce that allows you to write your map and reduce
functions in languages other than Java. Hadoop Streaming uses Unix standard streams
as the interface between Hadoop and your program, so you can use any language that
can read standard input and write to standard output to write your MapReduce
program.
Streaming is naturally suited for text processing (although, as of version 0.21.0, it can
handle binary streams, too), and when used in text mode, it has a line-oriented view of
data. Map input data is passed over standard input to your map function, which pro-
cesses it line by line and writes lines to standard output. A map output key-value pair
is written as a single tab-delimited line. Input to the reduce function is in the same
format (a tab-separated key-value pair), passed over standard input. The reduce func-
tion reads lines from standard input, which the framework guarantees are sorted by
key, and writes its results to standard output.
Let's illustrate this by rewriting our MapReduce program for finding maximum tem-
peratures by year in Streaming.
Ruby
The map function can be expressed in Ruby as shown in Example 2-8.
Example 2-8. Map function for maximum temperature in Ruby
#!/usr/bin/env ruby
STDIN.each_line do |line|
  val = line
  year, temp, q = val[15,4], val[87,5], val[92,1]
5. This is a factor of seven faster than the serial run on one machine using awk. The main reason it wasn't
proportionately faster is because the input data wasn't evenly partitioned. For convenience, the input
files were gzipped by year, resulting in large files for later years in the dataset, when the number of weather
records was much higher.
  puts "#{year}\t#{temp}" if (temp != "+9999" && q =~ /[01459]/)
end
The program iterates over lines from standard input by executing a block for each line
from STDIN (a global constant of type IO). The block pulls out the relevant fields from
each input line and, if the temperature is valid, writes the year and the temperature
separated by a tab character \t to standard output (using puts).
It's worth drawing out a design difference between Streaming and the
Java MapReduce API. The Java API is geared toward processing your
map function one record at a time. The framework calls the map()
method on your Mapper for each record in the input, whereas with
Streaming the map program can decide how to process the input; for
example, it could easily read and process multiple lines at a time since
it's in control of the reading. The user's Java map implementation is
"pushed" records, but it's still possible to consider multiple lines at a
time by accumulating previous lines in an instance variable in the
Mapper. In this case, you need to implement the close() method so that
you know when the last record has been read, so you can finish pro-
cessing the last group of lines.
Since the script just operates on standard input and output, it's trivial to test the script
without using Hadoop, simply by using Unix pipes:
% cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb
1950 +0000
1950 +0022
1950 -0011
1949 +0111
1949 +0078
The reduce function shown in Example 2-9 is a little more complex.
Example 2-9. Reduce function for maximum temperature in Ruby
#!/usr/bin/env ruby
last_key, max_val = nil, 0
STDIN.each_line do |line|
  key, val = line.split("\t")
  if last_key && last_key != key
    puts "#{last_key}\t#{max_val}"
    last_key, max_val = key, val.to_i
  else
    last_key, max_val = key, [max_val, val.to_i].max
  end
end
puts "#{last_key}\t#{max_val}" if last_key
6. Alternatively, you could use "pull"-style processing in the new MapReduce API; see "The
old and the new Java MapReduce APIs" on page 27.
Again, the program iterates over lines from standard input, but this time we have to
store some state as we process each key group. In this case, the keys are weather station
identifiers, and we store the last key seen and the maximum temperature seen so far
for that key. The MapReduce framework ensures that the keys are ordered, so we know
that if a key is different from the previous one, we have moved into a new key group.
In contrast to the Java API, where you are provided an iterator over each key group, in
Streaming you have to find key group boundaries in your program.
For each line, we pull out the key and value, then if we've just finished a group (last_key
&& last_key != key), we write the key and the maximum temperature for that group,
separated by a tab character, before resetting the maximum temperature for the new
key. If we haven't just finished a group, we just update the maximum temperature for
the current key.
The last line of the program ensures that a line is written for the last key group in the
input.
We can now simulate the whole MapReduce pipeline with a Unix pipeline (which is
equivalent to the Unix pipeline shown in Figure 2-1):
% cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb | \
sort | ch02/src/main/ruby/max_temperature_reduce.rb
1949 111
1950 22
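The last-key bookkeeping in the reduce script can also be written with a grouping iterator, which makes the key-group boundaries explicit. Here is a sketch in Python using itertools.groupby, relying on the same framework guarantee that the reducer's input is sorted by key:

```python
import itertools

def reduce_max(lines):
    # Parse tab-separated key-value lines and group consecutive lines
    # by key; sort order makes each key's records contiguous, so
    # groupby sees exactly one group per key.
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in itertools.groupby(parsed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, max(int(val) for _, val in group))

sorted_map_output = ["1949\t111\n", "1949\t78\n", "1950\t0\n", "1950\t22\n"]
for line in reduce_max(sorted_map_output):
    print(line)   # prints "1949\t111" then "1950\t22"
```

This is the same logic as the Ruby reducer above, with the group-boundary detection delegated to the standard library rather than tracked by hand.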
The output is the same as that of the Java program, so the next step is to run it using
Hadoop itself.
The hadoop command doesn't support a Streaming option; instead, you specify the
Streaming JAR file along with the jar option. Options to the Streaming program specify
the input and output paths, and the map and reduce scripts. This is what it looks like:
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/sample.txt \
-output output \
-mapper ch02/src/main/ruby/max_temperature_map.rb \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb
When running on a large dataset on a cluster, we should set the combiner, using the
-combiner option.
From release 0.21.0, the combiner can be any Streaming command. For earlier releases,
the combiner had to be written in Java, so as a workaround it was common to do manual
combining in the mapper, without having to resort to Java. In this case, we could change
the mapper to be a pipeline:
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/ncdc/all \
-output output \
-mapper "ch02/src/main/ruby/max_temperature_map.rb | sort |
ch02/src/main/ruby/max_temperature_reduce.rb" \
-reducer ch02/src/main/ruby/max_temperature_reduce.rb \
-file ch02/src/main/ruby/max_temperature_map.rb \
-file ch02/src/main/ruby/max_temperature_reduce.rb
Note also the use of -file, which we use when running Streaming programs on the
cluster to ship the scripts to the cluster.
Python
Streaming supports any programming language that can read from standard input and
write to standard output, so for readers more familiar with Python, here's the same
example again. The map script is in Example 2-10, and the reduce script is in Exam-
ple 2-11.
Example 2-10. Map function for maximum temperature in Python
#!/usr/bin/env python

import re
import sys

for line in sys.stdin:
  val = line.strip()
  (year, temp, q) = (val[15:19], val[87:92], val[92:93])
  if (temp != "+9999" and re.match("[01459]", q)):
    print "%s\t%s" % (year, temp)
Example 2-11. Reduce function for maximum temperature in Python
#!/usr/bin/env python

import sys

(last_key, max_val) = (None, 0)
for line in sys.stdin:
  (key, val) = line.strip().split("\t")
  if last_key and last_key != key:
    print "%s\t%s" % (last_key, max_val)
    (last_key, max_val) = (key, int(val))
  else:
    (last_key, max_val) = (key, max(max_val, int(val)))

if last_key:
  print "%s\t%s" % (last_key, max_val)
7. As an alternative to Streaming, Python programmers should consider Dumbo (http://www.last.fm/
dumbo), which makes the Streaming MapReduce interface more Pythonic and easier to use.
We can test the programs and run the job in the same way we did in Ruby. For example,
to run a test:
% cat input/ncdc/sample.txt | ch02/src/main/python/max_temperature_map.py | \
sort | ch02/src/main/python/max_temperature_reduce.py
1949 111
1950 22
Hadoop Pipes
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce. Unlike Stream-
ing, which uses standard input and output to communicate with the map and reduce
code, Pipes uses sockets as the channel over which the tasktracker communicates with
the process running the C++ map or reduce function. JNI is not used.
We'll rewrite the example running through the chapter in C++, and then we'll see how
to run it using Pipes. Example 2-12 shows the source code for the map and reduce
functions in C++.
Example 2-12. Maximum temperature in C++
#include <algorithm>
#include <limits>
#include <stdint.h>
#include <string>
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"
class MaxTemperatureMapper : public HadoopPipes::Mapper {
public:
  MaxTemperatureMapper(HadoopPipes::TaskContext& context) {
  }
  void map(HadoopPipes::MapContext& context) {
    std::string line = context.getInputValue();
    std::string year = line.substr(15, 4);
    std::string airTemperature = line.substr(87, 5);
    std::string q = line.substr(92, 1);
    if (airTemperature != "+9999" &&
        (q == "0" || q == "1" || q == "4" || q == "5" || q == "9")) {
      context.emit(year, airTemperature);
    }
  }
};

class MapTemperatureReducer : public HadoopPipes::Reducer {
public:
  MapTemperatureReducer(HadoopPipes::TaskContext& context) {
  }
  void reduce(HadoopPipes::ReduceContext& context) {
    int maxValue = INT_MIN;
    while (context.nextValue()) {
      maxValue = std::max(maxValue, HadoopUtils::toInt(context.getInputValue()));
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(maxValue));
  }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(HadoopPipes::TemplateFactory<MaxTemperatureMapper,
                              MapTemperatureReducer>());
}
The application links against the Hadoop C++ library, which is a thin wrapper for
communicating with the tasktracker child process. The map and reduce functions are
defined by extending the Mapper and Reducer classes defined in the HadoopPipes name-
space and providing implementations of the map() and reduce() methods in each case.
These methods take a context object (of type MapContext or ReduceContext), which
provides the means for reading input and writing output, as well as accessing job con-
figuration information via the JobConf class. The processing in this example is very
similar to the Java equivalent.
Unlike the Java interface, keys and values in the C++ interface are byte buffers, repre-
sented as Standard Template Library (STL) strings. This makes the interface simpler,
although it does put a slightly greater burden on the application developer, who has to
convert to and from richer domain-level types. This is evident in MapTempera
tureReducer, where we have to convert the input value into an integer (using a conve-
nience method in HadoopUtils) and then the maximum value back into a string before
it's written out. In some cases, we can save on doing the conversion, such as in MaxTem
peratureMapper, where the airTemperature value is never converted to an integer because
it is never processed as a number in the map() method.
The main() method is the application entry point. It calls HadoopPipes::runTask, which
connects to the Java parent process and marshals data to and from the Mapper or
Reducer. The runTask() method is passed a Factory so that it can create instances of the
Mapper or Reducer. Which one it creates is controlled by the Java parent over the socket
connection. There are overloaded template factory methods for setting a combiner,
partitioner, record reader, or record writer.
Compiling and Running
Now we can compile and link our program using the Makefile in Example 2-13.
Example 2-13. Makefile for C++ MapReduce program
CC = g++
CPPFLAGS = -m32 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include

max_temperature: max_temperature.cpp
	$(CC) $(CPPFLAGS) $< -Wall -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes \
	  -lhadooputils -lpthread -g -O2 -o $@
The Makefile expects a couple of environment variables to be set. Apart from
HADOOP_INSTALL (which you should already have set if you followed the installation
instructions in Appendix A), you need to define PLATFORM, which specifies the operating
system, architecture, and data model (e.g., 32- or 64-bit). I ran it on a 32-bit Linux
system with the following:
% export PLATFORM=Linux-i386-32
% make
On successful completion, you'll find the max_temperature executable in the current
directory.
To run a Pipes job, we need to run Hadoop in pseudo-distributed mode (where all the
daemons run on the local machine), for which there are setup instructions in Appen-
dix A. Pipes doesn't run in standalone (local) mode, because it relies on Hadoop's
distributed cache mechanism, which works only when HDFS is running.
With the Hadoop daemons now running, the first step is to copy the executable to
HDFS so that it can be picked up by tasktrackers when they launch map and reduce
tasks:
% hadoop fs -put max_temperature bin/max_temperature
The sample data also needs to be copied from the local filesystem into HDFS:
% hadoop fs -put input/ncdc/sample.txt sample.txt
Now we can run the job. For this, we use the Hadoop pipes command, passing the URI
of the executable in HDFS using the -program argument:
% hadoop pipes \
-D hadoop.pipes.java.recordreader=true \
-D hadoop.pipes.java.recordwriter=true \
-input sample.txt \
-output output \
-program bin/max_temperature
We specify two properties using the -D option: hadoop.pipes.java.recordreader and
hadoop.pipes.java.recordwriter, setting both to true to say that we have not specified
a C++ record reader or writer, but that we want to use the default Java ones (which are
for text input and output). Pipes also allows you to set a Java mapper, reducer,
combiner, or partitioner. In fact, you can have a mixture of Java or C++ classes within
any one job.
The result is the same as the other versions of the same program that we ran.
The Hadoop Distributed Filesystem
When a dataset outgrows the storage capacity of a single physical machine, it becomes
necessary to partition it across a number of separate machines. Filesystems that manage
the storage across a network of machines are called distributed filesystems. Since they
are network-based, all the complications of network programming kick in, thus making
distributed filesystems more complex than regular disk filesystems. For example, one
of the biggest challenges is making the filesystem tolerate node failure without suffering
data loss.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop
Distributed Filesystem. (You may sometimes see references to "DFS," informally or in
older documentation or configurations; it is the same thing.) HDFS is Hadoop's
flagship filesystem and is the focus of this chapter, but Hadoop actually has a general-
purpose filesystem abstraction, so we'll see along the way how Hadoop integrates with
other storage systems (such as the local filesystem and Amazon S3).
The Design of HDFS
HDFS is a filesystem designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware. Let's examine this statement
in more detail:
Very large files
"Very large" in this context means files that are hundreds of megabytes, gigabytes,
or terabytes in size. There are Hadoop clusters running today that store petabytes
of data.
1. The architecture of HDFS is described in "The Hadoop Distributed File System" by Konstantin Shvachko,
Hairong Kuang, Sanjay Radia, and Robert Chansler (Proceedings of MSST2010, May 2010, http://
2. "Scaling Hadoop to 4000 nodes at Yahoo!," http://developer.yahoo.net/blogs/hadoop/2008/09/scaling
Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a
write-once, read-many-times pattern. A dataset is typically generated or copied
from source, and then various analyses are performed on that dataset over time. Each
analysis will involve a large proportion, if not all, of the dataset, so the time to read
the whole dataset is more important than the latency in reading the first record.
Commodity hardware
Hadoop doesn't require expensive, highly reliable hardware to run on. It's designed
to run on clusters of commodity hardware (commonly available hardware available
from multiple vendors) for which the chance of node failure across the cluster is
high, at least for large clusters. HDFS is designed to carry on working without a
noticeable interruption to the user in the face of such failure.
It is also worth examining the applications for which using HDFS does not work so
well. While this may change in the future, these are areas where HDFS is not a good fit
today:
Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds
range, will not work well with HDFS. Remember, HDFS is optimized for delivering
a high throughput of data, and this may be at the expense of latency. HBase
(Chapter 13) is currently a better choice for low-latency access.
Lots of small files
Since the namenode holds filesystem metadata in memory, the limit to the number
of files in a filesystem is governed by the amount of memory on the namenode. As
a rule of thumb, each file, directory, and block takes about 150 bytes. So, for
example, if you had one million files, each taking one block, you would need at
least 300 MB of memory. While storing millions of files is feasible, billions is be-
yond the capability of current hardware.
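The arithmetic behind that rule of thumb is easy to reproduce. The following sketch counts one in-memory metadata item per file and one per block (ignoring directories for simplicity) at roughly 150 bytes each:

```python
def namenode_memory_mb(num_files, blocks_per_file, bytes_per_item=150):
    # One in-memory item per file plus one per block, at ~150 bytes
    # each; directories are ignored for this rough estimate.
    items = num_files + num_files * blocks_per_file
    return items * bytes_per_item / 10**6

# One million single-block files: two million items, about 300 MB.
assert namenode_memory_mb(10**6, 1) == 300

# A billion such files would need on the order of 300 GB of namenode
# memory, which is what puts billions of files out of reach.
assert namenode_memory_mb(10**9, 1) == 300000
```

The estimate also shows why fewer, larger files are kinder to the namenode than many small ones: block count, not total data size, is what consumes its memory.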
Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always made at the
end of the file. There is no support for multiple writers or for modifications at
arbitrary offsets in the file. (These might be supported in the future, but they are
likely to be relatively inefficient.)
3. See Chapter 9 for a typical machine specification.
4. For an in-depth exposition of the scalability limits of HDFS, see Konstantin V. Shvachko's "Scalability
of the Hadoop Distributed File System" (http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of
_the_hadoop_dist.html) and the companion paper "HDFS Scalability: The limits to growth" (April 2010,
pp. 6-16, http://www.usenix.org/publications/login/2010-04/openpdfs/shvachko.pdf) by the same author.
HDFS Concepts
Blocks
A disk has a block size, which is the minimum amount of data that it can read or write.
Filesystems for a single disk build on this by dealing with data in blocks, which are an
integral multiple of the disk block size. Filesystem blocks are typically a few kilobytes
in size, while disk blocks are normally 512 bytes. This is generally transparent to the
filesystem user who is simply reading or writing a file of whatever length. However,
there are tools to perform filesystem maintenance, such as df and fsck, that operate on
the filesystem block level.
HDFS, too, has the concept of a block, but it is a much larger unit: 64 MB by default.
Like in a filesystem for a single disk, files in HDFS are broken into block-sized chunks,
which are stored as independent units. Unlike a filesystem for a single disk, a file in
HDFS that is smaller than a single block does not occupy a full block's worth of un-
derlying storage. When unqualified, the term "block" in this book refers to a block in
HDFS.
Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost
of seeks. By making a block large enough, the time to transfer the data from the disk
can be made significantly larger than the time to seek to the start of the block.
Thus the time to transfer a large file made of multiple blocks operates at the disk transfer
rate.
A quick calculation shows that if the seek time is around 10 ms and the transfer rate
is 100 MB/s, then to make the seek time 1% of the transfer time, we need to make the
block size around 100 MB. The default is actually 64 MB, although many HDFS in-
stallations use 128 MB blocks. This figure will continue to be revised upward as transfer
speeds grow with new generations of disk drives.
This argument shouldn't be taken too far, however. Map tasks in MapReduce normally
operate on one block at a time, so if you have too few tasks (fewer than nodes in the
cluster), your jobs will run slower than they could otherwise.
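The sidebar's calculation can be written out explicitly. This sketch computes seek time as a fraction of transfer time for a given block size, using the 10 ms seek and 100 MB/s transfer figures above:

```python
def seek_fraction(block_size_mb, seek_ms=10.0, transfer_mb_per_s=100.0):
    # Time to read one block at the given transfer rate, in ms,
    # compared against the fixed per-block seek cost.
    transfer_ms = block_size_mb / transfer_mb_per_s * 1000.0
    return seek_ms / transfer_ms

assert seek_fraction(100) == 0.01       # a 100 MB block hits the 1% target
assert seek_fraction(64) == 0.015625    # the 64 MB default comes close
assert round(seek_fraction(0.004)) == 250  # a 4 KB block: seeking dominates
```

As the last line shows, at filesystem-sized blocks the disk would spend hundreds of times longer seeking than transferring, which is exactly the overhead large HDFS blocks are designed to amortize.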
Having a block abstraction for a distributed filesystem brings several benefits. The first
benefit is the most obvious: a file can be larger than any single disk in the network.
There's nothing that requires the blocks from a file to be stored on the same disk, so
they can take advantage of any of the disks in the cluster. In fact, it would be possible,
if unusual, to store a single file on an HDFS cluster whose blocks filled all the disks in
the cluster.
Second, making the unit of abstraction a block rather than a file simplifies the storage
subsystem. Simplicity is something to strive for in all systems, but it is especially
important for a distributed system in which the failure modes are so varied. The storage
subsystem deals with blocks, simplifying storage management (since blocks are a fixed
size, it is easy to calculate how many can be stored on a given disk) and eliminating
metadata concerns (blocks are just a chunk of data to be stored; file metadata such as
permissions information does not need to be stored with the blocks, so another system
can handle metadata separately).
Furthermore, blocks fit well with replication for providing fault tolerance and availa-
bility. To insure against corrupted blocks and disk and machine failure, each block is
replicated to a small number of physically separate machines (typically three). If a block
becomes unavailable, a copy can be read from another location in a way that is trans-
parent to the client. A block that is no longer available due to corruption or machine
failure can be replicated from its alternative locations to other live machines to bring
the replication factor back to the normal level. (See "Data Integrity" on page 83 for
more on guarding against corrupt data.) Similarly, some applications may choose to
set a high replication factor for the blocks in a popular file to spread the read load on
the cluster.
Like its disk filesystem cousin, HDFS's fsck command understands blocks. For exam-
ple, running:
% hadoop fsck / -files -blocks
will list the blocks that make up each file in the filesystem. (See also "Filesystem check
(fsck)" on page 345.)
Namenodes and Datanodes
An HDFS cluster has two types of node operating in a master-worker pattern: a name-
node (the master) and a number of datanodes (workers). The namenode manages the
filesystem namespace. It maintains the filesystem tree and the metadata for all the files
and directories in the tree. This information is stored persistently on the local disk in
the form of two files: the namespace image and the edit log. The namenode also knows
the datanodes on which all the blocks for a given file are located; however, it does
not store block locations persistently, because this information is reconstructed from
datanodes when the system starts.
A client accesses the filesystem on behalf of the user by communicating with the name-
node and datanodes. The client presents a POSIX-like filesystem interface, so the user
code does not need to know about the namenode and datanodes to function.
Datanodes are the workhorses of the filesystem. They store and retrieve blocks when
they are told to (by clients or the namenode), and they report back to the namenode
periodically with lists of blocks that they are storing.
48 | Chapter 3: The Hadoop Distributed Filesystem
Without the namenode, the filesystem cannot be used. In fact, if the machine running
the namenode were obliterated, all the files on the filesystem would be lost since there
would be no way of knowing how to reconstruct the files from the blocks on the
datanodes. For this reason, it is important to make the namenode resilient to failure,
and Hadoop provides two mechanisms for this.
The first way is to back up the files that make up the persistent state of the filesystem
metadata. Hadoop can be configured so that the namenode writes its persistent state
to multiple filesystems. These writes are synchronous and atomic. The usual
configuration choice is to write to local disk as well as a remote NFS mount.
It is also possible to run a secondary namenode, which despite its name does not act as
a namenode. Its main role is to periodically merge the namespace image with the edit
log to prevent the edit log from becoming too large. The secondary namenode usually
runs on a separate physical machine, since it requires plenty of CPU and as much
memory as the namenode to perform the merge. It keeps a copy of the merged
namespace image, which can be used in the event of the namenode failing. However, the
state of the secondary namenode lags that of the primary, so in the event of total failure
of the primary, data loss is almost certain. The usual course of action in this case is to
copy the namenode's metadata files that are on NFS to the secondary and run it as the
new primary.
See "The filesystem image and edit log" on page 338 for more details.
HDFS Federation
The namenode keeps a reference to every file and block in the filesystem in memory,
which means that on very large clusters with many files, memory becomes the limiting
factor for scaling (see "How much memory does a namenode need?" on page 306).
HDFS Federation, introduced in the 0.23 release series, allows a cluster to scale by
adding namenodes, each of which manages a portion of the filesystem namespace. For
example, one namenode might manage all the files rooted under /user, say, and a second
namenode might handle files under /share.
Under federation, each namenode manages a namespace volume, which is made up of
the metadata for the namespace, and a block pool containing all the blocks for the files
in the namespace. Namespace volumes are independent of each other, which means
namenodes do not communicate with one another, and furthermore the failure of one
namenode does not affect the availability of the namespaces managed by other
namenodes. Block pool storage is not partitioned, however, so datanodes register with each
namenode in the cluster and store blocks from multiple block pools.
To access a federated HDFS cluster, clients use client-side mount tables to map file
paths to namenodes. This is managed in configuration using the ViewFileSystem, and
viewfs:// URIs.
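The idea of a client-side mount table can be pictured as a longest-prefix lookup from path to namenode. The sketch below is plain Java and purely illustrative (the namenode names are invented); the real ViewFileSystem is considerably more involved:

```java
import java.util.Map;
import java.util.TreeMap;

public class MountTable {

    // Maps a path prefix to the namenode responsible for that part of the namespace.
    private final TreeMap<String, String> mounts = new TreeMap<>();

    public void addMount(String pathPrefix, String namenode) {
        mounts.put(pathPrefix, namenode);
    }

    /** Resolve a path to a namenode by longest matching prefix. */
    public String resolve(String path) {
        String best = null;
        for (Map.Entry<String, String> e : mounts.entrySet()) {
            if (path.startsWith(e.getKey())
                    && (best == null || e.getKey().length() > best.length())) {
                best = e.getKey();
            }
        }
        if (best == null) {
            throw new IllegalArgumentException("No mount point for " + path);
        }
        return mounts.get(best);
    }

    public static void main(String[] args) {
        MountTable table = new MountTable();
        table.addMount("/user", "namenode1");  // hypothetical namenodes
        table.addMount("/share", "namenode2");
        System.out.println(table.resolve("/user/tom/quangle.txt")); // namenode1
        System.out.println(table.resolve("/share/data"));           // namenode2
    }
}
```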
HDFS High-Availability
The combination of replicating namenode metadata on multiple filesystems, and using
the secondary namenode to create checkpoints, protects against data loss, but does not
provide high availability of the filesystem. The namenode is still a single point of
failure (SPOF), since if it did fail, all clients (including MapReduce jobs) would be
unable to read, write, or list files, because the namenode is the sole repository of the
metadata and the file-to-block mapping. In such an event the whole Hadoop system
would effectively be out of service until a new namenode could be brought online.
To recover from a failed namenode in this situation, an administrator starts a new
primary namenode with one of the filesystem metadata replicas, and configures
datanodes and clients to use this new namenode. The new namenode is not able to serve
requests until it has i) loaded its namespace image into memory, ii) replayed its edit
log, and iii) received enough block reports from the datanodes to leave safe mode. On
large clusters with many files and blocks, the time it takes for a namenode to start from
cold can be 30 minutes or more.
The long recovery time is a problem for routine maintenance too. In fact, since
unexpected failure of the namenode is so rare, the case for planned downtime is actually
more important in practice.
The 0.23 release series of Hadoop remedies this situation by adding support for HDFS
high availability (HA). In this implementation there is a pair of namenodes in an
active-standby configuration. In the event of the failure of the active namenode, the standby
takes over its duties to continue servicing client requests without a significant
interruption. A few architectural changes are needed to allow this to happen:
• The namenodes must use highly available shared storage to share the edit log. (In
the initial implementation of HA this will require an NFS filer, but in future releases
more options will be provided, such as a BookKeeper-based system built on
ZooKeeper.) When a standby namenode comes up it reads up to the end of the shared
edit log to synchronize its state with the active namenode, and then continues to
read new entries as they are written by the active namenode.
• Datanodes must send block reports to both namenodes since the block mappings
are stored in a namenode's memory, and not on disk.
• Clients must be configured to handle namenode failover, which uses a mechanism
that is transparent to users.
If the active namenode fails, then the standby can take over very quickly (in a few tens
of seconds) since it has the latest state available in memory: both the latest edit log
entries, and an up-to-date block mapping. The actual observed failover time will be
longer in practice (around a minute or so), since the system needs to be conservative
in deciding that the active namenode has failed.
In the unlikely event of the standby being down when the active fails, the administrator
can still start the standby from cold. This is no worse than the non-HA case, and from
an operational point of view it's an improvement, since the process is a standard
operational procedure built into Hadoop.
Failover and fencing
The transition from the active namenode to the standby is managed by a new entity in
the system called the failover controller. Failover controllers are pluggable, but the first
implementation uses ZooKeeper to ensure that only one namenode is active. Each
namenode runs a lightweight failover controller process whose job it is to monitor its
namenode for failures (using a simple heartbeating mechanism) and trigger a failover
should a namenode fail.
Failover may also be initiated manually by an administrator, in the case of routine
maintenance, for example. This is known as a graceful failover, since the failover
controller arranges an orderly transition for both namenodes to switch roles.
In the case of an ungraceful failover, however, it is impossible to be sure that the failed
namenode has stopped running. For example, a slow network or a network partition
can trigger a failover transition, even though the previously active namenode is still
running, and thinks it is still the active namenode. The HA implementation goes to
great lengths to ensure that the previously active namenode is prevented from doing
any damage and causing corruption, a method known as fencing. The system employs
a range of fencing mechanisms, including killing the namenode's process, revoking its
access to the shared storage directory (typically by using a vendor-specific NFS
command), and disabling its network port via a remote management command. As a last
resort, the previously active namenode can be fenced with a technique rather
graphically known as STONITH, or "shoot the other node in the head", which uses a
specialized power distribution unit to forcibly power down the host machine.
Client failover is handled transparently by the client library. The simplest
implementation uses client-side configuration to control failover. The HDFS URI uses a logical
hostname which is mapped to a pair of namenode addresses (in the configuration file),
and the client library tries each namenode address until the operation succeeds.
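The try-each-address behavior can be sketched as follows. This is an illustration only: the Operation interface and the address strings are invented for the example, and the real client library layers retry policies and timeouts on top of this basic idea.

```java
import java.io.IOException;
import java.util.List;

public class FailoverClient {

    /** A filesystem operation attempted against one namenode address. */
    public interface Operation<T> {
        T run(String namenodeAddress) throws IOException;
    }

    /** Try the operation against each configured address until one succeeds. */
    public static <T> T withFailover(List<String> addresses, Operation<T> op)
            throws IOException {
        IOException last = null;
        for (String address : addresses) {
            try {
                return op.run(address);
            } catch (IOException e) {
                last = e; // this namenode may be down or standby; try the next one
            }
        }
        throw last != null ? last : new IOException("no namenode addresses configured");
    }
}
```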
The Command-Line Interface
We're going to have a look at HDFS by interacting with it from the command line.
There are many other interfaces to HDFS, but the command line is one of the simplest
and, to many developers, the most familiar.
We are going to run HDFS on one machine, so first follow the instructions for setting
up Hadoop in pseudo-distributed mode in Appendix A. Later you'll see how to run on
a cluster of machines to give us scalability and fault tolerance.
There are two properties that we set in the pseudo-distributed configuration that
deserve further explanation. The first is fs.default.name, set to hdfs://localhost/, which is
used to set a default filesystem for Hadoop. Filesystems are specified by a URI, and
here we have used an hdfs URI to configure Hadoop to use HDFS by default. The HDFS
daemons will use this property to determine the host and port for the HDFS namenode.
We'll be running it on localhost, on the default HDFS port, 8020. And HDFS clients
will use this property to work out where the namenode is running so they can connect
to it.
We set the second property, dfs.replication, to 1 so that HDFS doesn't replicate
filesystem blocks by the default factor of three. When running with a single datanode,
HDFS can't replicate blocks to three datanodes, so it would perpetually warn about
blocks being under-replicated. This setting solves that problem.
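For reference, the two properties might be written along these lines (the values are the ones given above; conventionally fs.default.name lives in core-site.xml and dfs.replication in hdfs-site.xml, though the exact layout varies between Hadoop releases):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```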
Basic Filesystem Operations
The filesystem is ready to be used, and we can do all of the usual filesystem operations
such as reading files, creating directories, moving files, deleting data, and listing
directories. You can type hadoop fs -help to get detailed help on every command.
Start by copying a file from the local filesystem to HDFS:
% hadoop fs -copyFromLocal input/docs/quangle.txt hdfs://localhost/user/tom/
This command invokes Hadoop's filesystem shell command fs, which supports a
number of subcommands; in this case, we are running -copyFromLocal. The local file
quangle.txt is copied to the file /user/tom/quangle.txt on the HDFS instance running on
localhost. In fact, we could have omitted the scheme and host of the URI and picked
up the default, hdfs://localhost, as specified in core-site.xml:
% hadoop fs -copyFromLocal input/docs/quangle.txt /user/tom/quangle.txt
We could also have used a relative path and copied the file to our home directory in
HDFS, which in this case is /user/tom:
% hadoop fs -copyFromLocal input/docs/quangle.txt quangle.txt
Let's copy the file back to the local filesystem and check whether it's the same:
% hadoop fs -copyToLocal quangle.txt quangle.copy.txt
% md5 input/docs/quangle.txt quangle.copy.txt
MD5 (input/docs/quangle.txt) = a16f231da6b05e2ba7a339320e7dacd9
MD5 (quangle.copy.txt) = a16f231da6b05e2ba7a339320e7dacd9
The MD5 digests are the same, showing that the file survived its trip to HDFS and is
back intact.
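The same comparison can be done programmatically. The sketch below uses the JDK's MessageDigest class (not a Hadoop API) to compute hex MD5 digests, mirroring what the md5 command does:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Check {

    /** Hex-encoded MD5 digest of the given bytes. */
    public static String md5(byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] original = "On the top of the Crumpetty Tree".getBytes(StandardCharsets.UTF_8);
        byte[] copy = original.clone();
        // A faithful round trip through any filesystem leaves the digest unchanged.
        System.out.println(md5(original).equals(md5(copy))); // true
    }
}
```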
Finally, let's look at an HDFS file listing. We create a directory first just to see how it
is displayed in the listing:
% hadoop fs -mkdir books
% hadoop fs -ls .
Found 2 items
drwxr-xr-x - tom supergroup 0 2009-04-02 22:41 /user/tom/books
-rw-r--r-- 1 tom supergroup 118 2009-04-02 22:29 /user/tom/quangle.txt
The information returned is very similar to that of the Unix command ls -l, with a few
minor differences. The first column shows the file mode. The second column is the
replication factor of the file (something a traditional Unix filesystem does not have).
Remember we set the default replication factor in the site-wide configuration to be 1,
which is why we see the same value here. The entry in this column is empty for
directories since the concept of replication does not apply to them; directories are treated
as metadata and stored by the namenode, not the datanodes. The third and fourth
columns show the file owner and group. The fifth column is the size of the file in bytes,
or zero for directories. The sixth and seventh columns are the last modified date and
time. Finally, the eighth column is the absolute name of the file or directory.
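As a quick illustration of the eight columns, here is a small, hypothetical helper (plain Java, not part of Hadoop) that splits a listing line into its fields:

```java
public class LsLine {

    // Field names follow the column descriptions above.
    public final String mode, replication, owner, group, size, date, time, name;

    public LsLine(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 8) {
            throw new IllegalArgumentException("expected 8 columns, got " + f.length);
        }
        mode = f[0]; replication = f[1]; owner = f[2]; group = f[3];
        size = f[4]; date = f[5]; time = f[6]; name = f[7];
    }
}
```

Applied to the second line of the listing above, mode is `-rw-r--r--`, replication is `1`, and name is `/user/tom/quangle.txt`.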
File Permissions in HDFS
HDFS has a permissions model for files and directories that is much like POSIX.
There are three types of permission: the read permission (r), the write permission (w),
and the execute permission (x). The read permission is required to read files or list the
contents of a directory. The write permission is required to write a file, or for a
directory, to create or delete files or directories in it. The execute permission is ignored for
a file since you can't execute a file on HDFS (unlike POSIX), and for a directory it is
required to access its children.
Each file and directory has an owner, a group, and a mode. The mode is made up of the
permissions for the user who is the owner, the permissions for the users who are
members of the group, and the permissions for users who are neither the owners nor
members of the group.
By default, a client's identity is determined by the username and groups of the process
it is running in. Because clients are remote, this makes it possible to become an arbitrary
user, simply by creating an account of that name on the remote system. Thus,
permissions should be used only in a cooperative community of users, as a mechanism for
sharing filesystem resources and for avoiding accidental data loss, and not for securing
resources in a hostile environment. (Note, however, that the latest versions of Hadoop
support Kerberos authentication, which removes these restrictions; see "Security"
on page 323.) Despite these limitations, it is worthwhile having permissions enabled
(as it is by default; see the dfs.permissions property), to avoid accidental modification
or deletion of substantial parts of the filesystem, either by users or by automated
tools or programs.
When permissions checking is enabled, the owner permissions are checked if the
client's username matches the owner, and the group permissions are checked if the client
is a member of the group; otherwise, the other permissions are checked.
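The check order just described can be sketched in a few lines of plain Java. This is an illustration of the rule, not the HDFS implementation; the mode is represented as the familiar nine-character rwxrwxrwx string:

```java
import java.util.Set;

public class PermissionCheck {

    /**
     * If the client's username matches the owner, only the owner bits are
     * consulted; otherwise, if the client belongs to the file's group, the
     * group bits; otherwise the other bits. The mode is a 9-character string
     * such as "rwxr-xr-x"; action is 'r', 'w' or 'x'.
     */
    public static boolean isAllowed(String user, Set<String> userGroups,
                                    String fileOwner, String fileGroup,
                                    String mode, char action) {
        int tripleStart;
        if (user.equals(fileOwner)) {
            tripleStart = 0;                       // owner bits
        } else if (userGroups.contains(fileGroup)) {
            tripleStart = 3;                       // group bits
        } else {
            tripleStart = 6;                       // other bits
        }
        int offset = action == 'r' ? 0 : action == 'w' ? 1 : 2;
        return mode.charAt(tripleStart + offset) == action;
    }
}
```

Note that only one permission triple is ever consulted: an owner who lacks a permission is not granted it through group or other bits.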
There is a concept of a super-user, which is the identity of the namenode process.
Permissions checks are not performed for the super-user.
Hadoop Filesystems
Hadoop has an abstract notion of filesystem, of which HDFS is just one
implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents a
filesystem in Hadoop, and there are several concrete implementations, which are described
in Table 3-1.
Table 3-1. Hadoop filesystems (Java implementations are all under org.apache.hadoop)

Local (URI scheme: file; implementation: fs.LocalFileSystem). A filesystem for a
locally connected disk with client-side checksums. Use RawLocalFileSystem for a
local filesystem with no checksums. See "LocalFileSystem" on page 84.

HDFS (URI scheme: hdfs; implementation: hdfs.DistributedFileSystem). Hadoop's
distributed filesystem. HDFS is designed to work efficiently in conjunction with
MapReduce.

HFTP (URI scheme: hftp; implementation: hdfs.HftpFileSystem). A filesystem
providing read-only access to HDFS over HTTP. (Despite its name, HFTP has no
connection with FTP.) Often used with distcp (see "Parallel Copying with distcp"
on page 76) to copy data between HDFS clusters running different versions.

HSFTP (URI scheme: hsftp; implementation: hdfs.HsftpFileSystem). A filesystem
providing read-only access to HDFS over HTTPS. (Again, this has no connection
with FTP.)

WebHDFS (URI scheme: webhdfs; implementation: hdfs.web.WebHdfsFileSystem). A
filesystem providing secure read-write access to HDFS over HTTP. WebHDFS is
intended as a replacement for HFTP and HSFTP.

HAR (URI scheme: har; implementation: fs.HarFileSystem). A filesystem layered on
another filesystem for archiving files. Hadoop Archives are typically used for
archiving files in HDFS to reduce the namenode's memory usage. See "Hadoop
Archives" on page 78.

KFS (CloudStore) (URI scheme: kfs; implementation: fs.kfs.KosmosFileSystem).
CloudStore (formerly Kosmos filesystem) is a distributed filesystem like HDFS or
Google's GFS, written in C++.

FTP (URI scheme: ftp; implementation: fs.ftp.FTPFileSystem). A filesystem backed
by an FTP server.

S3 (native) (URI scheme: s3n; implementation: fs.s3native.NativeS3FileSystem). A
filesystem backed by Amazon S3.

S3 (block-based) (URI scheme: s3; implementation: fs.s3.S3FileSystem). A filesystem
backed by Amazon S3, which stores files in blocks (much like HDFS) to overcome
S3's 5 GB file size limit.

Distributed RAID (URI scheme: hdfs; implementation: hdfs.DistributedRaidFileSystem).
A "RAID" version of HDFS designed for archival storage. For each file in HDFS, a
(smaller) parity file is created, which allows the HDFS replication to be reduced from
three to two, which reduces disk usage by 25% to 30%, while keeping the probability
of data loss the same. Distributed RAID requires that you run a RaidNode daemon
on the cluster.

View (URI scheme: viewfs; implementation: viewfs.ViewFileSystem). A client-side
mount table for other Hadoop filesystems. Commonly used to create mount points
for federated namenodes (see "HDFS Federation" on page 49).
Hadoop provides many interfaces to its filesystems, and it generally uses the URI
scheme to pick the correct filesystem instance to communicate with. For example, the
filesystem shell that we met in the previous section operates with all Hadoop
filesystems. To list the files in the root directory of the local filesystem, type:
% hadoop fs -ls file:///
Although it is possible (and sometimes very convenient) to run MapReduce programs
that access any of these filesystems, when you are processing large volumes of data,
you should choose a distributed filesystem that has the data locality optimization,
notably HDFS (see "Scaling Out" on page 30).
Hadoop is written in Java, and all Hadoop filesystem interactions are mediated through
the Java API. The filesystem shell, for example, is a Java application that uses the Java
FileSystem class to provide filesystem operations. The other filesystem interfaces are
discussed briefly in this section. These interfaces are most commonly used with HDFS,
since the other filesystems in Hadoop typically have existing tools to access the
underlying filesystem (FTP clients for FTP, S3 tools for S3, etc.), but many of them will
work with any Hadoop filesystem.
There are two ways of accessing HDFS over HTTP: directly, where the HDFS daemons
serve HTTP requests to clients; and via a proxy (or proxies), which accesses HDFS on
the client's behalf using the usual DistributedFileSystem API. The two ways are
illustrated in Figure 3-1.
Figure 3-1. Accessing HDFS over HTTP directly, and via a bank of HDFS proxies
In the first case, directory listings are served by the namenode's embedded web server
(which runs on port 50070) formatted in XML or JSON, while file data is streamed
from datanodes by their web servers (running on port 50075).
The original direct HTTP interface (HFTP and HSFTP) was read-only, while the new
WebHDFS implementation supports all filesystem operations, including Kerberos
authentication. WebHDFS must be enabled by setting dfs.webhdfs.enabled to true, for
you to be able to use webhdfs URIs.
The second way of accessing HDFS over HTTP relies on one or more standalone proxy
servers. (The proxies are stateless so they can run behind a standard load balancer.) All
traffic to the cluster passes through the proxy. This allows for stricter firewall and
bandwidth limiting policies to be put in place. It's common to use a proxy for transfers
between Hadoop clusters located in different data centers.
The original HDFS proxy (in src/contrib/hdfsproxy) was read-only, and could be
accessed by clients using the HSFTP FileSystem implementation (hsftp URIs). From
release 0.23, there is a new proxy called HttpFS that has read and write capabilities, and
which exposes the same HTTP interface as WebHDFS, so clients can access either using
webhdfs URIs.
The HTTP REST API that WebHDFS exposes is formally defined in a specification, so
it is likely that over time clients in languages other than Java will be written that use it.
Hadoop provides a C library called libhdfs that mirrors the Java FileSystem interface
(it was written as a C library for accessing HDFS, but despite its name it can be used
to access any Hadoop filesystem). It works using the Java Native Interface (JNI) to call
a Java filesystem client.
The C API is very similar to the Java one, but it typically lags the Java one, so newer
features may not be supported. You can find the generated documentation for the C
API in the libhdfs/docs/api directory of the Hadoop distribution.
Hadoop comes with prebuilt libhdfs binaries for 32-bit Linux, but for other platforms,
you will need to build them yourself using the instructions at http://wiki.apache.org/.
Filesystem in Userspace (FUSE) allows filesystems that are implemented in user space
to be integrated as a Unix filesystem. Hadoop's Fuse-DFS contrib module allows any
Hadoop filesystem (but typically HDFS) to be mounted as a standard filesystem. You
can then use Unix utilities (such as ls and cat) to interact with the filesystem, as well
as POSIX libraries to access the filesystem from any programming language.
Fuse-DFS is implemented in C using libhdfs as the interface to HDFS. Documentation
for compiling and running Fuse-DFS is located in the src/contrib/fuse-dfs directory of
the Hadoop distribution.
the Hauoop uistiiLution.
The Java Interface
In this section, we dig into Hadoop's FileSystem class: the API for interacting with
one of Hadoop's filesystems.
While we focus mainly on the HDFS implementation,
DistributedFileSystem, in general you should strive to write your code against the
FileSystem abstract class, to retain portability across filesystems. This is very useful
when testing your program, for example, since you can rapidly run tests using data
stored on the local filesystem.
Reading Data from a Hadoop URL
One of the simplest ways to read a file from a Hadoop filesystem is by using a
java.net.URL object to open a stream to read the data from. The general idiom is:
InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // process in
} finally {
    IOUtils.closeStream(in);
}
There's a little bit more work required to make Java recognize Hadoop's hdfs URL
scheme. This is achieved by calling the setURLStreamHandlerFactory method on URL
5. From release 0.21.0, there is a new filesystem interface called FileContext with better handling of multiple
filesystems (so a single FileContext can resolve multiple filesystem schemes, for example) and a cleaner,
more consistent interface.
with an instance of FsUrlStreamHandlerFactory. This method can only be called once
per JVM, so it is typically executed in a static block. This limitation means that if some
other part of your program (perhaps a third-party component outside your control)
sets a URLStreamHandlerFactory, you won't be able to use this approach for reading
data from Hadoop. The next section discusses an alternative.
Example 3-1 shows a program for displaying files from Hadoop filesystems on standard
output, like the Unix cat command.
Example 3-1. Displaying files from a Hadoop filesystem on standard output using a URLStreamHandler
public class URLCat {

    static {
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
We make use of the handy IOUtils class that comes with Hadoop for closing the stream
in the finally clause, and also for copying bytes between the input stream and the
output stream (System.out in this case). The last two arguments to the copyBytes
method are the buffer size used for copying and whether to close the streams when the
copy is complete. We close the input stream ourselves, and System.out doesn't need to
be closed.
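A simplified stand-in for copyBytes, written against plain java.io (the real IOUtils is a Hadoop class and handles more cases), shows the role of the buffer size and the close flag:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyBytes {

    /** Copy in to out using the given buffer size, optionally closing both streams. */
    public static void copyBytes(InputStream in, OutputStream out,
                                 int bufferSize, boolean close) throws IOException {
        try {
            byte[] buffer = new byte[bufferSize];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } finally {
            if (close) {
                in.close();
                out.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("hello".getBytes("UTF-8"));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        copyBytes(in, out, 4096, false); // false: the caller manages the streams
        System.out.println(out.toString("UTF-8")); // hello
    }
}
```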
Here's a sample run:
% hadoop URLCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
Reading Data Using the FileSystem API
As the previous section explained, sometimes it is impossible to set a
URLStreamHandlerFactory for your application. In this case, you will need to use the FileSystem API
to open an input stream for a file.
A file in a Hadoop filesystem is represented by a Hadoop Path object (and not
a java.io.File object, since its semantics are too closely tied to the local filesystem).
You can think of a Path as a Hadoop filesystem URI, such as hdfs://localhost/user/tom/quangle.txt.
FileSystem is a general filesystem API, so the first step is to retrieve an instance for the
filesystem we want to use, HDFS in this case. There are several static factory methods
for getting a FileSystem instance:
public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException
A Configuration object encapsulates a client or server's configuration, which is set using
configuration files read from the classpath, such as conf/core-site.xml. The first method
returns the default filesystem (as specified in the file conf/core-site.xml, or the default
local filesystem if not specified there). The second uses the given URI's scheme and
authority to determine the filesystem to use, falling back to the default filesystem if no
scheme is specified in the given URI. The third retrieves the filesystem as the given user.
In some cases, you may want to retrieve a local filesystem instance, in which case you
can use the convenience method, getLocal():
public static LocalFileSystem getLocal(Configuration conf) throws IOException
With a FileSystem instance in hand, we invoke an open() method to get the input stream
for a file:
public FSDataInputStream open(Path f) throws IOException
public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException
The first method uses a default buffer size of 4 K.
Putting this together, we can rewrite Example 3-1 as shown in Example 3-2.
6. The text is from The Quangle Wangle's Hat by Edward Lear.
Example 3-2. Displaying files from a Hadoop filesystem on standard output by using the FileSystem directly
public class FileSystemCat {

    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
The program runs as follows:
% hadoop FileSystemCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
The open() method on FileSystem actually returns a FSDataInputStream rather than a
standard java.io class. This class is a specialization of java.io.DataInputStream with
support for random access, so you can read from any part of the stream:
package org.apache.hadoop.fs;

public class FSDataInputStream extends DataInputStream
        implements Seekable, PositionedReadable {
    // implementation elided
}
The Seekable interface permits seeking to a position in the file and provides a query
method for the current offset from the start of the file (getPos()):
public interface Seekable {
    void seek(long pos) throws IOException;
    long getPos() throws IOException;
}
Calling seek() with a position that is greater than the length of the file will result in an
IOException. Unlike the skip() method of java.io.InputStream, which positions the
stream at a point later than the current position, seek() can move to an arbitrary,
absolute position in the file.
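The absolute-positioning behavior of seek() can be demonstrated with a local-filesystem analogy using java.io.RandomAccessFile (an analogy only; FSDataInputStream is the Hadoop class in question, and unlike InputStream.skip(), these seeks can also move backward):

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class SeekVsSkip {

    /** Read the single byte at an absolute position in the file. */
    public static char charAt(File f, long pos) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(f, "r")) {
            raf.seek(pos); // absolute: jump straight to the given offset
            return (char) raf.read();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("seekdemo", ".txt");
        f.deleteOnExit();
        try (FileOutputStream out = new FileOutputStream(f)) {
            out.write("0123456789".getBytes("US-ASCII"));
        }
        System.out.println(charAt(f, 7)); // 7
        System.out.println(charAt(f, 2)); // 2: seeking backward is fine too
    }
}
```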
Example 3-3 is a simple extension of Example 3-2 that writes a file to standard out
twice: after writing it once, it seeks to the start of the file and streams through it once
more.
Example 3-3. Displaying files from a Hadoop filesystem on standard output twice, by using seek
public class FileSystemDoubleCat {

    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.seek(0); // go back to the start of the file
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
Here's the result of running it on a small file:
% hadoop FileSystemDoubleCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
FSDataInputStream also implements the PositionedReadable interface for reading parts
of a file at a given offset:
public interface PositionedReadable {

    public int read(long position, byte[] buffer, int offset, int length)
        throws IOException;

    public void readFully(long position, byte[] buffer, int offset, int length)
        throws IOException;

    public void readFully(long position, byte[] buffer) throws IOException;
}
The read() method reads up to length bytes from the given position in the file into the
buffer at the given offset in the buffer. The return value is the number of bytes actually
read: callers should check this value, as it may be less than length. The readFully()
methods will read length bytes into the buffer (or buffer.length bytes for the version
that just takes a byte array buffer), unless the end of the file is reached, in which case
an EOFException is thrown.
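The contract described above (read() returning a possibly short count, and readFully() throwing an EOFException when the data runs out) can be mimicked over an in-memory byte array. This is a sketch of the semantics, not the Hadoop implementation:

```java
import java.io.EOFException;

public class PositionedRead {

    /**
     * Read up to length bytes starting at position; returns the number of
     * bytes actually read, which may be less than length near the end of
     * the data, or -1 if position is at or past the end.
     */
    public static int read(byte[] data, long position, byte[] buffer,
                           int offset, int length) {
        if (position >= data.length) {
            return -1;
        }
        int n = Math.min(length, data.length - (int) position);
        System.arraycopy(data, (int) position, buffer, offset, n);
        return n;
    }

    /** Read exactly length bytes, or throw EOFException if the data runs out. */
    public static void readFully(byte[] data, long position, byte[] buffer,
                                 int offset, int length) throws EOFException {
        int n = read(data, position, buffer, offset, length);
        if (n < length) {
            throw new EOFException("reached end of data after " + Math.max(n, 0) + " bytes");
        }
    }
}
```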
All of these methods preserve the current offset in the file and are thread-safe, so they
provide a convenient way to access another part of the file (metadata, perhaps) while
reading the main body of the file. In fact, they are just implemented using the
Seekable interface using the following pattern:
long oldPos = getPos();
try {
    // read data
} finally {
    seek(oldPos);
}
Finally, bear in mind that calling seek() is a relatively expensive operation and should
be used sparingly. You should structure your application access patterns to rely on
streaming data (by using MapReduce, for example) rather than performing a large
number of seeks.
Writing Data
The FileSystem class has a number of methods for creating a file. The simplest is the
method that takes a Path object for the file to be created and returns an output stream
to write to:
public FSDataOutputStream create(Path f) throws IOException
There are overloaded versions of this method that allow you to specify whether to
forcibly overwrite existing files, the replication factor of the file, the buffer size to use
when writing the file, the block size for the file, and file permissions.
The create() methods create any parent directories of the file to be
written that don't already exist. Though convenient, this behavior may
be unexpected. If you want the write to fail if the parent directory doesn't
exist, then you should check for the existence of the parent directory
first by calling the exists() method.
There's also an overloaded method for passing a callback interface, Progressable, so
your application can be notified of the progress of the data being written to the
datanodes:
package org.apache.hadoop.util;
public interface Progressable {
public void progress();
62 | Chapter 3: The Hadoop Distributed Filesystem
As an alternative to creating a new file, you can append to an existing file using the append() method (there are also some other overloaded versions):
public FSDataOutputStream append(Path f) throws IOException
The append operation allows a single writer to modify an already written file by opening it and writing data from the final offset in the file. With this API, applications that produce unbounded files, such as logfiles, can write to an existing file after a restart, for example. The append operation is optional and not implemented by all Hadoop filesystems. For example, HDFS supports append, but S3 filesystems don't.
Example 3-4 shows how to copy a local file to a Hadoop filesystem. We illustrate progress by printing a period every time the progress() method is called by Hadoop, which is after each 64 K packet of data is written to the datanode pipeline. (Note that this particular behavior is not specified by the API, so it is subject to change in later versions of Hadoop. The API merely allows you to infer that "something is happening.")
Example 3-4. Copying a local file to a Hadoop filesystem
public class FileCopyWithProgress {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];
    String dst = args[1];

    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(dst), conf);
    OutputStream out = fs.create(new Path(dst), new Progressable() {
      public void progress() {
        System.out.print(".");
      }
    });

    IOUtils.copyBytes(in, out, 4096, true);
  }
}
Typical usage:
% hadoop FileCopyWithProgress input/docs/1400-8.txt hdfs://localhost/user/tom/
Currently, none of the other Hadoop filesystems call progress() during writes. Progress is important in MapReduce applications, as you will see in later chapters.
The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has a method for querying the current position in the file:
package org.apache.hadoop.fs;
public class FSDataOutputStream extends DataOutputStream implements Syncable {

  public long getPos() throws IOException {
    // implementation elided
  }

  // implementation elided
}
However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This is because HDFS allows only sequential writes to an open file or appends to an already written file. In other words, there is no support for writing to anywhere other than the end of the file, so there is no value in being able to seek while writing.
Directories
FileSystem provides a method to create a directory:
public boolean mkdirs(Path f) throws IOException
This method creates all of the necessary parent directories if they don't already exist, just like java.io.File's mkdirs() method. It returns true if the directory (and all parent directories) was (were) successfully created.
Often, you don't need to explicitly create a directory, since writing a file by calling create() will automatically create any parent directories.
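Since the behavior is modeled on java.io.File's mkdirs(), you can see the same semantics with plain Java and no Hadoop dependency. This minimal sketch uses a temporary directory:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class MkdirsDemo {
    public static void main(String[] args) throws IOException {
        File base = Files.createTempDirectory("mkdirs-demo").toFile();
        File nested = new File(base, "a/b/c");

        // Creates a, a/b, and a/b/c in one call; returns true on success.
        System.out.println(nested.mkdirs());      // true
        System.out.println(nested.isDirectory()); // true

        // A second call creates nothing and returns false.
        System.out.println(nested.mkdirs());      // false
    }
}
```

The false return on the second call shows why checking the return value alone can be misleading: it reports whether directories were created, not whether the directory exists.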
Querying the Filesystem
File metadata: FileStatus
An important feature of any filesystem is the ability to navigate its directory structure and retrieve information about the files and directories that it stores. The FileStatus class encapsulates filesystem metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information.
The method getFileStatus() on FileSystem provides a way of getting a FileStatus object for a single file or directory. Example 3-5 shows an example of its use.
Example 3-5. Demonstrating file status information
public class ShowFileStatusTest {

  private MiniDFSCluster cluster; // use an in-process HDFS cluster for testing
  private FileSystem fs;

  @Before
  public void setUp() throws IOException {
    Configuration conf = new Configuration();
    if (System.getProperty("test.build.data") == null) {
      System.setProperty("test.build.data", "/tmp");
    }
    cluster = new MiniDFSCluster(conf, 1, true, null);
    fs = cluster.getFileSystem();
    OutputStream out = fs.create(new Path("/dir/file"));
    out.write("content".getBytes("UTF-8"));
    out.close();
  }

  @After
  public void tearDown() throws IOException {
    if (fs != null) { fs.close(); }
    if (cluster != null) { cluster.shutdown(); }
  }

  @Test(expected = FileNotFoundException.class)
  public void throwsFileNotFoundForNonExistentFile() throws IOException {
    fs.getFileStatus(new Path("no-such-file"));
  }

  @Test
  public void fileStatusForFile() throws IOException {
    Path file = new Path("/dir/file");
    FileStatus stat = fs.getFileStatus(file);
    assertThat(stat.getPath().toUri().getPath(), is("/dir/file"));
    assertThat(stat.isDir(), is(false));
    assertThat(stat.getLen(), is(7L));
    assertThat(stat.getReplication(), is((short) 1));
    assertThat(stat.getBlockSize(), is(64 * 1024 * 1024L));
    assertThat(stat.getOwner(), is("tom"));
    assertThat(stat.getGroup(), is("supergroup"));
    assertThat(stat.getPermission().toString(), is("rw-r--r--"));
  }

  @Test
  public void fileStatusForDirectory() throws IOException {
    Path dir = new Path("/dir");
    FileStatus stat = fs.getFileStatus(dir);
    assertThat(stat.getPath().toUri().getPath(), is("/dir"));
    assertThat(stat.isDir(), is(true));
    assertThat(stat.getLen(), is(0L));
    assertThat(stat.getReplication(), is((short) 0));
    assertThat(stat.getBlockSize(), is(0L));
    assertThat(stat.getOwner(), is("tom"));
    assertThat(stat.getGroup(), is("supergroup"));
    assertThat(stat.getPermission().toString(), is("rwxr-xr-x"));
  }
}
If no file or directory exists, a FileNotFoundException is thrown. However, if you are interested only in the existence of a file or directory, the exists() method on FileSystem is more convenient:
public boolean exists(Path f) throws IOException
Listing files
Finding information on a single file or directory is useful, but you also often need to be able to list the contents of a directory. That's what FileSystem's listStatus() methods are for:
public FileStatus[] listStatus(Path f) throws IOException
public FileStatus[] listStatus(Path f, PathFilter filter) throws IOException
public FileStatus[] listStatus(Path[] files) throws IOException
public FileStatus[] listStatus(Path[] files, PathFilter filter) throws IOException
When the argument is a file, the simplest variant returns an array of FileStatus objects of length 1. When the argument is a directory, it returns zero or more FileStatus objects representing the files and directories contained in the directory.
Overloaded variants allow a PathFilter to be supplied to restrict the files and directories to match; you will see an example in "PathFilter" on page 68. Finally, if you specify an array of paths, the result is a shortcut for calling the equivalent single-path listStatus() method for each path in turn and accumulating the FileStatus object arrays in a single array. This can be useful for building up lists of input files to process from distinct parts of the filesystem tree. Example 3-6 is a simple demonstration of this idea. Note the use of stat2Paths() in FileUtil for turning an array of FileStatus objects into an array of Path objects.
Example 3-6. Showing the file statuses for a collection of paths in a Hadoop filesystem
public class ListStatus {

  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    Path[] paths = new Path[args.length];
    for (int i = 0; i < paths.length; i++) {
      paths[i] = new Path(args[i]);
    }

    FileStatus[] status = fs.listStatus(paths);
    Path[] listedPaths = FileUtil.stat2Paths(status);
    for (Path p : listedPaths) {
      System.out.println(p);
    }
  }
}
We can use this program to find the union of directory listings for a collection of paths:
% hadoop ListStatus hdfs://localhost/ hdfs://localhost/user/tom
File patterns
It is a common requirement to process sets of files in a single operation. For example, a MapReduce job for log processing might analyze a month's worth of files contained in a number of directories. Rather than having to enumerate each file and directory to specify the input, it is convenient to use wildcard characters to match multiple files with a single expression, an operation that is known as globbing. Hadoop provides two FileSystem methods for processing globs:
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
The globStatus() method returns an array of FileStatus objects whose paths match the supplied pattern, sorted by path. An optional PathFilter can be specified to restrict the matches further.
Hadoop supports the same set of glob characters as Unix bash (see Table 3-2).
Table 3-2. Glob characters and their meanings
Glob      Name                     Matches
*         asterisk                 Matches zero or more characters
?         question mark            Matches a single character
[ab]      character class          Matches a single character in the set {a, b}
[^ab]     negated character class  Matches a single character that is not in the set {a, b}
[a-b]     character range          Matches a single character in the (closed) range [a, b], where a is
                                   lexicographically less than or equal to b
[^a-b]    negated character range  Matches a single character that is not in the (closed) range [a, b],
                                   where a is lexicographically less than or equal to b
{a,b}     alternation              Matches either expression a or b
\c        escaped character        Matches character c when it is a metacharacter
Imagine that logfiles are stored in a directory structure organized hierarchically by date. So, for example, logfiles for the last day of 2007 would go in a directory named /2007/12/31. Suppose that the full file listing is:
• /2007/12/30
• /2007/12/31
• /2008/01/01
• /2008/01/02
Here are some file globs and their expansions:
Glob Expansion
/* /2007 /2008
/*/* /2007/12 /2008/01
/*/12/* /2007/12/30 /2007/12/31
/200? /2007 /2008
/200[78] /2007 /2008
/200[7-8] /2007 /2008
/200[^01234569] /2007 /2008
/*/*/{31,01} /2007/12/31 /2008/01/01
/*/*/3{0,1} /2007/12/30 /2007/12/31
/*/{12/31,01/01} /2007/12/31 /2008/01/01
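Most of the glob characters in Table 3-2 also exist in the JDK's java.nio.file glob support, which is a convenient way to experiment with patterns outside Hadoop. This is a sketch using the standard PathMatcher API, not Hadoop's globber; one difference to note is that java.nio writes a negated character class as [!ab] rather than Hadoop's [^ab].

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobDemo {
    // True if the glob pattern matches the whole path.
    static boolean matches(String pattern, String path) {
        PathMatcher m = FileSystems.getDefault()
                .getPathMatcher("glob:" + pattern);
        return m.matches(Paths.get(path));
    }

    public static void main(String[] args) {
        System.out.println(matches("/200?", "/2007"));              // ? : one char
        System.out.println(matches("/200[78]", "/2008"));           // character class
        System.out.println(matches("/*/*/{31,01}", "/2007/12/31")); // alternation
        System.out.println(matches("/200[!01]", "/2007"));          // negation, JDK style
        System.out.println(matches("/200?", "/19997"));             // false: prefix differs
    }
}
```

Trying the expansions from the table above against this matcher is a quick way to convince yourself how each metacharacter behaves before running a real globStatus() call.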
Glob patterns are not always powerful enough to describe a set of files you want to access. For example, it is not generally possible to exclude a particular file using a glob pattern. The listStatus() and globStatus() methods of FileSystem take an optional PathFilter, which allows programmatic control over matching:
package org.apache.hadoop.fs;

public interface PathFilter {
  boolean accept(Path path);
}

PathFilter is the equivalent of java.io.FileFilter for Path objects rather than File objects. Example 3-7 shows a PathFilter for excluding paths that match a regular expression.
Example 3-7. A PathFilter for excluding paths that match a regular expression
public class RegexExcludePathFilter implements PathFilter {

  private final String regex;

  public RegexExcludePathFilter(String regex) {
    this.regex = regex;
  }

  public boolean accept(Path path) {
    return !path.toString().matches(regex);
  }
}
The filter passes only files that don't match the regular expression. We use the filter in conjunction with a glob that picks out an initial set of files to include: the filter is used to refine the results. For example:

fs.globStatus(new Path("/2007/*/*"), new RegexExcludePathFilter("^.*/2007/12/31$"))

will expand to /2007/12/30.
Filters can only act on a file's name, as represented by a Path. They can't use a file's properties, such as creation time, as the basis of the filter. Nevertheless, they can perform matching that neither glob patterns nor regular expressions can achieve. For example, if you store files in a directory structure that is laid out by date (like in the previous section), then you can write a PathFilter to pick out files that fall in a given date range.
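As a sketch of that last idea in plain Java: the class below stands in for a real PathFilter implementation (its accept() takes a String rather than a Hadoop Path, so it can run without Hadoop on the classpath). It parses the /year/month/day layout from the previous section and accepts only paths inside a given inclusive date range.

```java
import java.time.LocalDate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateRangeFilter {
    private static final Pattern DATE =
            Pattern.compile("^/(\\d{4})/(\\d{2})/(\\d{2})$");

    private final LocalDate start, end; // inclusive bounds

    public DateRangeFilter(LocalDate start, LocalDate end) {
        this.start = start;
        this.end = end;
    }

    // Mirrors PathFilter.accept(Path), but on a plain String path.
    public boolean accept(String path) {
        Matcher m = DATE.matcher(path);
        if (!m.matches()) {
            return false; // not a /year/month/day path
        }
        LocalDate d = LocalDate.of(
                Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3)));
        return !d.isBefore(start) && !d.isAfter(end);
    }

    public static void main(String[] args) {
        DateRangeFilter f = new DateRangeFilter(
                LocalDate.of(2007, 12, 31), LocalDate.of(2008, 1, 1));
        System.out.println(f.accept("/2007/12/31")); // true
        System.out.println(f.accept("/2007/12/30")); // false
        System.out.println(f.accept("/2008/01/01")); // true
    }
}
```

A Hadoop version would implement PathFilter and call path.toUri().getPath() to get the string to parse; the date logic is identical.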
Deleting Data
Use the delete() method on FileSystem to permanently remove files or directories:
public boolean delete(Path f, boolean recursive) throws IOException
If f is a file or an empty directory, the value of recursive is ignored. A nonempty directory is deleted, along with its contents, only if recursive is true (otherwise an IOException is thrown).
Data Flow
Anatomy of a File Read
To get an idea of how data flows between the client interacting with HDFS, the namenode, and the datanodes, consider Figure 3-2, which shows the main sequence of events when reading a file.
Figure 3-2. A client reading data from HDFS
The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (step 1 in Figure 3-2). DistributedFileSystem calls the namenode, using RPC, to determine the locations of the blocks for the first few blocks in the file (step 2). For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Furthermore, the datanodes are sorted according to their proximity to the client (according to the topology of the cluster's network; see "Network Topology and Hadoop" on page 71). If the client is itself a datanode (in the case of a MapReduce task, for instance), it will read from the local datanode if it hosts a copy of the block (see also Figure 2-2).
The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.
The client then calls read() on the stream (step 3). DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first (closest) datanode for the first block in the file. Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream (step 4). When the end of the block is reached, DFSInputStream will close the connection to the datanode, then find the best datanode for the next block (step 5). This happens transparently to the client, which from its point of view is just reading a continuous stream.
Blocks are read in order, with the DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream (step 6).
During reading, if the DFSInputStream encounters an error while communicating with a datanode, it will try the next closest one for that block. It will also remember datanodes that have failed so that it doesn't needlessly retry them for later blocks. The DFSInputStream also verifies checksums for the data transferred to it from the datanode. If a corrupted block is found, it is reported to the namenode before the DFSInputStream attempts to read a replica of the block from another datanode.
One important aspect of this design is that the client contacts datanodes directly to retrieve data and is guided by the namenode to the best datanode for each block. This design allows HDFS to scale to a large number of concurrent clients, since the data traffic is spread across all the datanodes in the cluster. Meanwhile, the namenode merely has to service block location requests (which it stores in memory, making them very efficient) and does not, for example, serve data, which would quickly become a bottleneck as the number of clients grew.
Network Topology and Hadoop
What does it mean for two nodes in a local network to be "close" to each other? In the context of high-volume data processing, the limiting factor is the rate at which we can transfer data between nodes: bandwidth is a scarce commodity. The idea is to use the bandwidth between two nodes as a measure of distance.
Rather than measuring bandwidth between nodes, which can be difficult to do in practice (it requires a quiet cluster, and the number of pairs of nodes in a cluster grows as the square of the number of nodes), Hadoop takes a simple approach in which the network is represented as a tree and the distance between two nodes is the sum of their distances to their closest common ancestor. Levels in the tree are not predefined, but it is common to have levels that correspond to the data center, the rack, and the node that a process is running on. The idea is that the bandwidth available for each of the following scenarios becomes progressively less:
• Processes on the same node
• Different nodes on the same rack
• Nodes on different racks in the same data center
• Nodes in different data centers
For example, imagine a node n1 on rack r1 in data center d1. This can be represented as /d1/r1/n1. Using this notation, here are the distances for the four scenarios:
• distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
• distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
• distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)
• distance(/d1/r1/n1, /d2/r3/n1) = 6 (nodes in different data centers)
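The distance calculation just described is easy to state in code. The following is a self-contained sketch (not Hadoop's actual NetworkTopology class): split each /data-center/rack/node path into levels, find the common prefix, and count the remaining steps up to the closest common ancestor on both sides.

```java
public class NetworkDistance {
    // Distance = steps from each node up to the closest common ancestor.
    static int distance(String node1, String node2) {
        String[] a = node1.substring(1).split("/");
        String[] b = node2.substring(1).split("/");
        int common = 0;
        while (common < a.length && common < b.length
                && a[common].equals(b[common])) {
            common++;
        }
        return (a.length - common) + (b.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0: same node
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2: same rack
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4: same data center
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n1")); // 6: different data centers
    }
}
```

The four calls reproduce the four distances listed above, which is also a quick way to check the "sum of distances to the closest common ancestor" definition against your own topology paths.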
7. At the time of this writing, Hadoop is not suited for running across data centers.
This is illustrated schematically in Figure 3-3. (Mathematically inclined readers will notice that this is an example of a distance metric.)
Finally, it is important to realize that Hadoop cannot divine your network topology for you. It needs some help; we'll cover how to configure topology in "Network Topology" on page 297. By default, though, it assumes that the network is flat (a single-level hierarchy) or, in other words, that all nodes are on a single rack in a single data center. For small clusters, this may actually be the case, and no further configuration is necessary.
Figure 3-3. Network distance in Hadoop
Anatomy of a File Write
Next we'll look at how files are written to HDFS. Although quite detailed, it is instructive to understand the data flow, since it clarifies HDFS's coherency model.
The case we're going to consider is the case of creating a new file, writing data to it, then closing the file. See Figure 3-4.
The client creates the file by calling create() on DistributedFileSystem (step 1 in Figure 3-4). DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it (step 2). The namenode performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file. If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException. The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. Just as in the read case, FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and namenode.
As the client writes data (step 3), DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue. The data queue is consumed by the DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. The list of datanodes forms a pipeline; we'll assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline (step 4).
Figure 3-4. A client writing data to HDFS
DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline (step 5).
If a datanode fails while data is being written to it, the following actions are taken, which are transparent to the client writing the data. First, the pipeline is closed, and any packets in the ack queue are added to the front of the data queue so that datanodes that are downstream from the failed node will not miss any packets. The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on. The failed datanode is removed from the pipeline, and the remainder of the block's data is written to the two good datanodes in the pipeline. The namenode notices that the block is under-replicated, and it arranges for a further replica to be created on another node. Subsequent blocks are then treated as normal.
It's possible, but unlikely, that multiple datanodes fail while a block is being written. As long as dfs.replication.min replicas (default one) are written, the write will succeed, and the block will be asynchronously replicated across the cluster until its target replication factor is reached (dfs.replication, which defaults to three).
When the client has finished writing data, it calls close() on the stream (step 6). This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete (step 7). The namenode already knows which blocks the file is made up of (via DataStreamer asking for block allocations), so it only has to wait for blocks to be minimally replicated before returning successfully.
Replica Placement
How does the namenode choose which datanodes to store replicas on? There's a trade-off between reliability and write bandwidth and read bandwidth here. For example, placing all replicas on a single node incurs the lowest write bandwidth penalty, since the replication pipeline runs on a single node, but this offers no real redundancy (if the node fails, the data for that block is lost). Also, the read bandwidth is high for off-rack reads. At the other extreme, placing replicas in different data centers may maximize redundancy, but at the cost of bandwidth. Even in the same data center (which is what all Hadoop clusters to date have run in), there are a variety of placement strategies. Indeed, Hadoop changed its placement strategy in release 0.17.0 to one that helps keep a fairly even distribution of blocks across the cluster. (See "balancer" on page 348 for details on keeping a cluster balanced.) And from 0.21.0, block placement policies are pluggable.
Hadoop's default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack.
Once the replica locations have been chosen, a pipeline is built, taking network topology into account. For a replication factor of 3, the pipeline might look like Figure 3-5.
Overall, this strategy gives a good balance among reliability (blocks are stored on two racks), write bandwidth (writes only have to traverse a single network switch), read performance (there's a choice of two racks to read from), and block distribution across the cluster (clients only write a single block on the local rack).
Coherency Model
A coherency model for a filesystem describes the data visibility of reads and writes for a file. HDFS trades off some POSIX requirements for performance, so some operations may behave differently than you expect them to.
After creating a file, it is visible in the filesystem namespace, as expected:
Path p = new Path("p");
assertThat(fs.exists(p), is(true));
However, any content written to the file is not guaranteed to be visible, even if the stream is flushed. So the file appears to have a length of zero:

Path p = new Path("p");
OutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
assertThat(fs.getFileStatus(p).getLen(), is(0L));
Once more than a block's worth of data has been written, the first block will be visible to new readers. This is true of subsequent blocks, too: it is always the current block being written that is not visible to other readers.
Figure 3-5. A typical replica pipeline
HDFS provides a method for forcing all buffers to be synchronized to the datanodes via the sync() method on FSDataOutputStream. After a successful return from sync(), HDFS guarantees that the data written up to that point in the file is persisted and visible to all new readers:
Path p = new Path("p");
FSDataOutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
out.sync();
assertThat(fs.getFileStatus(p).getLen(), is(((long) "content".length())));
This behavior is similar to the fsync system call in POSIX, which commits buffered data for a file descriptor. For example, using the standard Java API to write a local file, we are guaranteed to see the content after flushing the stream and synchronizing:
FileOutputStream out = new FileOutputStream(localFile);
out.write("content".getBytes("UTF-8"));
out.flush(); // flush to operating system
out.getFD().sync(); // sync to disk
assertThat(localFile.length(), is(((long) "content".length())));
Closing a file in HDFS performs an implicit sync(), too:

Path p = new Path("p");
OutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.close();
assertThat(fs.getFileStatus(p).getLen(), is(((long) "content".length())));
Consequences for application design
This coherency model has implications for the way you design applications. With no calls to sync(), you should be prepared to lose up to a block of data in the event of client or system failure. For many applications, this is unacceptable, so you should call sync() at suitable points, such as after writing a certain number of records or number of bytes. Though the sync() operation is designed to not unduly tax HDFS, it does have some overhead, so there is a trade-off between data robustness and throughput. What is an acceptable trade-off is application-dependent, and suitable values can be selected after measuring your application's performance with different sync() frequencies.
Parallel Copying with distcp
The HDFS access patterns that we have seen so far focus on single-threaded access. It's possible to act on a collection of files (by specifying file globs, for example), but for efficient, parallel processing of these files you would have to write a program yourself.
8. From release 0.21.0, sync() is deprecated in favor of hflush(), which only guarantees that new readers will see all data written to that point, and hsync(), which makes a stronger guarantee that the operating system has flushed the data to disk (like POSIX fsync), although data may still be in the disk cache.
Hadoop comes with a useful program called distcp for copying large amounts of data to and from Hadoop filesystems in parallel.
The canonical use case for distcp is for transferring data between two HDFS clusters. If the clusters are running identical versions of Hadoop, the hdfs scheme is appropriate:
% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar
This will copy the /foo directory (and its contents) from the first cluster to the /bar directory on the second cluster, so the second cluster ends up with the directory structure /bar/foo. If /bar doesn't exist, it will be created first. You can specify multiple source paths, and all will be copied to the destination. Source paths must be absolute.
By default, distcp will skip files that already exist in the destination, but they can be overwritten by supplying the -overwrite option. You can also update only files that have changed using the -update option.
Using either (or both) of -overwrite or -update changes how the source and destination paths are interpreted. This is best shown with an example. If we changed a file in the /foo subtree on the first cluster from the previous example, then we could synchronize the change with the second cluster by running:

% hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/bar/foo

The extra trailing /foo subdirectory is needed on the destination, as now the contents of the source directory are copied to the contents of the destination directory. (If you are familiar with rsync, you can think of the -overwrite or -update options as adding an implicit trailing slash to the source.)
If you are unsure of the effect of a distcp operation, it is a good idea to try it out on a small test directory tree first.
There are more options to control the behavior of distcp, including ones to preserve file attributes, ignore failures, and limit the number of files or total data copied. Run it with no options to see the usage instructions.
distcp is implemented as a MapReduce job where the work of copying is done by the maps that run in parallel across the cluster. There are no reducers. Each file is copied by a single map, and distcp tries to give each map approximately the same amount of data by bucketing files into roughly equal allocations.
The number of maps is decided as follows. Since it's a good idea to get each map to copy a reasonable amount of data to minimize overheads in task setup, each map copies at least 256 MB (unless the total size of the input is less, in which case one map handles it all). For example, 1 GB of files will be given four map tasks. When the data size is very large, it becomes necessary to limit the number of maps in order to limit bandwidth and cluster utilization. By default, the maximum number of maps is 20 per (tasktracker) cluster node. For example, copying 1,000 GB of files to a 100-node cluster will allocate 2,000 maps (20 per node), so each will copy 512 MB on average. This can be reduced by specifying the -m argument to distcp. For example, -m 1000 would allocate 1,000 maps, each copying 1 GB on average.
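The sizing rule in the preceding paragraph can be restated as a small calculation. This is an illustration of the description above, not distcp's actual source code, and numMaps/maxMapsPerNode are names invented here:

```java
public class DistcpMaps {
    static final long MB = 1024L * 1024;

    // Number of maps: enough that each copies about 256 MB, capped at
    // maxMapsPerNode per cluster node, and always at least one map.
    static long numMaps(long totalBytes, int nodes, int maxMapsPerNode) {
        long byBytes = totalBytes / (256 * MB);     // maps at 256 MB each
        long cap = (long) nodes * maxMapsPerNode;   // cluster-wide ceiling
        return Math.max(1, Math.min(byBytes, cap));
    }

    public static void main(String[] args) {
        // 1 GB of input: four maps of 256 MB each.
        System.out.println(numMaps(1024 * MB, 100, 20));         // 4
        // 1,000 GB on a 100-node cluster: capped at 20 x 100 = 2,000 maps,
        // so each copies 512 MB on average.
        System.out.println(numMaps(1000L * 1024 * MB, 100, 20)); // 2000
    }
}
```

Running the two examples reproduces the numbers in the text: four maps for 1 GB, and 2,000 capped maps (512 MB each) for 1,000 GB on 100 nodes.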
If you try to use distcp between two HDFS clusters that are running different versions, the copy will fail if you use the hdfs protocol, since the RPC systems are incompatible. To remedy this, you can use the read-only HTTP-based HFTP filesystem to read from the source. The job must run on the destination cluster so that the HDFS RPC versions are compatible. To repeat the previous example using HFTP:
% hadoop distcp hftp://namenode1:50070/foo hdfs://namenode2/bar
Note that you need to specify the namenode's web port in the source URI. This is determined by the dfs.http.address property, which defaults to 50070.
Using the newer webhdfs protocol (which replaces hftp), it is possible to use HTTP for both the source and destination clusters without hitting any wire incompatibility problems:

% hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
Another variant is to use an HDFS HTTP proxy as the distcp source or destination, which has the advantage of being able to set firewall and bandwidth controls; see "HTTP" on page 55.
Keeping an HDFS Cluster Balanced
When copying data into HDFS, it's important to consider cluster balance. HDFS works best when the file blocks are evenly spread across the cluster, so you want to ensure that distcp doesn't disrupt this. Going back to the 1,000 GB example, by specifying -m 1 a single map would do the copy, which, apart from being slow and not using the cluster resources efficiently, would mean that the first replica of each block would reside on the node running the map (until the disk filled up). The second and third replicas would be spread across the cluster, but this one node would be unbalanced. By having more maps than nodes in the cluster, this problem is avoided; for this reason, it's best to start by running distcp with the default of 20 maps per node.
Howevei, it`s not always possiLle to pievent a clustei liom Lecoming unLalanceu. Pei-
haps you want to limit the numLei ol maps so that some ol the noues can Le useu Ly
othei joLs. In this case, you can use the ba|anccr tool (see ¨Lalancei¨ on page 3+S) to
suLseguently even out the Llock uistiiLution acioss the clustei.
Hadoop Archives
HDFS stores small files inefficiently, since each file is stored in a block, and block
metadata is held in memory by the namenode. Thus, a large number of small files can
eat up a lot of memory on the namenode. (Note, however, that small files do not take
up any more disk space than is required to store the raw contents of the file. For
example, a 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not
128 MB.)
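The memory cost of small files can be sketched with some back-of-the-envelope arithmetic. The figure of roughly 150 bytes of namenode heap per file, directory, or block object is a commonly quoted rule of thumb, not an exact number from this chapter, so treat the constant as an assumption:

```java
// Back-of-the-envelope namenode memory estimate for small files.
// The ~150 bytes per file/directory/block object is an assumed rule of
// thumb, not an exact figure.
public class NamenodeMemoryEstimate {

    static final long BYTES_PER_OBJECT = 150; // assumption: rough rule of thumb

    // Each file costs one file object plus one object per block it occupies.
    static long estimateBytes(long files, long blocksPerFile) {
        return files * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        // One million single-block files: ~300 MB of namenode heap,
        // regardless of how tiny the files themselves are.
        System.out.println(estimateBytes(1_000_000, 1) + " bytes");
    }
}
```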
Hadoop Archives, or HAR files, are a file archiving facility that packs files into HDFS
blocks more efficiently, thereby reducing namenode memory usage while still allowing
transparent access to files. In particular, Hadoop Archives can be used as input to
MapReduce.
Using Hadoop Archives
A Hadoop Archive is created from a collection of files using the archive tool. The tool
runs a MapReduce job to process the input files in parallel, so you need a running
MapReduce cluster to use it. Here are some files in HDFS that we would like
to archive:
% hadoop fs -lsr /my/files
-rw-r--r--   1 tom supergroup  1 2009-04-09 19:13 /my/files/a
drwxr-xr-x   - tom supergroup  0 2009-04-09 19:13 /my/files/dir
-rw-r--r--   1 tom supergroup  1 2009-04-09 19:13 /my/files/dir/b
Now we can run the archive command:
% hadoop archive -archiveName files.har /my/files /my
The first option is the name of the archive, here files.har. HAR files always have
a .har extension, which is mandatory for reasons we shall see later. Next come the files
to put in the archive. Here we are archiving only one source tree, the files in /my/files
in HDFS, but the tool accepts multiple source trees. The final argument is the output
directory for the HAR file. Let's see what the archive has created:
% hadoop fs -ls /my
Found 2 items
drwxr-xr-x   - tom supergroup    0 2009-04-09 19:13 /my/files
drwxr-xr-x   - tom supergroup    0 2009-04-09 19:13 /my/files.har
% hadoop fs -ls /my/files.har
Found 3 items
-rw-r--r--  10 tom supergroup  165 2009-04-09 19:13 /my/files.har/_index
-rw-r--r--  10 tom supergroup   23 2009-04-09 19:13 /my/files.har/_masterindex
-rw-r--r--   1 tom supergroup    2 2009-04-09 19:13 /my/files.har/part-0
The directory listing shows what a HAR file is made of: two index files and a collection
of part files (just one in this example). The part files contain the contents of a number
of the original files concatenated together, and the indexes make it possible to look up
the part file that an archived file is contained in, along with its offset and length. All
these details are hidden from the application, however, which uses the har URI scheme
to interact with HAR files, using a HAR filesystem that is layered on top of the
underlying filesystem (HDFS in this case). The following command recursively lists the
files in the archive:
% hadoop fs -lsr har:///my/files.har
drw-r--r--   - tom supergroup  0 2009-04-09 19:13 /my/files.har/my
drw-r--r--   - tom supergroup  0 2009-04-09 19:13 /my/files.har/my/files
-rw-r--r--  10 tom supergroup  1 2009-04-09 19:13 /my/files.har/my/files/a
drw-r--r--   - tom supergroup  0 2009-04-09 19:13 /my/files.har/my/files/dir
-rw-r--r--  10 tom supergroup  1 2009-04-09 19:13 /my/files.har/my/files/dir/b
This is quite straightforward if the filesystem that the HAR file is on is the default
filesystem. On the other hand, if you want to refer to a HAR file on a different
filesystem, you need to use a different form of the path URI from normal. These two
commands have the same effect, for example:
% hadoop fs -lsr har:///my/files.har/my/files/dir
% hadoop fs -lsr har://hdfs-localhost:8020/my/files.har/my/files/dir
Notice in the second form that the scheme is still har to signify a HAR filesystem, but
the authority is hdfs to specify the underlying filesystem's scheme, followed by a dash
and the HDFS host (localhost) and port (8020). We can now see why HAR files have
to have a .har extension. The HAR filesystem translates the har URI into a URI for the
underlying filesystem by looking at the authority and path up to and including the
component with the .har extension. In this case, it is hdfs://localhost:8020/my/files
.har. The remaining part of the path is the path of the file in the archive: /my/files/dir.
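The URI translation just described can be sketched as plain string manipulation. This is an illustrative toy, not the actual HarFileSystem implementation, and the helper names are made up:

```java
// A sketch of the har URI translation described above: the authority
// "hdfs-localhost:8020" encodes the underlying scheme, host, and port, and
// the path is split at the component ending in ".har". Hypothetical helper
// names; not the real HarFileSystem code.
public class HarUriSketch {

    // e.g. har://hdfs-localhost:8020/my/files.har/my/files/dir
    //  ->  hdfs://localhost:8020/my/files.har
    static String underlyingUri(String harUri) {
        String rest = harUri.substring("har://".length());
        int slash = rest.indexOf('/');
        String authority = rest.substring(0, slash);     // hdfs-localhost:8020
        String path = rest.substring(slash);             // /my/files.har/my/files/dir
        int dash = authority.indexOf('-');
        String scheme = authority.substring(0, dash);    // hdfs
        String hostPort = authority.substring(dash + 1); // localhost:8020
        int harEnd = path.indexOf(".har") + ".har".length();
        return scheme + "://" + hostPort + path.substring(0, harEnd);
    }

    // The remainder of the path is the path of the file within the archive.
    static String pathInArchive(String harUri) {
        String rest = harUri.substring("har://".length());
        String path = rest.substring(rest.indexOf('/'));
        return path.substring(path.indexOf(".har") + ".har".length());
    }

    public static void main(String[] args) {
        String uri = "har://hdfs-localhost:8020/my/files.har/my/files/dir";
        System.out.println(underlyingUri(uri)); // hdfs://localhost:8020/my/files.har
        System.out.println(pathInArchive(uri)); // /my/files/dir
    }
}
```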
To delete a HAR file, you need to use the recursive form of delete, since from the
underlying filesystem's point of view the HAR file is a directory:
% hadoop fs -rmr /my/files.har
There are a few limitations to be aware of with HAR files. Creating an archive creates
a copy of the original files, so you need as much disk space as the files you are archiving
to create the archive (although you can delete the originals once you have created the
archive). There is currently no support for archive compression, although the files that
go into the archive can be compressed (HAR files are like tar files in this respect).
Archives are immutable once they have been created. To add or remove files, you must
re-create the archive. In practice, this is not a problem for files that don't change after
being written, since they can be archived in batches on a regular basis, such as daily or
weekly.
As noted earlier, HAR files can be used as input to MapReduce. However, there is no
archive-aware InputFormat that can pack multiple files into a single MapReduce split,
so processing lots of small files, even in a HAR file, can still be inefficient. "Small files
and CombineFileInputFormat" on page 237 discusses another approach to this
problem.
Finally, if you are hitting namenode memory limits even after taking steps to minimize
the number of small files in the system, consider using HDFS Federation to scale
the namespace ("HDFS Federation" on page 49).
Hadoop I/O
Hadoop comes with a set of primitives for data I/O. Some of these are techniques that
are more general than Hadoop, such as data integrity and compression, but deserve
special consideration when dealing with multiterabyte datasets. Others are Hadoop
tools or APIs that form the building blocks for developing distributed systems, such as
serialization frameworks and on-disk data structures.
Data Integrity
Users of Hadoop rightly expect that no data will be lost or corrupted during storage or
processing. However, every I/O operation on the disk or network carries with it
a small chance of introducing errors into the data that it is reading or writing, and when
the volumes of data flowing through the system are as large as the ones Hadoop is
capable of handling, the chance of data corruption occurring is high.
The usual way of detecting corrupted data is by computing a checksum for the data
when it first enters the system, and again whenever it is transmitted across a channel
that is unreliable and hence capable of corrupting the data. The data is deemed to be
corrupt if the newly generated checksum doesn't exactly match the original. This
technique doesn't offer any way to fix the data; it provides error detection only. (And
this is a reason for not using low-end hardware; in particular, be sure to use ECC
memory.) Note that it is possible that it's the checksum that is corrupt, not the data,
but this is very unlikely, because the checksum is much smaller than the data.
A commonly used error-detecting code is CRC-32 (cyclic redundancy check), which
computes a 32-bit integer checksum for input of any size.
Data Integrity in HDFS
HDFS transparently checksums all data written to it and by default verifies checksums
when reading data. A separate checksum is created for every io.bytes.per.checksum
bytes of data. The default is 512 bytes, and because a CRC-32 checksum is 4 bytes long,
the storage overhead is less than 1%.
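The per-chunk scheme and its overhead can be illustrated with the JDK's own CRC-32 implementation. This is a sketch of the idea, not HDFS's actual checksum code:

```java
import java.util.zip.CRC32;

// One CRC-32 value (4 bytes) for every 512-byte chunk of data, as described
// above, giving a storage overhead of 4/512, i.e. under 1%. Illustrative
// sketch only; HDFS's own implementation differs in detail.
public class ChunkChecksums {

    static final int BYTES_PER_CHECKSUM = 512; // the io.bytes.per.checksum default

    static long[] checksums(byte[] data) {
        int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            CRC32 crc = new CRC32();
            int off = i * BYTES_PER_CHECKSUM;
            crc.update(data, off, Math.min(BYTES_PER_CHECKSUM, data.length - off));
            sums[i] = crc.getValue();
        }
        return sums;
    }

    public static void main(String[] args) {
        byte[] data = new byte[2048];          // four 512-byte chunks
        System.out.println(checksums(data).length + " checksums");
        // Overhead: 4 bytes of checksum per 512 bytes of data.
        System.out.printf("overhead: %.2f%%%n", 100.0 * 4 / BYTES_PER_CHECKSUM);
    }
}
```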
Datanodes are responsible for verifying the data they receive before storing the data
and its checksum. This applies to data that they receive from clients and from other
datanodes during replication. A client writing data sends it to a pipeline of datanodes
(as explained in Chapter 3), and the last datanode in the pipeline verifies the checksum.
If it detects an error, the client receives a ChecksumException, a subclass of IOExcep
tion, which it should handle in an application-specific manner, by retrying the
operation, for example.
When clients read data from datanodes, they verify checksums as well, comparing them
with the ones stored at the datanode. Each datanode keeps a persistent log of checksum
verifications, so it knows the last time each of its blocks was verified. When a client
successfully verifies a block, it tells the datanode, which updates its log. Keeping
statistics such as these is valuable in detecting bad disks.
Aside from block verification on client reads, each datanode runs a DataBlockScanner
in a background thread that periodically verifies all the blocks stored on the datanode.
This is to guard against corruption due to "bit rot" in the physical storage media. See
"Datanode block scanner" on page 347 for details on how to access the scanner
reports.
Since HDFS stores replicas of blocks, it can "heal" corrupted blocks by copying one of
the good replicas to produce a new, uncorrupt replica. The way this works is that if a
client detects an error when reading a block, it reports the bad block and the datanode
it was trying to read from to the namenode before throwing a ChecksumException. The
namenode marks the block replica as corrupt, so it doesn't direct clients to it or try to
copy this replica to another datanode. It then schedules a copy of the block to be
replicated on another datanode, so its replication factor is back at the expected level.
Once this has happened, the corrupt replica is deleted.
It is possible to disable verification of checksums by passing false to the setVerify
Checksum() method on FileSystem before using the open() method to read a file. The
same effect is possible from the shell by using the -ignoreCrc option with the -get or
the equivalent -copyToLocal command. This feature is useful if you have a corrupt file
that you want to inspect so you can decide what to do with it. For example, you might
want to see whether it can be salvaged before you delete it.
LocalFileSystem
The Hadoop LocalFileSystem performs client-side checksumming. This means that
when you write a file called filename, the filesystem client transparently creates a hidden
file, .filename.crc, in the same directory containing the checksums for each chunk of
the file. As in HDFS, the chunk size is controlled by the io.bytes.per.checksum property,
which defaults to 512 bytes. The chunk size is stored as metadata in the .crc file, so the
file can be read back correctly even if the setting for the chunk size has changed.
Checksums are verified when the file is read, and if an error is detected,
LocalFileSystem throws a ChecksumException.
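The hidden-sidecar idea can be shown in miniature with plain JDK file I/O. This toy stores a single CRC-32 per file rather than one per 512-byte chunk, and the method names are invented, so treat it as an illustration of the idiom only:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

// A toy version of the sidecar-checksum idea: writing "name" also writes a
// hidden ".name.crc" holding a CRC-32 of the contents, and reads verify it.
// The real LocalFileSystem checksums per 512-byte chunk and stores the chunk
// size as metadata; this single-checksum sketch just shows the idiom.
public class SidecarChecksum {

    static long crc(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    static void write(Path dir, String name, byte[] data) throws IOException {
        Files.write(dir.resolve(name), data);
        Files.write(dir.resolve("." + name + ".crc"),
                Long.toString(crc(data)).getBytes());
    }

    static byte[] read(Path dir, String name) throws IOException {
        byte[] data = Files.readAllBytes(dir.resolve(name));
        long stored = Long.parseLong(
                new String(Files.readAllBytes(dir.resolve("." + name + ".crc"))));
        if (crc(data) != stored) {
            throw new IOException("checksum error in " + name); // cf. ChecksumException
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("sidecar");
        write(dir, "data.txt", "hello".getBytes());
        System.out.println(new String(read(dir, "data.txt"))); // hello
    }
}
```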
Checksums are fairly cheap to compute (in Java, they are implemented in native code),
typically adding a few percent overhead to the time to read or write a file. For most
applications, this is an acceptable price to pay for data integrity. It is, however, possible
to disable checksums, typically when the underlying filesystem supports checksums
natively. This is accomplished by using RawLocalFileSystem in place of Local
FileSystem. To do this globally in an application, it suffices to remap the implementa-
tion for file URIs by setting the property fs.file.impl to the value
org.apache.hadoop.fs.RawLocalFileSystem. Alternatively, you can directly create a Raw
LocalFileSystem instance, which may be useful if you want to disable checksum
verification for only some reads; for example:
Configuration conf = ...
FileSystem fs = new RawLocalFileSystem();
fs.initialize(null, conf);
ChecksumFileSystem
LocalFileSystem uses ChecksumFileSystem to do its work, and this class makes it easy
to add checksumming to other (nonchecksummed) filesystems, as Checksum
FileSystem is just a wrapper around FileSystem. The general idiom is as follows:
FileSystem rawFs = ...
FileSystem checksummedFs = new ChecksumFileSystem(rawFs);
The underlying filesystem is called the raw filesystem, and may be retrieved using the
getRawFileSystem() method on ChecksumFileSystem. ChecksumFileSystem has a few
more useful methods for working with checksums, such as getChecksumFile() for
getting the path of a checksum file for any file. Check the documentation for the others.
If an error is detected by ChecksumFileSystem when reading a file, it will call its
reportChecksumFailure() method. The default implementation does nothing, but
LocalFileSystem moves the offending file and its checksum to a side directory on the
same device called bad_files. Administrators should periodically check for these bad
files and take action on them.
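The wrapper idiom that ChecksumFileSystem uses is the decorator pattern, and the JDK ships a miniature version of it: CheckedOutputStream and CheckedInputStream wrap a "raw" stream and maintain a checksum as bytes pass through, without the raw stream knowing. This is an analogy to the Hadoop class, not Hadoop code:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;
import java.util.zip.CheckedOutputStream;

// The decorator pattern behind ChecksumFileSystem, shown with the JDK's
// CheckedOutputStream/CheckedInputStream: a checksumming wrapper around a
// raw stream. Analogy only; not the Hadoop implementation.
public class DecoratorChecksum {

    // Write payload through a checksumming decorator, then read it back
    // through another; returns true if both observed the same CRC-32.
    static boolean checksumsMatch(byte[] payload) throws IOException {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        CheckedOutputStream out = new CheckedOutputStream(raw, new CRC32());
        out.write(payload);
        long written = out.getChecksum().getValue();

        CheckedInputStream in = new CheckedInputStream(
                new ByteArrayInputStream(raw.toByteArray()), new CRC32());
        while (in.read() != -1) { } // drain; the checksum updates as we read
        return in.getChecksum().getValue() == written;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(checksumsMatch("some file contents".getBytes())); // true
    }
}
```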
Compression
File compression brings two major benefits: it reduces the space needed to store files,
and it speeds up data transfer across the network or to or from disk. When dealing
with large volumes of data, both of these savings can be significant, so it pays to
carefully consider how to use compression in Hadoop.
There are many different compression formats, tools, and algorithms, each with
different characteristics. Table 4-1 lists some of the more common ones that can be used
with Hadoop.
Table 4-1. A summary of compression formats

Compression format  Tool   Algorithm  Filename extension  Splittable
DEFLATE             N/A    DEFLATE    .deflate            No
gzip                gzip   DEFLATE    .gz                 No
bzip2               bzip2  bzip2      .bz2                Yes
LZO                 lzop   LZO        .lzo                No
Snappy              N/A    Snappy     .snappy             No

DEFLATE is a compression algorithm whose standard implementation is zlib. There is no commonly available command-line tool for
producing files in DEFLATE format, as gzip is normally used. (Note that the gzip file format is DEFLATE with extra headers and a footer.)
The .deflate filename extension is a Hadoop convention.
However, LZO files are splittable if they have been indexed in a preprocessing step. See page 91.
All compression algorithms exhibit a space/time trade-off: faster compression and
decompression speeds usually come at the expense of smaller space savings. The tools
listed in Table 4-1 typically give some control over this trade-off at compression time
by offering nine different options: -1 means optimize for speed and -9 means optimize
for space. For example, the following command creates a compressed file file.gz using
the fastest compression method:
% gzip -1 file
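The same nine levels are exposed by the JDK's Deflater, which implements the DEFLATE algorithm that gzip uses, so the trade-off can be measured directly. A sketch, comparing output sizes for level 1 (BEST_SPEED) and level 9 (BEST_COMPRESSION) on repetitive input, where level 9 is expected to do at least as well:

```java
import java.util.zip.Deflater;

// Measuring the space side of the space/time trade-off with the JDK's
// Deflater, which exposes the same DEFLATE levels 1-9 as gzip.
public class CompressionLevels {

    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length + 64]; // large enough for this demo
        int size = 0;
        while (!deflater.finished()) {
            size += deflater.deflate(buf);
        }
        deflater.end();
        return size;
    }

    public static void main(String[] args) {
        byte[] input = new byte[100_000];
        for (int i = 0; i < input.length; i++) {
            input[i] = (byte) "hadoop".charAt(i % 6); // highly repetitive input
        }
        int fast = compressedSize(input, Deflater.BEST_SPEED);        // -1
        int small = compressedSize(input, Deflater.BEST_COMPRESSION); // -9
        System.out.println("level 1: " + fast + " bytes, level 9: " + small + " bytes");
    }
}
```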
The different tools have very different compression characteristics. Gzip is a general-
purpose compressor and sits in the middle of the space/time trade-off. Bzip2
compresses more effectively than gzip, but is slower. Bzip2's decompression speed is
faster than its compression speed, but it is still slower than the other formats. LZO and
Snappy, on the other hand, both optimize for speed and are around an order of
magnitude faster than gzip, but they compress less effectively. Snappy is also
significantly faster than LZO for decompression.
The "Splittable" column in Table 4-1 indicates whether the compression format
supports splitting; that is, whether you can seek to any point in the stream and start
reading from some point further on. Splittable compression formats are especially
suitable for MapReduce; see "Compression and Input Splits" on page 91 for further
discussion.
1. For a comprehensive set of compression benchmarks, https://github.com/ning/jvm-compressor
-benchmark is a good reference for JVM-compatible libraries (it includes some native libraries). For
command-line tools, see Jeff Gilchrist's Archive Comparison Test at http://compression.ca/act/act
Codecs
A codec is the implementation of a compression-decompression algorithm. In Hadoop,
a codec is represented by an implementation of the CompressionCodec interface. So, for
example, GzipCodec encapsulates the compression and decompression algorithm for
gzip. Table 4-2 lists the codecs that are available for Hadoop.
Table 4-2. Hadoop compression codecs

Compression format  Hadoop CompressionCodec
DEFLATE             org.apache.hadoop.io.compress.DefaultCodec
gzip                org.apache.hadoop.io.compress.GzipCodec
bzip2               org.apache.hadoop.io.compress.BZip2Codec
LZO                 com.hadoop.compression.lzo.LzopCodec
Snappy              org.apache.hadoop.io.compress.SnappyCodec
The LZO libraries are GPL-licensed and may not be included in Apache distributions,
so for this reason the Hadoop codecs must be downloaded separately from http://code
.google.com/p/hadoop-gpl-compression/ (or http://github.com/kevinweil/hadoop-lzo,
which includes bugfixes and more tools). The LzopCodec is compatible with the lzop
tool, which is essentially the LZO format with extra headers, and is the one you
normally want. There is also a LzoCodec for the pure LZO format, which uses the .lzo_de-
flate filename extension (by analogy with DEFLATE, which is gzip without the
headers).
Compressing and decompressing streams with CompressionCodec
CompressionCodec has two methods that allow you to easily compress or decompress
data. To compress data being written to an output stream, use the createOutput
Stream(OutputStream out) method to create a CompressionOutputStream to which you
write your uncompressed data to have it written in compressed form to the underlying
stream. Conversely, to decompress data being read from an input stream, call
createInputStream(InputStream in) to obtain a CompressionInputStream, which allows
you to read uncompressed data from the underlying stream.
CompressionOutputStream and CompressionInputStream are similar to
java.util.zip.DeflaterOutputStream and java.util.zip.DeflaterInputStream, except
that both of the former provide the ability to reset their underlying compressor or
decompressor, which is important for applications that compress sections of the data
stream as separate blocks, such as SequenceFile, described in "Sequence-
File" on page 132.
Example 4-1 illustrates how to use the API to compress data read from standard input
and write it to standard output.
Example 4-1. A program to compress data read from standard input and write it to standard output
public class StreamCompressor {

  public static void main(String[] args) throws Exception {
    String codecClassname = args[0];
    Class<?> codecClass = Class.forName(codecClassname);
    Configuration conf = new Configuration();
    CompressionCodec codec = (CompressionCodec)
      ReflectionUtils.newInstance(codecClass, conf);

    CompressionOutputStream out = codec.createOutputStream(System.out);
    IOUtils.copyBytes(System.in, out, 4096, false);
    out.finish();
  }
}
The application expects the fully qualified name of the CompressionCodec implementa-
tion as the first command-line argument. We use ReflectionUtils to construct a new
instance of the codec, then obtain a compression wrapper around System.out. Then we
call the utility method copyBytes() on IOUtils to copy the input to the output, which
is compressed by the CompressionOutputStream. Finally, we call finish() on
CompressionOutputStream, which tells the compressor to finish writing to the com-
pressed stream, but doesn't close the stream. We can try it out with the following
command line, which compresses the string "Text" using the StreamCompressor pro-
gram with the GzipCodec, then decompresses it from standard input using gunzip:
% echo "Text" | hadoop StreamCompressor org.apache.hadoop.io.compress.GzipCodec \
  | gunzip -
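The round trip above has a plain-JDK analogue that runs without a Hadoop cluster: GZIPOutputStream plays the role of the CompressionOutputStream created by GzipCodec, and GZIPInputStream that of the CompressionInputStream. (The JDK streams lack the reset flexibility the Hadoop wrappers add.)

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// A plain-JDK analogue of the codec stream API: gzip-compress a byte array,
// then decompress it back, mirroring StreamCompressor piped into gunzip.
public class GzipRoundTrip {

    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(data); // closing the stream finishes the gzip trailer
        }
        return bytes.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) throws IOException {
        try (GZIPInputStream in =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                bytes.write(buf, 0, n);
            }
            return bytes.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(new String(gunzip(gzip("Text".getBytes())))); // Text
    }
}
```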
Inferring CompressionCodecs using CompressionCodecFactory
If you are reading a compressed file, you can normally infer the codec to use by looking
at its filename extension. A file ending in .gz can be read with GzipCodec, and so on.
The extension for each compression format is listed in Table 4-1.
CompressionCodecFactory provides a way of mapping a filename extension to a
CompressionCodec using its getCodec() method, which takes a Path object for the file in
question. Example 4-2 shows an application that uses this feature to decompress files.
Example 4-2. A program to decompress a compressed file using a codec inferred from the file's
extension
public class FileDecompressor {

  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    Path inputPath = new Path(uri);
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(inputPath);
    if (codec == null) {
      System.err.println("No codec found for " + uri);
      System.exit(1);
    }

    String outputUri =
      CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

    InputStream in = null;
    OutputStream out = null;
    try {
      in = codec.createInputStream(fs.open(inputPath));
      out = fs.create(new Path(outputUri));
      IOUtils.copyBytes(in, out, conf);
    } finally {
      IOUtils.closeStream(in);
      IOUtils.closeStream(out);
    }
  }
}
Once the codec has been found, it is used to strip off the file suffix to form the output
filename (via the removeSuffix() static method of CompressionCodecFactory). In this
way, a file named file.gz is decompressed to file by invoking the program as follows:
% hadoop FileDecompressor file.gz
CompressionCodecFactory finds codecs from a list defined by the
io.compression.codecs configuration property. By default, this lists all the codecs
provided by Hadoop (see Table 4-3), so you would need to alter it only if you have a
custom codec that you wish to register (such as the externally hosted LZO codecs). Each
codec knows its default filename extension, thus permitting CompressionCodecFactory to
search through the registered codecs to find a match for a given extension (if any).
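The extension-matching logic just described amounts to a suffix lookup over the registered codecs. This sketch uses the class names from Table 4-2 but is a simplified stand-in for what CompressionCodecFactory actually does:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A simplified stand-in for CompressionCodecFactory: map each codec's
// default filename extension to the codec, match on the file's suffix, and
// strip the suffix (removeSuffix) to form the output name.
public class CodecLookupSketch {

    static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
        CODECS.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODECS.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
    }

    static String getCodec(String path) {
        for (Map.Entry<String, String> e : CODECS.entrySet()) {
            if (path.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null; // no codec found for this extension
    }

    static String removeSuffix(String path, String suffix) {
        return path.substring(0, path.length() - suffix.length());
    }

    public static void main(String[] args) {
        System.out.println(getCodec("file.gz"));            // ...GzipCodec
        System.out.println(removeSuffix("file.gz", ".gz")); // file
    }
}
```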
Table 4-3. Compression codec properties

Property name          Type                         Default value                    Description
io.compression.codecs  comma-separated Class names  (the codecs provided by Hadoop)  A list of the CompressionCodec classes for compression/decompression.
Native libraries
For performance, it is preferable to use a native library for compression and
decompression. For example, in one test, using the native gzip libraries reduced de-
compression times by up to 50% and compression times by around 10% (compared to
the built-in Java implementation). Table 4-4 shows the availability of Java and native
implementations for each compression format. Not all formats have native implemen-
tations (bzip2, for example), whereas others are available only as a native implemen-
tation (LZO, for example).
Table 4-4. Compression library implementations

Compression format  Java implementation  Native implementation
gzip                Yes                  Yes
bzip2               Yes                  No
LZO                 No                   Yes
Hadoop comes with prebuilt native compression libraries for 32- and 64-bit Linux,
which you can find in the lib/native directory. For other platforms, you will need to
compile the libraries yourself, following the instructions on the Hadoop wiki.
The native libraries are picked up using the Java system property java.library.path.
The hadoop script in the bin directory sets this property for you, but if you don't use
this script, you will need to set the property in your application.
By default, Hadoop looks for native libraries for the platform it is running on and loads
them automatically if they are found. This means you don't have to change any con-
figuration settings to use the native libraries. In some circumstances, however, you may
wish to disable use of native libraries, such as when you are debugging a compression-
related problem. You can achieve this by setting the property hadoop.native.lib to
false, which ensures that the built-in Java equivalents will be used (if they are available).
CodecPool
If you are using a native library and you are doing a lot of compression or
decompression in your application, consider using CodecPool, which allows you to re-
use compressors and decompressors, thereby amortizing the cost of creating these
objects.
The code in Example 4-3 shows the API, although in this program, which creates only
a single Compressor, there is really no need to use a pool.
Example 4-3. A program to compress data read from standard input and write it to standard output
using a pooled compressor
public class PooledStreamCompressor {

  public static void main(String[] args) throws Exception {
    String codecClassname = args[0];
    Class<?> codecClass = Class.forName(codecClassname);
    Configuration conf = new Configuration();
    CompressionCodec codec = (CompressionCodec)
      ReflectionUtils.newInstance(codecClass, conf);
    Compressor compressor = null;
    try {
      compressor = CodecPool.getCompressor(codec);
      CompressionOutputStream out =
        codec.createOutputStream(System.out, compressor);
      IOUtils.copyBytes(System.in, out, 4096, false);
      out.finish();
    } finally {
      CodecPool.returnCompressor(compressor);
    }
  }
}
We retrieve a Compressor instance from the pool for a given CompressionCodec, which
we use in the codec's overloaded createOutputStream() method. By using a finally
block, we ensure that the compressor is returned to the pool even if there is an
IOException while copying the bytes between the streams.
Compression and Input Splits
When considering how to compress data that will be processed by MapReduce, it is
important to understand whether the compression format supports splitting. Consider
an uncompressed file stored in HDFS whose size is 1 GB. With an HDFS block size of
64 MB, the file will be stored as 16 blocks, and a MapReduce job using this file as input
will create 16 input splits, each processed independently as input to a separate map task.
Imagine now that the file is a gzip-compressed file whose compressed size is 1 GB. As
before, HDFS will store the file as 16 blocks. However, creating a split for each block
won't work, since it is impossible to start reading at an arbitrary point in the gzip
stream, and therefore impossible for a map task to read its split independently of the
others. The gzip format uses DEFLATE to store the compressed data, and DEFLATE
stores data as a series of compressed blocks. The problem is that the start of each block
is not distinguished in any way that would allow a reader positioned at an arbitrary
point in the stream to advance to the beginning of the next block, thereby synchronizing
itself with the stream. For this reason, gzip does not support splitting.
In this case, MapReduce will do the right thing and not try to split the gzipped file,
since it knows that the input is gzip-compressed (by looking at the filename extension)
and that gzip does not support splitting. This will work, but at the expense of locality:
a single map will process the 16 HDFS blocks, most of which will not be local to the
map. Also, with fewer maps, the job is less granular and so may take longer to run.
If the file in our hypothetical example were an LZO file, we would have the same
problem, since the underlying compression format does not provide a way for a reader
to synchronize itself with the stream. However, it is possible to preprocess LZO files
using an indexer tool that comes with the Hadoop LZO libraries, which you can obtain
from the site listed in "Codecs" on page 87. The tool builds an index of split points,
effectively making the files splittable when the appropriate MapReduce input format is
used.
A bzip2 file, on the other hand, does provide a synchronization marker between blocks
(a 48-bit approximation of pi), so it does support splitting. (Table 4-1 lists whether
each compression format supports splitting.)
Which Compression Format Should I Use?
Which compression format you should use depends on your application. Do you want
to maximize the speed of your application, or are you more concerned about keeping
storage costs down? In general, you should try different strategies for your application,
and benchmark them with representative datasets to find the best approach.
For large, unbounded files, like logfiles, the options are:
• Store the files uncompressed.
• Use a compression format that supports splitting, like bzip2 (although bzip2 is
fairly slow), or one that can be indexed to support splitting, like LZO.
• Split the file into chunks in the application, and compress each chunk separately
using any supported compression format (it doesn't matter whether it is splittable).
In this case, you should choose the chunk size so that the compressed chunks are
approximately the size of an HDFS block.
• Use SequenceFile, which supports compression and splitting. See "Sequence-
File" on page 132.
• Use an Avro data file, which supports compression and splitting, just like Sequence-
File, but has the added advantage of being readable and writable from many
languages, not just Java. See "Avro data files" on page 119.
For large files, you should not use a compression format that does not support splitting
on the whole file, because you lose locality and make MapReduce applications very
inefficient.
For archival purposes, consider the Hadoop archive format (see "Hadoop Ar-
chives" on page 78), although it does not support compression.
Using Compression in MapReduce
As described in "Inferring CompressionCodecs using CompressionCodecFac-
tory" on page 88, if your input files are compressed, they will be automatically
decompressed as they are read by MapReduce, using the filename extension to deter-
mine the codec to use.
To compress the output of a MapReduce job, in the job configuration set the
mapred.output.compress property to true and the mapred.output.compression.codec
property to the classname of the compression codec you want to use. Alternatively, you
can use the static convenience methods on FileOutputFormat to set these properties, as
shown in Example 4-4.
Example 4-4. Application to run the maximum temperature job producing compressed output
public class MaxTemperatureWithCompression {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperatureWithCompression <input path> " +
          "<output path>");
      System.exit(-1);
    }

    Job job = new Job();
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

    // elided: the mapper, reducer, and output key/value class setup is the
    // same as in the original maximum temperature application

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
We run the program over compressed input (which doesn't have to use the same com-
pression format as the output, although it does in this example) as follows:
% hadoop MaxTemperatureWithCompression input/ncdc/sample.txt.gz output
Each part of the final output is compressed; in this case, there is a single part:
% gunzip -c output/part-r-00000.gz
1949	111
1950	22
If you are emitting sequence files for your output, you can set the mapred.out
put.compression.type property to control the type of compression to use. The default
is RECORD, which compresses individual records. Changing this to BLOCK, which
compresses groups of records, is recommended, since it compresses better (see "The
SequenceFile format" on page 138).
There is also a static convenience method on SequenceFileOutputFormat called setOut
putCompressionType() to set this property.
The configuration properties to set compression for MapReduce job outputs are sum-
marized in Table 4-5. If your MapReduce driver uses the Tool interface (described in
"GenericOptionsParser, Tool, and ToolRunner" on page 151), you can pass any
of these properties to the program on the command line, which may be more convenient
than modifying your program to hardcode the compression to use.
Table 4-5. MapReduce compression properties

Property name                    Type        Default value                               Description
mapred.output.compress           boolean     false                                       Compress outputs.
mapred.output.compression.codec  Class name  org.apache.hadoop.io.compress.DefaultCodec  The compression codec to use for outputs.
mapred.output.compression.type   String      RECORD                                      The type of compression to use for SequenceFile outputs: NONE, RECORD, or BLOCK.
Compressing map output
Even if your MapReduce application reads and writes uncompressed data, it may benefit from compressing the intermediate output of the map phase. Since the map output is written to disk and transferred across the network to the reducer nodes, by using a fast compressor such as LZO or Snappy, you can get performance gains simply because the volume of data to transfer is reduced. The configuration properties to enable compression for map outputs and to set the compression format are shown in Table 4-6.
Table 4-6. Map output compression properties

Property name                        Type     Default value                               Description
mapred.compress.map.output           boolean  false                                       Compress map outputs.
mapred.map.output.compression.codec  Class    org.apache.hadoop.io.compress.DefaultCodec  The compression codec to use for map outputs.
Here are the lines to add to enable gzip map output compression in your job:
Configuration conf = new Configuration();
conf.setBoolean("mapred.compress.map.output", true);
conf.setClass("mapred.map.output.compression.codec", GzipCodec.class,
    CompressionCodec.class);
Job job = new Job(conf);
Serialization

Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage. Deserialization is the reverse process of turning a byte stream back into a series of structured objects.

Serialization appears in two quite distinct areas of distributed data processing: for interprocess communication and for persistent storage.
In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs). The RPC protocol uses serialization to render the message into a binary stream to be sent to the remote node, which then deserializes the binary stream into the original message. In general, it is desirable that an RPC serialization format is:

Compact
    A compact format makes the best use of network bandwidth, which is the most scarce resource in a data center.
Fast
    Interprocess communication forms the backbone for a distributed system, so it is essential that there is as little performance overhead as possible for the serialization and deserialization process.
Extensible
    Protocols change over time to meet new requirements, so it should be straightforward to evolve the protocol in a controlled manner for clients and servers. For example, it should be possible to add a new argument to a method call, and have the new servers accept messages in the old format (without the new argument) from old clients.
Interoperable
    For some systems, it is desirable to be able to support clients that are written in different languages to the server, so the format needs to be designed to make this possible.
On the face of it, the data format chosen for persistent storage would have different requirements from a serialization framework. After all, the lifespan of an RPC is less than a second, whereas persistent data may be read years after it was written. As it turns out, the four desirable properties of an RPC's serialization format are also crucial for a persistent storage format. We want the storage format to be compact (to make efficient use of storage space), fast (so the overhead in reading or writing terabytes of data is minimal), extensible (so we can transparently read data written in an older format), and interoperable (so we can read or write persistent data using different languages).
Hadoop uses its own serialization format, Writables, which is certainly compact and fast, but not so easy to extend or use from languages other than Java. Since Writables are central to Hadoop (most MapReduce programs use them for their key and value types), we look at them in some depth in the next three sections, before looking at serialization frameworks in general, and then Avro (a serialization system that was designed to overcome some of the limitations of Writables) in more detail.
The Writable Interface
The Writable interface defines two methods: one for writing its state to a DataOutput binary stream, and one for reading its state from a DataInput binary stream:
package org.apache.hadoop.io;

import java.io.DataOutput;
import java.io.DataInput;
import java.io.IOException;
public interface Writable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}
Let's look at a particular Writable to see what we can do with it. We will use IntWritable, a wrapper for a Java int. We can create one and set its value using the set() method:

IntWritable writable = new IntWritable();
writable.set(163);

Equivalently, we can use the constructor that takes the integer value:

IntWritable writable = new IntWritable(163);
To examine the serialized form of the IntWritable, we write a small helper method that wraps a java.io.ByteArrayOutputStream in a java.io.DataOutputStream (an implementation of java.io.DataOutput) to capture the bytes in the serialized stream:
public static byte[] serialize(Writable writable) throws IOException {
  ByteArrayOutputStream out = new ByteArrayOutputStream();
  DataOutputStream dataOut = new DataOutputStream(out);
  writable.write(dataOut);
  dataOut.close();
  return out.toByteArray();
}
An integer is written using four bytes (as we see using JUnit 4 assertions):
byte[] bytes = serialize(writable);
assertThat(bytes.length, is(4));
The bytes are written in big-endian order (so the most significant byte is written to the stream first; this is dictated by the java.io.DataOutput interface), and we can see their hexadecimal representation by using a method on Hadoop's StringUtils:
assertThat(StringUtils.byteToHexString(bytes), is("000000a3"));
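To see where those four bytes come from without a Hadoop installation, the following stdlib-only sketch reproduces the same wire format: IntWritable's write() simply delegates to DataOutput.writeInt(), so a plain DataOutputStream produces the identical big-endian bytes. The class name and toHex() helper are ours (a stand-in for Hadoop's StringUtils.byteToHexString()), not part of any API.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class BigEndianIntDemo {

    // Render a byte array as lowercase hex, like Hadoop's
    // StringUtils.byteToHexString() does.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        dataOut.writeInt(163); // same wire format as IntWritable.write()
        dataOut.close();
        System.out.println(toHex(out.toByteArray())); // prints 000000a3
    }
}
```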
Let's try deserialization. Again, we create a helper method to read a Writable object from a byte array:
public static byte[] deserialize(Writable writable, byte[] bytes)
    throws IOException {
  ByteArrayInputStream in = new ByteArrayInputStream(bytes);
  DataInputStream dataIn = new DataInputStream(in);
  writable.readFields(dataIn);
  dataIn.close();
  return bytes;
}
We construct a new, value-less IntWritable, then call deserialize() to read from the output data that we just wrote. Then we check that its value, retrieved using the get() method, is the original value, 163:
IntWritable newWritable = new IntWritable();
deserialize(newWritable, bytes);
assertThat(newWritable.get(), is(163));
WritableComparable and comparators
IntWritable implements the WritableComparable interface, which is just a subinterface of the Writable and java.lang.Comparable interfaces:
package org.apache.hadoop.io;

public interface WritableComparable<T> extends Writable, Comparable<T> {
}
Comparison of types is crucial for MapReduce, where there is a sorting phase during which keys are compared with one another. One optimization that Hadoop provides is the RawComparator extension of Java's Comparator:
package org.apache.hadoop.io;

import java.util.Comparator;
public interface RawComparator<T> extends Comparator<T> {

  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2);

}
This interface permits implementors to compare records read from a stream without deserializing them into objects, thereby avoiding any overhead of object creation. For example, the comparator for IntWritables implements the raw compare() method by reading an integer from each of the byte arrays b1 and b2 and comparing them directly, from the given start positions (s1 and s2) and lengths (l1 and l2).
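The idea behind IntWritable's raw comparator can be sketched with the standard library alone: decode the big-endian int straight out of each byte array and compare, never constructing an object. This is a simplified illustration of the technique, not Hadoop's actual class; the names readInt() and compare() here are ours.

```java
public class RawIntComparator {

    // Read a big-endian int directly from a byte array, as written
    // by DataOutput.writeInt() (and hence by IntWritable).
    static int readInt(byte[] b, int offset) {
        return ((b[offset] & 0xff) << 24)
             | ((b[offset + 1] & 0xff) << 16)
             | ((b[offset + 2] & 0xff) << 8)
             |  (b[offset + 3] & 0xff);
    }

    // Compare two serialized ints without deserializing into objects.
    static int compare(byte[] b1, int s1, byte[] b2, int s2) {
        return Integer.compare(readInt(b1, s1), readInt(b2, s2));
    }

    public static void main(String[] args) {
        byte[] w1 = {0, 0, 0, (byte) 163}; // serialized form of 163
        byte[] w2 = {0, 0, 0, 67};         // serialized form of 67
        System.out.println(compare(w1, 0, w2, 0) > 0); // prints true
    }
}
```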
WritableComparator is a general-purpose implementation of RawComparator for WritableComparable classes. It provides two main functions. First, it provides a default implementation of the raw compare() method that deserializes the objects to be compared from the stream and invokes the object compare() method. Second, it acts as a factory for RawComparator instances (that Writable implementations have registered). For example, to obtain a comparator for IntWritable, we just use:
RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);
The comparator can be used to compare two IntWritable objects:
IntWritable w1 = new IntWritable(163);
IntWritable w2 = new IntWritable(67);
assertThat(comparator.compare(w1, w2), greaterThan(0));
or their serialized representations:
byte[] b1 = serialize(w1);
byte[] b2 = serialize(w2);
assertThat(comparator.compare(b1, 0, b1.length, b2, 0, b2.length),
    greaterThan(0));
Writable Classes
Hadoop comes with a large selection of Writable classes in the org.apache.hadoop.io package. They form the class hierarchy shown in Figure 4-1.
Writable wrappers for Java primitives
There are Writable wrappers for all the Java primitive types (see Table 4-7) except char (which can be stored in an IntWritable). All have a get() and a set() method for retrieving and storing the wrapped value.
Figure 4-1. Writable class hierarchy
Table 4-7. Writable wrapper classes for Java primitives

Java primitive  Writable implementation  Serialized size (bytes)
boolean         BooleanWritable          1
byte            ByteWritable             1
short           ShortWritable            2
int             IntWritable              4
                VIntWritable             1–5
float           FloatWritable            4
long            LongWritable             8
                VLongWritable            1–9
double          DoubleWritable           8
When it comes to encoding integers, there is a choice between the fixed-length formats (IntWritable and LongWritable) and the variable-length formats (VIntWritable and VLongWritable). The variable-length formats use only a single byte to encode the value if it is small enough (between -112 and 127, inclusive); otherwise, they use the first byte to indicate whether the value is positive or negative, and how many bytes follow. For example, 163 requires two bytes:
byte[] data = serialize(new VIntWritable(163));
assertThat(StringUtils.byteToHexString(data), is("8fa3"));
How do you choose between a fixed-length and a variable-length encoding? Fixed-length encodings are good when the distribution of values is fairly uniform across the whole value space, such as a (well-designed) hash function. Most numeric variables tend to have nonuniform distributions, and on average the variable-length encoding will save space. Another advantage of variable-length encodings is that you can switch from VIntWritable to VLongWritable, since their encodings are actually the same. So by choosing a variable-length representation, you have room to grow without committing to an 8-byte long representation from the beginning.
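To make the encoding concrete, here is a stdlib-only sketch following the algorithm that Hadoop's WritableUtils.writeVLong() uses (to the best of our reading of it): small values go out as a single byte, and larger values get a marker byte encoding sign and length, followed by the value bytes. It reproduces the 8fa3 bytes from the VIntWritable example above.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class VLongDemo {

    // Variable-length encoding in the style of WritableUtils.writeVLong():
    // values in [-112, 127] fit in one byte; otherwise the first byte
    // encodes the sign and the number of value bytes that follow.
    static void writeVLong(DataOutput out, long i) throws IOException {
        if (i >= -112 && i <= 127) {
            out.writeByte((byte) i);
            return;
        }
        int len = -112;
        if (i < 0) {
            i ^= -1L; // one's complement, so we store a positive magnitude
            len = -120;
        }
        long tmp = i;
        while (tmp != 0) { // count how many bytes the magnitude needs
            tmp = tmp >> 8;
            len--;
        }
        out.writeByte((byte) len); // marker byte: sign + byte count
        len = (len < -120) ? -(len + 120) : -(len + 112);
        for (int idx = len; idx != 0; idx--) {
            int shift = (idx - 1) * 8;
            out.writeByte((byte) ((i & (0xFFL << shift)) >> shift));
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        writeVLong(out, 163);
        out.close();
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes.toByteArray()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex); // prints 8fa3
    }
}
```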
Text

Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent of java.lang.String. Text is a replacement for the UTF8 class, which was deprecated because it didn't support strings whose encoding was over 32,767 bytes, and because it used Java's modified UTF-8.

The Text class uses an int (with a variable-length encoding) to store the number of bytes in the string encoding, so the maximum value is 2 GB. Furthermore, Text uses standard UTF-8, which makes it potentially easier to interoperate with other tools that understand UTF-8.
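The difference between standard and modified UTF-8 can be seen with the standard library alone: DataOutputStream.writeUTF() uses Java's modified UTF-8, which encodes each surrogate char of a supplementary character separately (3 bytes each, plus a 2-byte length prefix), whereas String.getBytes("UTF-8") emits the single standard 4-byte sequence. A small demo, using U+10400 from the table further below:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class Utf8Demo {
    public static void main(String[] args) throws IOException {
        String s = "\uD801\uDC00"; // U+10400, a supplementary character

        // Standard UTF-8: one 4-byte sequence (f0 90 90 80).
        System.out.println(s.getBytes("UTF-8").length); // prints 4

        // Java's modified UTF-8 (used by writeUTF()): each surrogate
        // char is encoded separately as 3 bytes, after a 2-byte length.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        dataOut.writeUTF(s);
        dataOut.close();
        System.out.println(out.size()); // prints 8 = 2 (length) + 6 (payload)
    }
}
```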
Because of its emphasis on using standard UTF-8, there are some differences between Text and the Java String class. Indexing for the Text class is in terms of position in the encoded byte sequence, not the Unicode character in the string, or the Java char code unit (as it is for String). For ASCII strings, these three concepts of index position coincide. Here is an example to demonstrate the use of the charAt() method:
Text t = new Text("hadoop");
assertThat(t.getLength(), is(6));
assertThat(t.getBytes().length, is(6));

assertThat(t.charAt(2), is((int) 'd'));
assertThat("Out of bounds", t.charAt(100), is(-1));
Notice that charAt() returns an int representing a Unicode code point, unlike the String variant that returns a char. Text also has a find() method, which is analogous to String's indexOf():
Text t = new Text("hadoop");
assertThat("Find a substring", t.find("do"), is(2));
assertThat("Finds first 'o'", t.find("o"), is(3));
assertThat("Finds 'o' from position 4 or later", t.find("o", 4), is(4));
assertThat("No match", t.find("pig"), is(-1));
When we start using characters that are encoded with more than a single byte, the differences between Text and String become clear. Consider the Unicode characters shown in Table 4-8.
Table 4-8. Unicode characters

Unicode code point   U+0041                  U+00DF                      U+6771                         U+10400
Name                 LATIN CAPITAL LETTER A  LATIN SMALL LETTER SHARP S  N/A (a unified Han ideograph)  DESERET CAPITAL LETTER LONG I
UTF-8 code units     41                      c3 9f                       e6 9d b1                       f0 90 90 80
Java representation  \u0041                  \u00DF                      \u6771                         \uD801\uDC00
All but the last character in the table, U+10400, can be expressed using a single Java char. U+10400 is a supplementary character and is represented by two Java chars, known as a surrogate pair. The tests in Example 4-5 show the differences between String and Text when processing a string of the four characters from Table 4-8.
Example 4-5. Tests showing the differences between the String and Text classes
public class StringTextComparisonTest {

  @Test
  public void string() throws UnsupportedEncodingException {

    String s = "\u0041\u00DF\u6771\uD801\uDC00";

    assertThat(s.length(), is(5));
    assertThat(s.getBytes("UTF-8").length, is(10));

    assertThat(s.indexOf("\u0041"), is(0));
    assertThat(s.indexOf("\u00DF"), is(1));
    assertThat(s.indexOf("\u6771"), is(2));
    assertThat(s.indexOf("\uD801\uDC00"), is(3));

    assertThat(s.charAt(0), is('\u0041'));
    assertThat(s.charAt(1), is('\u00DF'));
    assertThat(s.charAt(2), is('\u6771'));
    assertThat(s.charAt(3), is('\uD801'));
    assertThat(s.charAt(4), is('\uDC00'));

    assertThat(s.codePointAt(0), is(0x0041));
    assertThat(s.codePointAt(1), is(0x00DF));
    assertThat(s.codePointAt(2), is(0x6771));
    assertThat(s.codePointAt(3), is(0x10400));
  }

  @Test
  public void text() {

    Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");

    assertThat(t.getLength(), is(10));

    assertThat(t.find("\u0041"), is(0));
    assertThat(t.find("\u00DF"), is(1));
    assertThat(t.find("\u6771"), is(3));
    assertThat(t.find("\uD801\uDC00"), is(6));

    assertThat(t.charAt(0), is(0x0041));
    assertThat(t.charAt(1), is(0x00DF));
    assertThat(t.charAt(3), is(0x6771));
    assertThat(t.charAt(6), is(0x10400));
  }
}

2. This example is based on one from the article "Supplementary Characters in the Java Platform."
The test confirms that the length of a String is the number of char code units it contains (5, one from each of the first three characters in the string, and a surrogate pair from the last), whereas the length of a Text object is the number of bytes in its UTF-8 encoding (10 = 1+2+3+4). Similarly, the indexOf() method in String returns an index in char code units, and find() for Text is a byte offset.
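The counting arithmetic above can be checked with the standard library alone (no Hadoop needed); Text's getLength() corresponds to the UTF-8 byte count, while String.length() counts char code units:

```java
public class IndexingDemo {
    public static void main(String[] args) throws Exception {
        String s = "\u0041\u00DF\u6771\uD801\uDC00";

        // length() counts char code units: 1 + 1 + 1 + 2 (surrogate pair)
        System.out.println(s.length());                      // prints 5

        // Text's getLength() counts UTF-8 bytes: 1 + 2 + 3 + 4
        System.out.println(s.getBytes("UTF-8").length);      // prints 10

        // The number of actual Unicode characters in the string
        System.out.println(s.codePointCount(0, s.length())); // prints 4
    }
}
```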
The charAt() method in String returns the char code unit for the given index, which in the case of a surrogate pair will not represent a whole Unicode character. The codePointAt() method, indexed by char code unit, is needed to retrieve a single Unicode character represented as an int. In fact, the charAt() method in Text is more like the codePointAt() method than its namesake in String. The only difference is that it is indexed by byte offset.
Iterating over the Unicode characters in Text is complicated by the use of byte offsets for indexing, since you can't just increment the index. The idiom for iteration is a little obscure (see Example 4-6): turn the Text object into a java.nio.ByteBuffer, then repeatedly call the bytesToCodePoint() static method on Text with the buffer. This method extracts the next code point as an int and updates the position in the buffer. The end of the string is detected when bytesToCodePoint() returns -1.
Example 4-6. Iterating over the characters in a Text object
public class TextIterator {

  public static void main(String[] args) {
    Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");

    ByteBuffer buf = ByteBuffer.wrap(t.getBytes(), 0, t.getLength());
    int cp;
    while (buf.hasRemaining() && (cp = Text.bytesToCodePoint(buf)) != -1) {
      System.out.println(Integer.toHexString(cp));
    }
  }
}
Running the program prints the code points for the four characters in the string:

% hadoop TextIterator
41
df
6771
10400
Another difference with String is that Text is mutable (like all Writable implementations in Hadoop, except NullWritable, which is a singleton). You can reuse a Text instance by calling one of the set() methods on it. For example:

Text t = new Text("hadoop");
t.set("pig");
assertThat(t.getLength(), is(3));
assertThat(t.getBytes().length, is(3));
In some situations, the byte array returned by the getBytes() method may be longer than the length returned by getLength():

Text t = new Text("hadoop");
t.set(new Text("pig"));
assertThat(t.getLength(), is(3));
assertThat("Byte length not shortened", t.getBytes().length,
    is(6));

This shows why it is imperative that you always call getLength() when calling getBytes(), so you know how much of the byte array is valid data.
Resorting to String. Text doesn't have as rich an API for manipulating strings as java.lang.String, so in many cases, you need to convert the Text object to a String. This is done in the usual way, using the toString() method:

assertThat(new Text("hadoop").toString(), is("hadoop"));
BytesWritable

BytesWritable is a wrapper for an array of binary data. Its serialized format is an integer field (4 bytes) that specifies the number of bytes to follow, followed by the bytes themselves. For example, the byte array of length two with values 3 and 5 is serialized as a 4-byte integer (00000002) followed by the two bytes from the array (03 and 05):
BytesWritable b = new BytesWritable(new byte[] { 3, 5 });
byte[] bytes = serialize(b);
assertThat(StringUtils.byteToHexString(bytes), is("000000020305"));
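The same length-prefixed framing can be reproduced with the standard library alone, which makes it clear that BytesWritable's wire format is just a big-endian length followed by the raw bytes (the class and helper names below are ours, for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class LengthPrefixDemo {
    public static void main(String[] args) throws IOException {
        // BytesWritable's wire format: 4-byte big-endian length, then bytes.
        byte[] payload = {3, 5};
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        dataOut.writeInt(payload.length); // 00 00 00 02
        dataOut.write(payload);           // 03 05
        dataOut.close();

        StringBuilder hex = new StringBuilder();
        for (byte b : out.toByteArray()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex); // prints 000000020305
    }
}
```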
BytesWritable is mutable, and its value may be changed by calling its set() method. As with Text, the size of the byte array returned from the getBytes() method for BytesWritable (the capacity) may not reflect the actual size of the data stored in the BytesWritable. You can determine the size of the BytesWritable by calling getLength(). To demonstrate:

b.setCapacity(11);
assertThat(b.getLength(), is(2));
assertThat(b.getBytes().length, is(11));
NullWritable

NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes are written to, or read from, the stream. It is used as a placeholder; for example, in MapReduce, a key or a value can be declared as a NullWritable when you don't need to use that position, since it effectively stores a constant empty value. NullWritable can also be useful as a key in SequenceFile when you want to store a list of values, as opposed to key-value pairs. It is an immutable singleton: the instance can be retrieved by calling NullWritable.get().
ObjectWritable and GenericWritable
ObjectWritable is a general-purpose wrapper for the following: Java primitives, String, enum, Writable, null, or arrays of any of these types. It is used in Hadoop RPC to marshal and unmarshal method arguments and return types.
ObjectWritable is useful when a field can be of more than one type: for example, if the values in a SequenceFile have multiple types, then you can declare the value type as an ObjectWritable and wrap each type in an ObjectWritable. Being a general-purpose mechanism, it's fairly wasteful of space since it writes the classname of the wrapped type every time it is serialized. In cases where the number of types is small and known ahead of time, this can be improved by having a static array of types, and using the index into the array as the serialized reference to the type. This is the approach that GenericWritable takes, and you have to subclass it to specify the types to support.
Writable collections
There are six Writable collection types in the org.apache.hadoop.io package: ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable, MapWritable, SortedMapWritable, and EnumSetWritable.
ArrayWritable and TwoDArrayWritable are Writable implementations for arrays and two-dimensional arrays (array of arrays) of Writable instances. All the elements of an ArrayWritable or a TwoDArrayWritable must be instances of the same class, which is specified at construction, as follows:
ArrayWritable writable = new ArrayWritable(Text.class);
In contexts where the Writable is defined by type, such as in SequenceFile keys or values, or as input to MapReduce in general, you need to subclass ArrayWritable (or TwoDArrayWritable, as appropriate) to set the type statically. For example:

public class TextArrayWritable extends ArrayWritable {
  public TextArrayWritable() {
    super(Text.class);
  }
}
ArrayWritable and TwoDArrayWritable both have get() and set() methods, as well as a toArray() method, which creates a shallow copy of the array (or 2D array).
ArrayPrimitiveWritable is a wrapper for arrays of Java primitives. The component type is detected when you call set(), so there is no need to subclass to set the type.
MapWritable and SortedMapWritable are implementations of java.util.Map<Writable, Writable> and java.util.SortedMap<WritableComparable, Writable>, respectively. The type of each key and value field is a part of the serialization format for that field. The type is stored as a single byte that acts as an index into an array of types. The array is populated with the standard types in the org.apache.hadoop.io package, but custom Writable types are accommodated, too, by writing a header that encodes the type array for nonstandard types. As they are implemented, MapWritable and SortedMapWritable use positive byte values for custom types, so a maximum of 127 distinct nonstandard Writable classes can be used in any particular MapWritable or SortedMapWritable instance. Here's a demonstration of using a MapWritable with different types for keys and values:
MapWritable src = new MapWritable();
src.put(new IntWritable(1), new Text("cat"));
src.put(new VIntWritable(2), new LongWritable(163));

MapWritable dest = new MapWritable();
WritableUtils.cloneInto(dest, src);
assertThat((Text) dest.get(new IntWritable(1)), is(new Text("cat")));
assertThat((LongWritable) dest.get(new VIntWritable(2)),
    is(new LongWritable(163)));
Conspicuous by their absence are Writable collection implementations for sets and lists. A general set can be emulated by using a MapWritable (or a SortedMapWritable for a sorted set), with NullWritable values. There is also EnumSetWritable for sets of enum types. For lists of a single type of Writable, ArrayWritable is adequate, but to store different types of Writable in a single list, you can use GenericWritable to wrap the elements in an ArrayWritable. Alternatively, you could write a general ListWritable using the ideas from MapWritable.
Implementing a Custom Writable
Hadoop comes with a useful set of Writable implementations that serve most purposes; however, on occasion, you may need to write your own custom implementation. With a custom Writable, you have full control over the binary representation and the sort order. Because Writables are at the heart of the MapReduce data path, tuning the binary representation can have a significant effect on performance. The stock Writable implementations that come with Hadoop are well-tuned, but for more elaborate structures, it is often better to create a new Writable type, rather than compose the stock types.
To demonstrate how to create a custom Writable, we shall write an implementation that represents a pair of strings, called TextPair. The basic implementation is shown in Example 4-7.
Example 4-7. A Writable implementation that stores a pair of Text objects
import java.io.*;

import org.apache.hadoop.io.*;

public class TextPair implements WritableComparable<TextPair> {

  private Text first;
  private Text second;

  public TextPair() {
    set(new Text(), new Text());
  }

  public TextPair(String first, String second) {
    set(new Text(first), new Text(second));
  }

  public TextPair(Text first, Text second) {
    set(first, second);
  }

  public void set(Text first, Text second) {
    this.first = first;
    this.second = second;
  }

  public Text getFirst() {
    return first;
  }

  public Text getSecond() {
    return second;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    first.write(out);
    second.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    first.readFields(in);
    second.readFields(in);
  }

  @Override
  public int hashCode() {
    return first.hashCode() * 163 + second.hashCode();
  }

  @Override
  public boolean equals(Object o) {
    if (o instanceof TextPair) {
      TextPair tp = (TextPair) o;
      return first.equals(tp.first) && second.equals(tp.second);
    }
    return false;
  }

  @Override
  public String toString() {
    return first + "\t" + second;
  }

  @Override
  public int compareTo(TextPair tp) {
    int cmp = first.compareTo(tp.first);
    if (cmp != 0) {
      return cmp;
    }
    return second.compareTo(tp.second);
  }
}
The first part of the implementation is straightforward: there are two Text instance variables, first and second, and associated constructors, getters, and setters. All Writable implementations must have a default constructor so that the MapReduce framework can instantiate them, then populate their fields by calling readFields(). Writable instances are mutable and often reused, so you should take care to avoid allocating objects in the write() or readFields() methods.
TextPair's write() method serializes each Text object in turn to the output stream, by delegating to the Text objects themselves. Similarly, readFields() deserializes the bytes from the input stream by delegating to each Text object. The DataOutput and DataInput interfaces have a rich set of methods for serializing and deserializing Java primitives, so, in general, you have complete control over the wire format of your Writable object.
Just as you would for any value object you write in Java, you should override the hashCode(), equals(), and toString() methods from java.lang.Object. The hashCode() method is used by the HashPartitioner (the default partitioner in MapReduce) to choose a reduce partition, so you should make sure that you write a good hash function that mixes well to ensure reduce partitions are of a similar size.
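To see how the hash is turned into a partition number, here is a stdlib-only sketch of the formula HashPartitioner uses, as documented: mask off the sign bit, then take the remainder modulo the number of reducers. The class name and the use of String hashes (standing in for a TextPair-style first.hashCode() * 163 + second.hashCode()) are ours, for illustration.

```java
public class PartitionDemo {

    // The HashPartitioner formula: clear the sign bit so the result
    // is non-negative, then reduce modulo the number of reduce tasks.
    static int getPartition(int hashCode, int numReduceTasks) {
        return (hashCode & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // A TextPair-style hash: first.hashCode() * 163 + second.hashCode()
        int hash = "1949".hashCode() * 163 + "111".hashCode();
        System.out.println(getPartition(hash, 10)); // always in [0, 10)
    }
}
```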
If you ever plan to use your custom Writable with TextOutputFormat, then you must implement its toString() method. TextOutputFormat calls toString() on keys and values for their output representation. For TextPair, we write the underlying Text objects as strings separated by a tab character.
TextPair is an implementation of WritableComparable, so it provides an implementation of the compareTo() method that imposes the ordering you would expect: it sorts by the first string followed by the second. Notice that TextPair differs from TextArrayWritable from the previous section (apart from the number of Text objects it can store), since TextArrayWritable is only a Writable, not a WritableComparable.
Implementing a RawComparator for speed
The code for TextPair in Example 4-7 will work as it stands; however, there is a further optimization we can make. As explained in "WritableComparable and comparators" on page 97, when TextPair is being used as a key in MapReduce, it will have to be deserialized into an object for the compareTo() method to be invoked. What if it were possible to compare two TextPair objects just by looking at their serialized representations?
It turns out that we can do this, since TextPair is the concatenation of two Text objects, and the binary representation of a Text object is a variable-length integer containing the number of bytes in the UTF-8 representation of the string, followed by the UTF-8 bytes themselves. The trick is to read the initial length, so we know how long the first Text object's byte representation is; then we can delegate to Text's RawComparator, and invoke it with the appropriate offsets for the first or second string. Example 4-8 gives the details (note that this code is nested in the TextPair class).
Example 4-8. A RawComparator for comparing TextPair byte representations
public static class Comparator extends WritableComparator {

  private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

  public Comparator() {
    super(TextPair.class);
  }

  @Override
  public int compare(byte[] b1, int s1, int l1,
                     byte[] b2, int s2, int l2) {

    try {
      int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
      int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
      int cmp = TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
      if (cmp != 0) {
        return cmp;
      }
      return TEXT_COMPARATOR.compare(b1, s1 + firstL1, l1 - firstL1,
                                     b2, s2 + firstL2, l2 - firstL2);
    } catch (IOException e) {
      throw new IllegalArgumentException(e);
    }
  }
}

static {
  WritableComparator.define(TextPair.class, new Comparator());
}
We actually subclass WritableComparator rather than implement RawComparator directly, since it provides some convenience methods and default implementations. The subtle part of this code is calculating firstL1 and firstL2, the lengths of the first Text field in each byte stream. Each is made up of the length of the variable-length integer (returned by decodeVIntSize() on WritableUtils) and the value it is encoding (returned by readVInt()).

The static block registers the raw comparator so that whenever MapReduce sees the TextPair class, it knows to use the raw comparator as its default comparator.
Custom comparators
As we can see with TextPair, writing raw comparators takes some care, since you have to deal with details at the byte level. It is worth looking at some of the implementations of Writable in the org.apache.hadoop.io package for further ideas, if you need to write your own. The utility methods on WritableUtils are very handy, too.

Custom comparators should also be written to be RawComparators, if possible. These are comparators that implement a different sort order from the natural sort order defined by the default comparator. Example 4-9 shows a comparator for TextPair, called FirstComparator, that considers only the first string of the pair. Note that we override the compare() method that takes objects so both compare() methods have the same semantics.

We will make use of this comparator in Chapter 8, when we look at joins and secondary sorting in MapReduce (see "Joins" on page 281).
Example 4-9. A custom RawComparator for comparing the first field of TextPair byte representations
public static class FirstComparator extends WritableComparator {

  private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();

  public FirstComparator() {
    super(TextPair.class);
  }

  @Override
  public int compare(byte[] b1, int s1, int l1,
                     byte[] b2, int s2, int l2) {

    try {
      int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
      int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
      return TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
    } catch (IOException e) {
      throw new IllegalArgumentException(e);
    }
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    if (a instanceof TextPair && b instanceof TextPair) {
      return ((TextPair) a).first.compareTo(((TextPair) b).first);
    }
    return super.compare(a, b);
  }
}
Serialization Frameworks
Although most MapReduce programs use Writable key and value types, this isn't mandated by the MapReduce API. In fact, any types can be used; the only requirement is that there be a mechanism that translates to and from a binary representation of each type.
To support this, Hadoop has an API for pluggable serialization frameworks. A serialization framework is represented by an implementation of Serialization (in the org.apache.hadoop.io.serializer package). WritableSerialization, for example, is the implementation of Serialization for Writable types.

A Serialization defines a mapping from types to Serializer instances (for turning an object into a byte stream) and Deserializer instances (for turning a byte stream into an object).
Set the io.serializations property to a comma-separated list of classnames to register Serialization implementations. Its default value includes org.apache.hadoop.io.serializer.WritableSerialization and the Avro Specific and Reflect serializations, which means that only Writable or Avro objects can be serialized or deserialized out of the box.
Hadoop includes a class called JavaSerialization that uses Java Object Serialization. Although it makes it convenient to be able to use standard Java types in MapReduce programs, like Integer or String, Java Object Serialization is not as efficient as Writables, so it's not worth making this trade-off (see the sidebar on the next page).
Why Not Use Java Object Serialization?
Java comes with its own serialization mechanism, called Java Object Serialization (often referred to simply as "Java Serialization"), that is tightly integrated with the language, so it's natural to ask why this wasn't used in Hadoop. Here's what Doug Cutting said in response to that question:
Why didn't I use Serialization when we first started Hadoop? Because it looked big and hairy and I thought we needed something lean and mean, where we had precise control over exactly how objects are written and read, since that is central to Hadoop. With Serialization you can get some control, but you have to fight for it.
The logic for not using RMI was similar. Effective, high-performance inter-process communications are critical to Hadoop. I felt like we'd need to precisely control how things like connections, timeouts and buffers are handled, and RMI gives you little control over those.
The problem is that Java Serialization doesn't meet the criteria for a serialization format listed earlier: compact, fast, extensible, and interoperable.
Java Serialization is not compact: it writes the classname of each object being written to the stream; this is true of classes that implement java.io.Serializable or java.io.Externalizable. Subsequent instances of the same class write a reference handle to the first occurrence, which occupies only 5 bytes. However, reference handles don't work well with random access, since the referent class may occur at any point in the preceding stream; that is, there is state stored in the stream. Even worse, reference handles play havoc with sorting records in a serialized stream, since the first record of a particular class is distinguished and must be treated as a special case.
All these problems are avoided by not writing the classname to the stream at all, which is the approach that Writable takes. This makes the assumption that the client knows the expected type. The result is that the format is considerably more compact than Java Serialization, and random access and sorting work as expected since each record is independent of the others (so there is no stream state).
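The compactness argument is easy to check with nothing but the JDK. The following sketch (the class name and helper methods are mine, not from the book) measures the overhead of Java Object Serialization's stream header and class metadata against a raw 4-byte encoding of the kind IntWritable produces:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Compares the size of an int encoded with Java Object Serialization
// (stream header and class metadata included) against the raw 4-byte
// big-endian encoding that a Writable such as IntWritable uses.
public class SerializationSize {

  public static int javaSerializedSize(Object o) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      ObjectOutputStream out = new ObjectOutputStream(bytes);
      out.writeObject(o); // writes classname, metadata, then the value
      out.close();
      return bytes.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static int rawIntSize(int v) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(v); // exactly 4 bytes, no class metadata
      out.close();
      return bytes.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("Java Serialization: " + javaSerializedSize(Integer.valueOf(163)) + " bytes");
    System.out.println("Raw int:            " + rawIntSize(163) + " bytes");
  }
}
```

The raw encoding is always 4 bytes; the Java-serialized form is many times larger because the stream carries the java.lang.Integer class description alongside the value.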
Java Serialization is a general-purpose mechanism for serializing graphs of objects, so it necessarily has some overhead for serialization and deserialization operations. What's more, the deserialization procedure creates a new instance for each object deserialized from the stream. Writable objects, on the other hand, can be (and often are) reused. For example, for a MapReduce job, which at its core serializes and deserializes billions of records of just a handful of different types, the savings gained by not having to allocate new objects are significant.
In terms of extensibility, Java Serialization has some support for evolving a type, but it is brittle and hard to use effectively (Writables have no support: the programmer has to manage them himself).
In principle, other languages could interpret the Java Serialization stream protocol (defined by the Java Object Serialization Specification), but in practice there are no widely used implementations in other languages, so it is a Java-only solution. The situation is the same for Writables.
Serialization IDL
There are a number of other serialization frameworks that approach the problem in a different way: rather than defining types through code, you define them in a language-neutral, declarative fashion, using an interface description language (IDL). The system can then generate types for different languages, which is good for interoperability. They also typically define versioning schemes that make type evolution straightforward.
Hadoop's own Record I/O (found in the org.apache.hadoop.record package) has an IDL that is compiled into Writable objects, which makes it convenient for generating types that are compatible with MapReduce. For whatever reason, however, Record I/O was not widely used, and has been deprecated in favor of Avro.
Apache Thrift and Google Protocol Buffers are both popular serialization frameworks, and they are commonly used as a format for persistent binary data. There is limited support for these as MapReduce formats; however, they are used internally in parts of Hadoop for RPC and data exchange.
In the next section, we look at Avro, an IDL-based serialization framework designed to work well with large-scale data processing in Hadoop.
Apache Avio
Apache Avro is a language-neutral data serialization system. The project was created by Doug Cutting (the creator of Hadoop) to address the major downside of Hadoop Writables: lack of language portability. Having a data format that can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby) makes it easier to share datasets with a wider audience than one tied to a single language. It is also more future-proof, allowing data to potentially outlive the language used to read and write it.
But why a new data serialization system? Avro has a set of features that, taken together, differentiate it from other systems like Apache Thrift or Google's Protocol Buffers. Like these systems and others, Avro data is described using a language-independent schema. However, unlike some other systems, code generation is optional in Avro, which means you can read and write data that conforms to a given schema even if your
3. You can find the latest status for a Thrift Serialization at https://issues.apache.org/jira/browse/HADOOP-3787, and a Protocol Buffers Serialization at https://issues.apache.org/jira/browse/HADOOP-3788. Twitter's Elephant Bird project (http://github.com/kevinweil/elephant-bird) includes tools for working with Protocol Buffers in Hadoop.
4. Named after the British aircraft manufacturer from the 20th century.
5. Avro also performs favorably compared to other serialization libraries, as the benchmarks at http://code.google.com/p/thrift-protobuf-compare/ demonstrate.
code has not seen that particular schema before. To achieve this, Avro assumes that the schema is always present, at both read and write time, which makes for a very compact encoding, since encoded values do not need to be tagged with a field identifier.
Avro schemas are usually written in JSON, and data is usually encoded using a binary format, but there are other options, too. There is a higher-level language called Avro IDL, for writing schemas in a C-like language that is more familiar to developers. There is also a JSON-based data encoder, which, being human-readable, is useful for prototyping and debugging Avro data.
The Avro specification precisely defines the binary format that all implementations must support. It also specifies many of the other features of Avro that implementations should support. One area that the specification does not rule on, however, is APIs: implementations have complete latitude in the API they expose for working with Avro data, since each one is necessarily language-specific. The fact that there is only one binary format is significant, since it means the barrier for implementing a new language binding is lower, and avoids the problem of a combinatorial explosion of languages and formats, which would harm interoperability.
Avro has rich schema resolution capabilities. Within certain carefully defined constraints, the schema used to read data need not be identical to the schema that was used to write the data. This is the mechanism by which Avro supports schema evolution. For example, a new, optional field may be added to a record by declaring it in the schema used to read the old data. New and old clients alike will be able to read the old data, while new clients can write new data that uses the new field. Conversely, if an old client sees newly encoded data, it will gracefully ignore the new field and carry on processing as it would have done with old data.
Avro specifies an object container format for sequences of objects, similar to Hadoop's sequence file. An Avro data file has a metadata section where the schema is stored, which makes the file self-describing. Avro data files support compression and are splittable, which is crucial for a MapReduce data input format. Furthermore, since Avro was designed with MapReduce in mind, in the future it will be possible to use Avro to bring first-class MapReduce APIs (that is, ones that are richer than Streaming, like the Java API, or C++ Pipes) to languages that speak Avro.
Avro can be used for RPC, too, although this isn't covered here. More information is in the specification.
Avro data types and schemas
Avro defines a small number of data types, which can be used to build application-specific data structures by writing schemas. For interoperability, implementations must support all Avro types.
Avro's primitive types are listed in Table 4-9. Each primitive type may also be specified using a more verbose form, using the type attribute, such as:
{ "type": "null" }
Table 4-9. Avro primitive types
Type Description Schema
null The absence of a value "null"
boolean A binary value "boolean"
int 32-bit signed integer "int"
long 64-bit signed integer "long"
float Single precision (32-bit) IEEE 754 floating-point number "float"
double Double precision (64-bit) IEEE 754 floating-point number "double"
bytes Sequence of 8-bit unsigned bytes "bytes"
string Sequence of Unicode characters "string"
Avro also defines the complex types listed in Table 4-10, along with a representative example of a schema of each type.
Table 4-10. Avro complex types
array
    An ordered collection of objects. All objects in a particular array must have the same schema.
    {
      "type": "array",
      "items": "long"
    }

map
    An unordered collection of key-value pairs. Keys must be strings, values may be any type, although within a particular map all values must have the same schema.
    {
      "type": "map",
      "values": "string"
    }

record
    A collection of named fields of any type.
    {
      "type": "record",
      "name": "WeatherRecord",
      "doc": "A weather reading.",
      "fields": [
        {"name": "year", "type": "int"},
        {"name": "temperature", "type": "int"},
        {"name": "stationId", "type": "string"}
      ]
    }

enum
    A set of named values.
    {
      "type": "enum",
      "name": "Cutlery",
      "doc": "An eating utensil.",
      "symbols": ["KNIFE", "FORK", "SPOON"]
    }

fixed
    A fixed number of 8-bit unsigned bytes.
    {
      "type": "fixed",
      "name": "Md5Hash",
      "size": 16
    }

union
    A union of schemas. A union is represented by a JSON array, where each element in the array is a schema. Data represented by a union must match one of the schemas in the union.
    [
      "null",
      "string",
      {"type": "map", "values": "string"}
    ]
Each Avro language API has a representation for each Avro type that is specific to the language. For example, Avro's double type is represented in C, C++, and Java by a double, in Python by a float, and in Ruby by a Float.
What's more, there may be more than one representation, or mapping, for a language. All languages support a dynamic mapping, which can be used even when the schema is not known ahead of run time. Java calls this the generic mapping.
In addition, the Java and C++ implementations can generate code to represent the data for an Avro schema. Code generation, which is called the specific mapping in Java, is an optimization that is useful when you have a copy of the schema before you read or write data. Generated classes also provide a more domain-oriented API for user code than generic ones.
Java has a third mapping, the reflect mapping, which maps Avro types onto preexisting Java types, using reflection. It is slower than the generic and specific mappings, and is not generally recommended for new applications.
Java's type mappings are shown in Table 4-11. As the table shows, the specific mapping is the same as the generic one unless otherwise noted (and the reflect one is the same as the specific one unless noted). The specific mapping only differs from the generic one for record, enum, and fixed, all of which have generated classes (the name of which is controlled by the name and optional namespace attribute).
Why don't the Java generic and specific mappings use Java String to represent an Avro string? The answer is efficiency: the Avro Utf8 type is mutable, so it may be reused for reading or writing a series of values. Also, Java String decodes UTF-8 at object construction time, while Avro Utf8 does it lazily, which can increase performance in some cases.
Utf8 implements Java's java.lang.CharSequence interface, which allows some interoperability with Java libraries. In other cases it may be necessary to convert Utf8 instances to String objects by calling its toString() method.
From Avro 1.6.0 onwards there is an option to have Avro always perform the conversion to String. There are a couple of ways to achieve this. The first is to set the avro.java.string property in the schema to String:
    { "type": "string", "avro.java.string": "String" }
Alternatively, for the specific mapping you can generate classes which have String-based getters and setters. When using the Avro Maven plugin this is done by setting the configuration property stringType to String (the example code has a demonstration of this).
Finally, note that the Java reflect mapping always uses String objects, since it is designed for Java compatibility, not performance.
Table 4-11. Avro Java type mappings
Avro type | Generic Java mapping | Specific Java mapping | Reflect Java mapping
null      | null type
boolean   | boolean
int       | int | | short or int
long      | long
float     | float
double    | double
bytes     | java.nio.ByteBuffer | | array of byte
string    | org.apache.avro.util.Utf8 or java.lang.String | | java.lang.String
array     | org.apache.avro.generic.GenericArray | | array or java.util.Collection
map       | java.util.Map
record    | org.apache.avro.generic.GenericRecord | Generated class implementing org.apache.avro.specific.SpecificRecord | Arbitrary user class with a zero-argument constructor. All inherited nontransient instance fields are used.
enum      | java.lang.String | Generated Java enum | Arbitrary Java enum
fixed     | org.apache.avro.generic.GenericFixed | Generated class implementing org.apache.avro.specific.SpecificFixed |
union     | java.lang.Object
In-memory serialization and deserialization
Avro provides APIs for serialization and deserialization, which are useful when you want to integrate Avro with an existing system, such as a messaging system where the framing format is already defined. In other cases, consider using Avro's data file format.
Let's write a Java program to read and write Avro data to and from streams. We'll start with a simple Avro schema for representing a pair of strings as a record:
{
  "type": "record",
  "name": "StringPair",
  "doc": "A pair of strings.",
  "fields": [
    {"name": "left", "type": "string"},
    {"name": "right", "type": "string"}
  ]
}
If this schema is saved in a file on the classpath called StringPair.avsc (.avsc is the conventional extension for an Avro schema), then we can load it using the following two lines of code:
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(getClass().getResourceAsStream("StringPair.avsc"));
We can create an instance of an Avro record using the generic API as follows:
GenericRecord datum = new GenericData.Record(schema);
datum.put("left", "L");
datum.put("right", "R");
Next, we serialize the record to an output stream:
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(datum, encoder);
There are two important objects here: the DatumWriter and the Encoder. A DatumWriter translates data objects into the types understood by an Encoder, which the latter writes to the output stream. Here we are using a GenericDatumWriter, which passes the fields of GenericRecord to the Encoder. We pass a null to the encoder factory since we are not reusing a previously constructed encoder here.
In this example only one object is written to the stream, but we could call write() with more objects before closing the stream if we wanted to.
The GenericDatumWriter needs to be passed the schema since it follows the schema to determine which values from the data objects to write out. After we have called the writer's write() method, we flush the encoder, then close the output stream.
We can reverse the process and read the object back from the byte buffer:
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
GenericRecord result = reader.read(null, decoder);
assertThat(result.get("left").toString(), is("L"));
assertThat(result.get("right").toString(), is("R"));
We pass null to the calls to binaryDecoder() and read() since we are not reusing objects here (the decoder or the record, respectively).
The objects returned by result.get("left") and result.get("right") are of type Utf8, so we convert them into Java String objects by calling their toString() methods.
The specific API. Let's look now at the equivalent code using the specific API. We can generate the StringPair class from the schema file by using Avro's Maven plugin for compiling schemas. The following is the relevant part of the Maven POM:
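The POM fragment did not survive intact here; a minimal sketch of the avro-maven-plugin configuration, with illustrative version and directory values, looks like this:

```xml
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>${avro.version}</version>
  <executions>
    <execution>
      <id>schemas</id>
      <phase>generate-sources</phase>
      <goals>
        <!-- compiles .avsc schema files into Java classes -->
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>src/main/resources</sourceDirectory>
        <outputDirectory>${project.build.directory}/generated-sources/java</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```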
As an alternative to Maven you can use Avro's Ant task, org.apache.avro.specific.SchemaTask, or the Avro command-line tools to generate Java code for a schema.
In the code for serializing and deserializing, instead of a GenericRecord we construct a StringPair instance, which we write to the stream using a SpecificDatumWriter, and read back using a SpecificDatumReader:
StringPair datum = new StringPair();
datum.left = "L";
datum.right = "R";
ByteArrayOutputStream out = new ByteArrayOutputStream();
DatumWriter<StringPair> writer =
new SpecificDatumWriter<StringPair>(StringPair.class);
Encoder encoder = EncoderFactory.get().binaryEncoder(out, null);
writer.write(datum, encoder);

DatumReader<StringPair> reader =
new SpecificDatumReader<StringPair>(StringPair.class);
Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
StringPair result = reader.read(null, decoder);
assertThat(result.left.toString(), is("L"));
assertThat(result.right.toString(), is("R"));
Avro data files
Avro's object container file format is for storing sequences of Avro objects. It is very similar in design to Hadoop's sequence files, which are described in "SequenceFile" on page 132. The main difference is that Avro data files are designed to be portable across languages, so, for example, you can write a file in Python and read it in C (we will do exactly this in the next section).
A data file has a header containing metadata, including the Avro schema and a sync marker, followed by a series of (optionally compressed) blocks containing the serialized Avro objects. Blocks are separated by a sync marker that is unique to the file (the marker for a particular file is found in the header) and that permits rapid resynchronization with a block boundary after seeking to an arbitrary point in the file, such as an HDFS block boundary. Thus, Avro data files are splittable, which makes them amenable to efficient MapReduce processing.
Writing Avro objects to a data file is similar to writing to a stream. We use a DatumWriter, as before, but instead of using an Encoder, we create a DataFileWriter
6. Avro can be downloaded in both source and binary forms from http://avro.apache.org/releases.html. Get usage instructions for the Avro tools by typing java -jar avro-tools-*.jar.
instance with the DatumWriter. Then we can create a new data file (which, by convention, has a .avro extension) and append objects to it:
File file = new File("data.avro");
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter =
    new DataFileWriter<GenericRecord>(writer);
dataFileWriter.create(schema, file);
dataFileWriter.append(datum);
dataFileWriter.close();
The objects that we write to the data file must conform to the file's schema; otherwise an exception will be thrown when we call append().
This example demonstrates writing to a local file (java.io.File in the previous snippet), but we can write to any java.io.OutputStream by using the overloaded create() method on DataFileWriter. To write a file to HDFS, for example, get an OutputStream by calling create() on FileSystem (see "Writing Data" on page 62).
Reading back objects from a data file is similar to the earlier case of reading objects from an in-memory stream, with one important difference: we don't have to specify a schema since it is read from the file metadata. Indeed, we can get the schema from the DataFileReader instance, using getSchema(), and verify that it is the same as the one we used to write the original object with:
DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>();
DataFileReader<GenericRecord> dataFileReader =
new DataFileReader<GenericRecord>(file, reader);
assertThat("Schema is the same", schema, is(dataFileReader.getSchema()));
DataFileReader is a regular Java iterator, so we can iterate through its data objects by calling its hasNext() and next() methods. The following snippet checks that there is only one record, and that it has the expected field values:
assertThat(dataFileReader.hasNext(), is(true));
GenericRecord result = dataFileReader.next();
assertThat(result.get("left").toString(), is("L"));
assertThat(result.get("right").toString(), is("R"));
assertThat(dataFileReader.hasNext(), is(false));
Rather than using the usual next() method, however, it is preferable to use the overloaded form that takes an instance of the object to be returned (in this case, GenericRecord), since it will reuse the object and save allocation and garbage collection costs for files containing many objects. The following is idiomatic:
GenericRecord record = null;
while (dataFileReader.hasNext()) {
  record = dataFileReader.next(record);
  // process record
}
If object reuse is not important, you can use this shorter form:
for (GenericRecord record : dataFileReader) {
  // process record
}
For the general case of reading a file on a Hadoop filesystem, use Avro's FsInput to specify the input file using a Hadoop Path object. DataFileReader actually offers random access to an Avro data file (via its seek() and sync() methods); however, in many cases, sequential streaming access is sufficient, for which DataFileStream should be used. DataFileStream can read from any Java InputStream.
Interoperability
To demonstrate Avro's language interoperability, let's write a data file using one language (Python) and read it back with another (C).
Python API. The program in Example 4-10 reads comma-separated strings from standard input and writes them as StringPair records to an Avro data file. Like the Java code for writing a data file, we create a DatumWriter and a DataFileWriter object. Notice that we have embedded the Avro schema in the code, although we could equally well have read it from a file.
Python represents Avro records as dictionaries; each line that is read from standard in is turned into a dict object and appended to the DataFileWriter.
Example 4-10. A Python program for writing Avro record pairs to a data file
import os
import string
import sys

from avro import schema
from avro import io
from avro import datafile

if __name__ == '__main__':
  if len(sys.argv) != 2:
    sys.exit('Usage: %s <data_file>' % sys.argv[0])
  avro_file = sys.argv[1]
  writer = open(avro_file, 'wb')
  datum_writer = io.DatumWriter()
  schema_object = schema.parse("""\
{ "type": "record",
  "name": "StringPair",
  "doc": "A pair of strings.",
  "fields": [
    {"name": "left", "type": "string"},
    {"name": "right", "type": "string"}
  ]
}""")
  dfw = datafile.DataFileWriter(writer, datum_writer, schema_object)
  for line in sys.stdin.readlines():
    (left, right) = string.split(line.strip(), ',')
    dfw.append({'left':left, 'right':right})
  dfw.close()
Before we can run the program, we need to install Avro for Python:
% easy_install avro
To run the program, we specify the name of the file to write output to (pairs.avro) and send input pairs over standard in, marking the end of file by typing Control-D:
% python avro/src/main/py/write_pairs.py pairs.avro
C API. Next we'll turn to the C API and write a program to display the contents of pairs.avro; see Example 4-11.
Example 4-11. A C program for reading Avro record pairs from a data file
#include <avro.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
  if (argc != 2) {
    fprintf(stderr, "Usage: dump_pairs <data_file>\n");
    exit(EXIT_FAILURE);
  }

  const char *avrofile = argv[1];
  avro_schema_error_t error;
  avro_file_reader_t filereader;
  avro_datum_t pair;
  avro_datum_t left;
  avro_datum_t right;
  int rval;
  char *p;

  avro_file_reader(avrofile, &filereader);
  while (1) {
    rval = avro_file_reader_read(filereader, NULL, &pair);
    if (rval) break;
    if (avro_record_get(pair, "left", &left) == 0) {
      avro_string_get(left, &p);
      fprintf(stdout, "%s,", p);
    }
    if (avro_record_get(pair, "right", &right) == 0) {
      avro_string_get(right, &p);
      fprintf(stdout, "%s\n", p);
    }
7. For the general case, the Avro tools JAR file has a tojson command that dumps the contents of an Avro data file as JSON.
  }
  avro_file_reader_close(filereader);
  return 0;
}
The core of the program does three things:
1. opens a file reader of type avro_file_reader_t by calling Avro's avro_file_reader function,
2. reads Avro data from the file reader with the avro_file_reader_read function in a while loop until there are no pairs left (as determined by the return value rval), and
3. closes the file reader with avro_file_reader_close.
The avro_file_reader_read function accepts a schema as its second argument to support the case where the schema for reading is different to the one used when the file was written (this is explained in the next section), but we simply pass in NULL, which tells Avro to use the data file's schema. The third argument is a pointer to an avro_datum_t object, which is populated with the contents of the next record read from the file. We unpack the pair structure into its fields by calling avro_record_get, and then we extract the value of these fields as strings using avro_string_get, which we print to the console.
Running the program using the output of the Python program prints the original input:
% ./dump_pairs pairs.avro
We have successfully exchanged complex data between two Avro implementations.
Schema resolution
We can choose to use a different schema for reading the data back (the reader's schema) from the one we used to write it (the writer's schema). This is a powerful tool, since it enables schema evolution. To illustrate, consider a new schema for string pairs, with an added description field:
{
  "type": "record",
  "name": "StringPair",
  "doc": "A pair of strings with an added field.",
  "fields": [
    {"name": "left", "type": "string"},
    {"name": "right", "type": "string"},
    {"name": "description", "type": "string", "default": ""}
  ]
}
8. Avro functions and types have an avro_ prefix and are defined in the avro.h header file.
We can use this schema to read the data we serialized earlier, since, crucially, we have given the description field a default value (the empty string), which Avro will use when there is no such field defined in the records it is reading. Had we omitted the default attribute, we would get an error when trying to read the old data.
To make the default value null, rather than the empty string, we would instead define the description field using a union with the null Avro type:
    {"name": "description", "type": ["null", "string"], "default": null}
When the reader's schema is different from the writer's, we use the constructor for GenericDatumReader that takes two schema objects, the writer's and the reader's, in that order:
DatumReader<GenericRecord> reader =
new GenericDatumReader<GenericRecord>(schema, newSchema);
Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
GenericRecord result = reader.read(null, decoder);
assertThat(result.get("left").toString(), is("L"));
assertThat(result.get("right").toString(), is("R"));
assertThat(result.get("description").toString(), is(""));
For data files, which have the writer's schema stored in the metadata, we only need to specify the reader's schema explicitly, which we can do by passing null for the writer's schema:
DatumReader<GenericRecord> reader =
new GenericDatumReader<GenericRecord>(null, newSchema);
Another common use of a different reader's schema is to drop fields in a record, an operation called projection. This is useful when you have records with a large number of fields and you only want to read some of them. For example, this schema can be used to get only the right field of a StringPair:
{
  "type": "record",
  "name": "StringPair",
  "doc": "The right field of a pair of strings.",
  "fields": [
    {"name": "right", "type": "string"}
  ]
}
9. Default values for fields are encoded using JSON. See the Avro specification for a description of this encoding for each data type.
The rules for schema resolution have a direct bearing on how schemas may evolve from one version to the next, and are spelled out in the Avro specification for all Avro types. A summary of the rules for record evolution from the point of view of readers and writers (or servers and clients) is presented in Table 4-12.
Table 4-12. Schema resolution of records
New schema Writer Reader Action
Added field Old New The reader uses the default value of the new field, since it is not written by the writer.
New Old The reader does not know about the new field written by the writer, so it is ignored.
Removed field Old New The reader ignores the removed field. (Projection).
New Old The removed field is not written by the writer. If the old schema had a default defined
for the field, then the reader uses this, otherwise it gets an error. In this case, it is best
to update the reader’s schema at the same time as, or before, the writer’s.
Sort order
Avro defines a sort order for objects. For most Avro types, the order is the natural one you would expect; for example, numeric types are ordered by ascending numeric value. Others are a little more subtle: enums are compared by the order in which the symbol is defined and not by the value of the symbol string, for instance.
All types except record have preordained rules for their sort order as described in the Avro specification; they cannot be overridden by the user. For records, however, you can control the sort order by specifying the order attribute for a field. It takes one of three values: ascending (the default), descending (to reverse the order), or ignore (so the field is skipped for comparison purposes).
For example, the following schema (SortedStringPair.avsc) defines an ordering of StringPair records by the right field in descending order. The left field is ignored for the purposes of ordering, but it is still present in the projection:
{
  "type": "record",
  "name": "StringPair",
  "doc": "A pair of strings, sorted by right field descending.",
  "fields": [
    {"name": "left", "type": "string", "order": "ignore"},
    {"name": "right", "type": "string", "order": "descending"}
  ]
}
The record's fields are compared pairwise in the document order of the reader's schema. Thus, by specifying an appropriate reader's schema, you can impose an arbitrary ordering on data records. This schema (SwitchedStringPair.avsc) defines a sort order by the right field, then the left:
{
  "type": "record",
  "name": "StringPair",
  "doc": "A pair of strings, sorted by right then left.",
  "fields": [
    {"name": "right", "type": "string"},
    {"name": "left", "type": "string"}
  ]
}
Avro implements efficient binary comparisons. That is to say, Avro does not have to deserialize binary data into objects to perform the comparison, since it can instead work directly on the byte streams. In the case of the original StringPair schema (with no order attributes), for example, Avro implements the binary comparison as follows. The first field, left, is a UTF-8-encoded string, for which Avro can compare the bytes lexicographically. If they differ, then the order is determined, and Avro can stop the comparison there. Otherwise, if the two byte sequences are the same, it compares the second two (right) fields, again lexicographically at the byte level since the field is another UTF-8 string.
Notice that this uesciiption ol a compaiison lunction has exactly the same logic as the
Linaiy compaiatoi we wiote loi ViitaLles in ¨Implementing a RawCompaiatoi loi
speeu¨ on page 10S. The gieat thing is that Avio pioviues the compaiatoi loi us, so we
uon`t have to wiite anu maintain this coue. It`s also easy to change the soit oiuei just
Ly changing the ieauei`s schema. Foi the SortcdStringPair.avsc oi SwitchcdString-
Pair.avsc schemas, the compaiison lunction Avio uses is essentially the same as the one
just uesciiLeu: the uilleience is in which lielus aie consiueieu, the oiuei in which they
aie consiueieu, anu whethei the oiuei is ascenuing oi uescenuing.
Latei in the chaptei we`ll use Avio`s soiting logic in conjunction with MapReuuce to
soit Avio uata liles in paiallel.
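To make the byte-level comparison concrete, here is a minimal plain-Java sketch of the idea. This is not Avro's actual implementation (for one thing, it ignores Avro's varint length prefixes, and the class and method names are invented for illustration); it simply shows how a pair of strings can be ordered directly on their UTF-8 bytes: compare the left fields' bytes unsigned and lexicographically, and fall back to the right fields only on a tie.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: orders (left, right) string pairs on their raw
// UTF-8 bytes, the way a binary comparator for the original StringPair
// schema would -- left field first, right field only on a tie.
class PairByteComparator {

  // Lexicographic comparison of raw bytes, treating each byte as unsigned.
  static int compareBytes(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int cmp = (a[i] & 0xff) - (b[i] & 0xff);
      if (cmp != 0) {
        return cmp; // first differing byte decides the order
      }
    }
    return a.length - b.length; // a shorter prefix sorts first
  }

  static int comparePairs(String left1, String right1,
                          String left2, String right2) {
    int cmp = compareBytes(left1.getBytes(StandardCharsets.UTF_8),
                           left2.getBytes(StandardCharsets.UTF_8));
    if (cmp != 0) {
      return cmp; // order decided by the first field; stop here
    }
    return compareBytes(right1.getBytes(StandardCharsets.UTF_8),
                        right2.getBytes(StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    System.out.println(comparePairs("a", "zz", "b", "aa") < 0); // true
    System.out.println(comparePairs("a", "b", "a", "a") > 0);   // true
  }
}
```

Note that no string objects need to be materialized to decide the order once the bytes are in hand, which is the property that lets Avro compare serialized data directly.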
Avro MapReduce
Avro provides a number of classes for making it easy to run MapReduce programs on
Avro data. For example, AvroMapper and AvroReducer in the org.apache.avro.mapred
package are specializations of Hadoop's (old style) Mapper and Reducer classes. They
eliminate the key-value distinction for inputs and outputs, since Avro data files are just
a sequence of values. However, intermediate data is still divided into key-value pairs
for the shuffle.
Let's rework the MapReduce program for finding the maximum temperature for each
year in the weather dataset, using the Avro MapReduce API. We will represent weather
records using the following schema:
10. A useful consequence of this property is that you can compute an Avro datum's hash code from either
the object or the binary representation (the latter by using the static hashCode() method on BinaryData)
and get the same result in both cases.
{
  "type": "record",
  "name": "WeatherRecord",
  "doc": "A weather reading.",
  "fields": [
    {"name": "year", "type": "int"},
    {"name": "temperature", "type": "int"},
    {"name": "stationId", "type": "string"}
  ]
}
The program in Example 4-12 reads text input (in the format we saw in earlier chapters),
and writes Avro data files containing weather records as output.
Example 4-12. MapReduce program to find the maximum temperature, creating Avro output
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.AvroUtf8InputFormat;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class AvroGenericMaxTemperature extends Configured implements Tool {

  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{" +
      "  \"type\": \"record\"," +
      "  \"name\": \"WeatherRecord\"," +
      "  \"doc\": \"A weather reading.\"," +
      "  \"fields\": [" +
      "    {\"name\": \"year\", \"type\": \"int\"}," +
      "    {\"name\": \"temperature\", \"type\": \"int\"}," +
      "    {\"name\": \"stationId\", \"type\": \"string\"}" +
      "  ]" +
      "}");

  public static class MaxTemperatureMapper
      extends AvroMapper<Utf8, Pair<Integer, GenericRecord>> {
    private NcdcRecordParser parser = new NcdcRecordParser();
    private GenericRecord record = new GenericData.Record(SCHEMA);
    @Override
    public void map(Utf8 line,
        AvroCollector<Pair<Integer, GenericRecord>> collector,
        Reporter reporter) throws IOException {
      parser.parse(line.toString());
      if (parser.isValidTemperature()) {
        record.put("year", parser.getYearInt());
        record.put("temperature", parser.getAirTemperature());
        record.put("stationId", parser.getStationId());
        collector.collect(
            new Pair<Integer, GenericRecord>(parser.getYearInt(), record));
      }
    }
  }

  public static class MaxTemperatureReducer
      extends AvroReducer<Integer, GenericRecord, GenericRecord> {
    @Override
    public void reduce(Integer key, Iterable<GenericRecord> values,
        AvroCollector<GenericRecord> collector, Reporter reporter)
        throws IOException {
      GenericRecord max = null;
      for (GenericRecord value : values) {
        if (max == null ||
            (Integer) value.get("temperature") > (Integer) max.get("temperature")) {
          max = newWeatherRecord(value);
        }
      }
      collector.collect(max);
    }
    private GenericRecord newWeatherRecord(GenericRecord value) {
      GenericRecord record = new GenericData.Record(SCHEMA);
      record.put("year", value.get("year"));
      record.put("temperature", value.get("temperature"));
      record.put("stationId", value.get("stationId"));
      return record;
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }

    JobConf conf = new JobConf(getConf(), getClass());
    conf.setJobName("Max temperature");

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    AvroJob.setInputSchema(conf, Schema.create(Schema.Type.STRING));
    AvroJob.setMapOutputSchema(conf,
        Pair.getPairSchema(Schema.create(Schema.Type.INT), SCHEMA));
    AvroJob.setOutputSchema(conf, SCHEMA);

    conf.setInputFormat(AvroUtf8InputFormat.class);

    AvroJob.setMapperClass(conf, MaxTemperatureMapper.class);
    AvroJob.setReducerClass(conf, MaxTemperatureReducer.class);

    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new AvroGenericMaxTemperature(), args);
    System.exit(exitCode);
  }
}
This program uses the generic Avro mapping. This frees us from generating code to
represent records, at the expense of type safety (field names are referred to by string
value, such as "temperature"). The schema for weather records is inlined in the code
for convenience (and read into the SCHEMA constant), although in practice it might be
more maintainable to read the schema from a local file in the driver code and pass it to
the mapper and reducer via the Hadoop job configuration. (Techniques for achieving
this are discussed in "Side Data Distribution" on page 287.)
There are a couple of differences from the regular Hadoop MapReduce API. The first
is the use of an org.apache.avro.mapred.Pair to wrap the map output key and value in
MaxTemperatureMapper. (The reason that org.apache.avro.mapred.AvroMapper
doesn't have a fixed output key and value is so that map-only jobs can emit just values
to Avro data files.) For this MapReduce program the key is the year (an integer), and
the value is the weather record, which is represented by Avro's GenericRecord.
Avro MapReduce does preserve the notion of key-value pairs for the input to the reducer,
however, since this is what comes out of the shuffle, and it unwraps the Pair before
invoking the org.apache.avro.mapred.AvroReducer. The MaxTemperatureReducer iter-
ates through the records for each key (year) and finds the one with the maximum tem-
perature. It is necessary to make a copy of the record with the highest temperature
found so far, since the iterator reuses the instance for reasons of efficiency (and only
the fields are updated).
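The object-reuse pitfall is easy to reproduce in plain Java, without Hadoop or Avro on the classpath. In the sketch below (entirely hypothetical: the Holder class and reusingIterable() merely stand in for Avro's reused record instance), the iterator hands back the same mutable object for every element, so a "max" loop that keeps an alias to it silently ends up with whatever value was seen last, while the version that copies the fields gets the right answer.

```java
import java.util.Iterator;

// Plain-Java illustration of why the reducer must copy the record it wants
// to keep: the iterable below reuses one mutable holder for every element,
// mimicking the efficiency trick used by Avro MapReduce.
class ReusePitfall {

  static class Holder { int temperature; }

  // Iterates over temperatures, but always returns the same Holder object.
  static Iterable<Holder> reusingIterable(int[] temps) {
    return () -> new Iterator<Holder>() {
      private final Holder holder = new Holder(); // one shared instance
      private int i = 0;
      public boolean hasNext() { return i < temps.length; }
      public Holder next() {
        holder.temperature = temps[i++]; // only the field is updated
        return holder;
      }
    };
  }

  // Wrong: stores a reference to the reused instance.
  static int maxNoCopy(int[] temps) {
    Holder max = null;
    for (Holder h : reusingIterable(temps)) {
      if (max == null || h.temperature > max.temperature) {
        max = h; // aliases the shared holder, which keeps changing!
      }
    }
    return max.temperature;
  }

  // Right: makes a defensive copy of the current maximum.
  static int maxWithCopy(int[] temps) {
    Holder max = null;
    for (Holder h : reusingIterable(temps)) {
      if (max == null || h.temperature > max.temperature) {
        max = new Holder();
        max.temperature = h.temperature; // copy the field, not the reference
      }
    }
    return max.temperature;
  }

  public static void main(String[] args) {
    int[] temps = {31, 78, 12};
    System.out.println(maxNoCopy(temps));   // 12: the last value, not the max
    System.out.println(maxWithCopy(temps)); // 78
  }
}
```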
The second major difference from regular MapReduce is the use of AvroJob for config-
uring the job. AvroJob is a convenience class for specifying the Avro schemas for the
input, map output, and final output data. In this program the input schema is an Avro
string, because we are reading from a text file, and the input format is set correspond-
ingly, to AvroUtf8InputFormat. The map output schema is a pair schema whose key
schema is an Avro int and whose value schema is the weather record schema. The final
output schema is the weather record schema, and the output format is the default,
AvroOutputFormat, which writes to Avro data files.
11. For an example that uses the specific mapping, with generated classes, see the
AvroSpecificMaxTemperature class in the example code.
The following command line shows how to run the program on a small sample dataset:
% hadoop jar avro-examples.jar AvroGenericMaxTemperature \
input/ncdc/sample.txt output
On completion, we can look at the output using the Avro tools JAR to render the Avro
data file as JSON, one record per line:
% java -jar $AVRO_HOME/avro-tools-*.jar tojson output/part-00000.avro
In this example, we used an AvroMapper and an AvroReducer, but the API supports a
mixture of regular MapReduce mappers and reducers with Avro-specific ones, which
is useful for converting between Avro formats and other formats, such as SequenceFiles.
The documentation for the Avro MapReduce package has details.
Sorting using Avro MapReduce
In this section we use Avro's sort capabilities and combine them with MapReduce to
write a program to sort an Avro data file (Example 4-13).
Example 4-13. A MapReduce program to sort an Avro data file
public class AvroSort extends Configured implements Tool {

  static class SortMapper<K> extends AvroMapper<K, Pair<K, K>> {
    public void map(K datum, AvroCollector<Pair<K, K>> collector,
        Reporter reporter) throws IOException {
      collector.collect(new Pair<K, K>(datum, null, datum, null));
    }
  }

  static class SortReducer<K> extends AvroReducer<K, K, K> {
    public void reduce(K key, Iterable<K> values,
        AvroCollector<K> collector,
        Reporter reporter) throws IOException {
      for (K value : values) {
        collector.collect(value);
      }
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 3) {
      System.err.printf(
          "Usage: %s [generic options] <input> <output> <schema-file>\n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }

    String input = args[0];
    String output = args[1];
    String schemaFile = args[2];

    JobConf conf = new JobConf(getConf(), getClass());
    conf.setJobName("Avro sort");

    FileInputFormat.addInputPath(conf, new Path(input));
    FileOutputFormat.setOutputPath(conf, new Path(output));

    Schema schema = new Schema.Parser().parse(new File(schemaFile));
    AvroJob.setInputSchema(conf, schema);
    Schema intermediateSchema = Pair.getPairSchema(schema, schema);
    AvroJob.setMapOutputSchema(conf, intermediateSchema);
    AvroJob.setOutputSchema(conf, schema);

    AvroJob.setMapperClass(conf, SortMapper.class);
    AvroJob.setReducerClass(conf, SortReducer.class);

    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new AvroSort(), args);
    System.exit(exitCode);
  }
}
This program (which uses the generic Avro mapping and hence does not require any
code generation) can sort Avro records of any type, represented in Java by the generic
type parameter K. We choose a value that is the same as the key, so that when the values
are grouped by key we can emit all of the values in the case that more than one of them
share the same key (according to the sorting function), thereby not losing any
records.12 The mapper simply emits an org.apache.avro.mapred.Pair object with this key
and value. The reducer acts as an identity, passing the values through to the (single-
valued) output, which will get written to an Avro data file.
12. We encounter this idea of duplicating information from the key in the value object again in "Secondary
Sort" on page 276.
The sorting happens in the MapReduce shuffle, and the sort function is determined by
the Avro schema that is passed to the program. Let's use the program to sort the
pairs.avro file created earlier, using the SortedStringPair.avsc schema to sort by the right
field in descending order. First we inspect the input using the Avro tools JAR:
% java -jar $AVRO_HOME/avro-tools-*.jar tojson input/avro/pairs.avro
Then we run the sort:
% hadoop jar avro-examples.jar AvroSort input/avro/pairs.avro output \
Finally, we inspect the output and see that it is sorted correctly:
% java -jar $AVRO_HOME/avro-tools-*.jar tojson output/part-00000.avro
Avro MapReduce in other languages
For languages other than Java, there are a few choices for working with Avro data.
AvroAsTextInputFormat is designed to allow Hadoop Streaming programs to read Avro
data files. Each datum in the file is converted to a string, which is the JSON represen-
tation of the datum, or just the raw bytes if the type is Avro bytes. Going the other way,
you can specify AvroTextOutputFormat as the output format of a Streaming job to create
Avro data files with a bytes schema, where each datum is the tab-delimited key-value
pair written from the Streaming output. Both these classes can be found in the
org.apache.avro.mapred package.
For a richer interface than Streaming, Avro provides a connector framework (in the
org.apache.avro.mapred.tether package), which is the Avro analog of Hadoop Pipes.
At the time of writing, there are no bindings for other languages, but a Python imple-
mentation will be available in a future release.
Also worth considering are Pig and Hive, which can both read and write Avro data files
by specifying the appropriate storage formats.
File-Based Data Structures
For some applications, you need a specialized data structure to hold your data. For
doing MapReduce-based processing, putting each blob of binary data into its own file
doesn't scale, so Hadoop developed a number of higher-level containers for these
situations.
Imagine a logfile, where each log record is a new line of text. If you want to log binary
types, plain text isn't a suitable format. Hadoop's SequenceFile class fits the bill in this
situation, providing a persistent data structure for binary key-value pairs. To use it as
a logfile format, you would choose a key, such as a timestamp represented by a LongWrit
able, and the value would be a Writable that represents the quantity being logged.
SequenceFiles also work well as containers for smaller files. HDFS and MapReduce
are optimized for large files, so packing files into a SequenceFile makes storing
and processing the smaller files more efficient. ("Processing a whole file as a re-
cord" on page 240 contains a program to pack files into a SequenceFile.)
Writing a SequenceFile
To create a SequenceFile, use one of its createWriter() static methods, which returns
a SequenceFile.Writer instance. There are several overloaded versions, but they all
require you to specify a stream to write to (either an FSDataOutputStream or a FileSys
tem and Path pairing), a Configuration object, and the key and value types. Optional
arguments include the compression type and codec, a Progressable callback to be in-
formed of write progress, and a Metadata instance to be stored in the SequenceFile
header.
The keys and values stored in a SequenceFile do not necessarily need to be Writable.
Any types that can be serialized and deserialized by a Serialization may be used.
Once you have a SequenceFile.Writer, you then write key-value pairs using the
append() method. Then, when you've finished, you call the close() method (Sequence
File.Writer implements java.io.Closeable).
Example 4-14 shows a short program to write some key-value pairs to a Sequence
File, using the API just described.
Example 4-14. Writing a SequenceFile
public class SequenceFileWriteDemo {

  private static final String[] DATA = {
    "One, two, buckle my shoe",
    "Three, four, shut the door",
    "Five, six, pick up sticks",
    "Seven, eight, lay them straight",
    "Nine, ten, a big fat hen"
  };

  public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);

    IntWritable key = new IntWritable();
    Text value = new Text();
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, conf, path,
          key.getClass(), value.getClass());

      for (int i = 0; i < 100; i++) {
        key.set(100 - i);
        value.set(DATA[i % DATA.length]);
        System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
        writer.append(key, value);
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}
13. In a similar vein, the blog post "A Million Little Files" by Stuart Sierra includes code for converting a
tar file into a SequenceFile, http://stuartsierra.com/2008/04/24/a-million-little-files.
The keys in the sequence file are integers counting down from 100 to 1, represented as
IntWritable objects. The values are Text objects. Before each record is appended to the
SequenceFile.Writer, we call the getLength() method to discover the current position
in the file. (We will use this information about record boundaries in the next section,
when we read the file nonsequentially.) We write the position out to the console, along
with the key and value pairs. The result of running it is shown here:
% hadoop SequenceFileWriteDemo numbers.seq
[128] 100 One, two, buckle my shoe
[173] 99 Three, four, shut the door
[220] 98 Five, six, pick up sticks
[264] 97 Seven, eight, lay them straight
[314] 96 Nine, ten, a big fat hen
[359] 95 One, two, buckle my shoe
[404] 94 Three, four, shut the door
[451] 93 Five, six, pick up sticks
[495] 92 Seven, eight, lay them straight
[545] 91 Nine, ten, a big fat hen
...
[1976] 60 One, two, buckle my shoe
[2021] 59 Three, four, shut the door
[2088] 58 Five, six, pick up sticks
[2132] 57 Seven, eight, lay them straight
[2182] 56 Nine, ten, a big fat hen
...
[4557] 5 One, two, buckle my shoe
[4602] 4 Three, four, shut the door
[4649] 3 Five, six, pick up sticks
[4693] 2 Seven, eight, lay them straight
[4743] 1 Nine, ten, a big fat hen
Reading a SequenceFile
Reading sequence files from beginning to end is a matter of creating an instance of
SequenceFile.Reader and iterating over records by repeatedly invoking one of the
next() methods. Which one you use depends on the serialization framework you are
using. If you are using Writable types, you can use the next() method that takes a key
and a value argument, and reads the next key and value in the stream into these
variables:
public boolean next(Writable key, Writable val)
The return value is true if a key-value pair was read and false if the end of the file has
been reached.
For other, non-Writable serialization frameworks (such as Apache Thrift), you should
use these two methods:
public Object next(Object key) throws IOException
public Object getCurrentValue(Object val) throws IOException
In this case, you need to make sure that the serialization you want to use has been set
in the io.serializations property; see "Serialization Frameworks" on page 110.
If the next() method returns a non-null object, a key-value pair was read from the
stream, and the value can be retrieved using the getCurrentValue() method. Otherwise,
if next() returns null, the end of the file has been reached.
The program in Example 4-15 demonstrates how to read a sequence file that has
Writable keys and values. Note how the types are discovered from the Sequence
File.Reader via calls to getKeyClass() and getValueClass(), then ReflectionUtils is
used to create an instance for the key and an instance for the value. By using this tech-
nique, the program can be used with any sequence file that has Writable keys and values.
Example 4-15. Reading a SequenceFile
public class SequenceFileReadDemo {

  public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);

    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(fs, path, conf);
      Writable key = (Writable)
          ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable)
          ReflectionUtils.newInstance(reader.getValueClass(), conf);
      long position = reader.getPosition();
      while (reader.next(key, value)) {
        String syncSeen = reader.syncSeen() ? "*" : "";
        System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
        position = reader.getPosition(); // beginning of next record
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}
Another feature of the program is that it displays the positions of the sync points in the
sequence file. A sync point is a point in the stream that can be used to resynchronize
with a record boundary if the reader is "lost"; for example, after seeking to an arbitrary
position in the stream. Sync points are recorded by SequenceFile.Writer, which inserts
a special entry to mark the sync point every few records as a sequence file is being
written. Such entries are small enough to incur only a modest storage overhead of less
than 1%. Sync points always align with record boundaries.
Running the program in Example 4-15 shows the sync points in the sequence file as
asterisks. The first one occurs at position 2021 (the second one occurs at position 4075,
but is not shown in the output):
% hadoop SequenceFileReadDemo numbers.seq
[128] 100 One, two, buckle my shoe
[173] 99 Three, four, shut the door
[220] 98 Five, six, pick up sticks
[264] 97 Seven, eight, lay them straight
[314] 96 Nine, ten, a big fat hen
[359] 95 One, two, buckle my shoe
[404] 94 Three, four, shut the door
[451] 93 Five, six, pick up sticks
[495] 92 Seven, eight, lay them straight
[545] 91 Nine, ten, a big fat hen
[590] 90 One, two, buckle my shoe
...
[1976] 60 One, two, buckle my shoe
[2021*] 59 Three, four, shut the door
[2088] 58 Five, six, pick up sticks
[2132] 57 Seven, eight, lay them straight
[2182] 56 Nine, ten, a big fat hen
...
[4557] 5 One, two, buckle my shoe
[4602] 4 Three, four, shut the door
[4649] 3 Five, six, pick up sticks
[4693] 2 Seven, eight, lay them straight
[4743] 1 Nine, ten, a big fat hen
There are two ways to seek to a given position in a sequence file. The first is the
seek() method, which positions the reader at the given point in the file. For example,
seeking to a record boundary works as expected:
reader.seek(359);
assertThat(reader.next(key, value), is(true));
assertThat(((IntWritable) key).get(), is(95));
But if the position in the file is not at a record boundary, the reader fails when the
next() method is called:
reader.seek(360);
reader.next(key, value); // fails with IOException
The second way to find a record boundary makes use of sync points. The sync(long
position) method on SequenceFile.Reader positions the reader at the next sync point
after position. (If there are no sync points in the file after this position, then the reader
will be positioned at the end of the file.) Thus, we can call sync() with any position in
the stream, such as a nonrecord boundary, and the reader will reestablish itself
at the next sync point so reading can continue:
reader.sync(360);
assertThat(reader.getPosition(), is(2021L));
assertThat(reader.next(key, value), is(true));
assertThat(((IntWritable) key).get(), is(59));
SequenceFile.Writer has a method called sync() for inserting a sync
point at the current position in the stream. This is not to be confused
with the identically named but otherwise unrelated sync() method
defined by the Syncable interface for synchronizing buffers to the
underlying device.
Sync points come into their own when using sequence files as input to MapReduce,
since they permit the file to be split, so different portions of it can be processed inde-
pendently by separate map tasks. See "SequenceFileInputFormat" on page 247.
Displaying a SequenceFile with the command-line interface
The hadoop fs command has a -text option to display sequence files in textual form.
It looks at a file's magic number so that it can attempt to detect the type of the file and
appropriately convert it to text. It can recognize gzipped files and sequence files; other-
wise, it assumes the input is plain text.
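The kind of detection -text performs can be sketched in a few lines of plain Java (an illustration of the idea, not Hadoop's actual implementation; the class and method names are invented): sequence files begin with the bytes SEQ, and gzip streams begin with the two bytes 0x1f 0x8b.

```java
// Illustrative sketch of detecting a file's type from its leading bytes,
// as hadoop fs -text does: sequence files start with the magic bytes SEQ,
// gzip streams with 0x1f 0x8b; anything else is treated as plain text.
class MagicNumberSniffer {

  static String detect(byte[] head) {
    if (head.length >= 3
        && head[0] == 'S' && head[1] == 'E' && head[2] == 'Q') {
      return "sequencefile";
    }
    if (head.length >= 2
        && (head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b) {
      return "gzip";
    }
    return "text"; // fall back to treating the input as plain text
  }

  public static void main(String[] args) {
    // The fourth byte of a real sequence file is the version number.
    System.out.println(detect(new byte[] {'S', 'E', 'Q', 6})); // sequencefile
    System.out.println(detect(new byte[] {(byte) 0x1f, (byte) 0x8b})); // gzip
    System.out.println(detect("hello".getBytes()));            // text
  }
}
```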
For sequence files, this command is really useful only if the keys and values have a
meaningful string representation (as defined by the toString() method). Also, if you
have your own key or value classes, then you will need to make sure they are on Ha-
doop's classpath.
Running it on the sequence file we created in the previous section gives the following
output:
% hadoop fs -text numbers.seq | head
100 One, two, buckle my shoe
99 Three, four, shut the door
98 Five, six, pick up sticks
97 Seven, eight, lay them straight
96 Nine, ten, a big fat hen
95 One, two, buckle my shoe
94 Three, four, shut the door
93 Five, six, pick up sticks
92 Seven, eight, lay them straight
91 Nine, ten, a big fat hen
Sorting and merging SequenceFiles
The most powerful way of sorting (and merging) one or more sequence files is to use
MapReduce. MapReduce is inherently parallel and will let you specify the number of
reducers to use, which determines the number of output partitions. For example, by
specifying one reducer, you get a single output file. We can use the sort example that
comes with Hadoop by specifying that the input and output are sequence files, and by
setting the key and value types:
% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort -r 1 \
-inFormat org.apache.hadoop.mapred.SequenceFileInputFormat \
-outFormat org.apache.hadoop.mapred.SequenceFileOutputFormat \
-outKey org.apache.hadoop.io.IntWritable \
-outValue org.apache.hadoop.io.Text \
numbers.seq sorted
% hadoop fs -text sorted/part-00000 | head
1 Nine, ten, a big fat hen
2 Seven, eight, lay them straight
3 Five, six, pick up sticks
4 Three, four, shut the door
5 One, two, buckle my shoe
6 Nine, ten, a big fat hen
7 Seven, eight, lay them straight
8 Five, six, pick up sticks
9 Three, four, shut the door
10 One, two, buckle my shoe
Sorting is covered in more detail in "Sorting" on page 266.
As an alternative to using MapReduce for sort/merge, there is a SequenceFile.Sorter
class that has a number of sort() and merge() methods. These functions predate Map-
Reduce and are lower-level functions than MapReduce (for example, to get parallelism,
you need to partition your data manually), so in general MapReduce is the preferred
approach to sort and merge sequence files.
The SequenceFile format
A sequence file consists of a header followed by one or more records (see Figure 4-2).
The first three bytes of a sequence file are the bytes SEQ, which act as a magic number,
followed by a single byte representing the version number. The header contains other
fields, including the names of the key and value classes, compression details, user-
defined metadata, and the sync marker. Recall that the sync marker is used to allow
a reader to synchronize to a record boundary from any position in the file. Each file has
a randomly generated sync marker, whose value is stored in the header. Sync markers
appear between records in the sequence file. They are designed to incur less than a 1%
storage overhead, so they don't necessarily appear between every pair of records (such
is the case for short records).
14. Full details of the format of these fields may be found in SequenceFile's documentation and source code.
The internal format of the records depends on whether compression is enabled, and if
it is, whether it is record compression or block compression.
If no compression is enabled (the default), then each record is made up of the record
length (in bytes), the key length, the key, and then the value. The length fields are
written as four-byte integers adhering to the contract of the writeInt() method of
java.io.DataOutput. Keys and values are serialized using the Serialization defined for
the class being written to the sequence file.
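As a rough sketch of this layout, the following plain-Java code lays out one uncompressed record with a ByteBuffer (big-endian by default, matching DataOutput.writeInt's contract). This is a simplification for illustration: a real sequence file also carries the header and sync markers described above, and the class and method names here are invented.

```java
import java.nio.ByteBuffer;

// Simplified sketch of the uncompressed record layout: record length,
// key length, key bytes, value bytes, with the two length fields written
// as four-byte big-endian ints. Header and sync markers are omitted.
class RecordLayoutSketch {

  static byte[] encodeRecord(byte[] key, byte[] value) {
    ByteBuffer buf = ByteBuffer.allocate(8 + key.length + value.length);
    buf.putInt(key.length + value.length); // record length in bytes
    buf.putInt(key.length);                // key length
    buf.put(key);                          // serialized key
    buf.put(value);                        // serialized value (with record
                                           // compression, only these bytes
                                           // would be codec-compressed)
    return buf.array();
  }

  public static void main(String[] args) {
    byte[] rec = encodeRecord(new byte[] {107},            // 1-byte key
                              new byte[] {1, 2, 3, 4, 5}); // 5-byte value
    // 4 (record length) + 4 (key length) + 1 (key) + 5 (value) = 14 bytes
    System.out.println(rec.length); // 14
  }
}
```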
The format for record compression is almost identical to no compression, except the
value bytes are compressed using the codec defined in the header. Note that keys are
not compressed.
Block compression compresses multiple records at once; it is therefore more compact
than, and should generally be preferred over, record compression because it has the
opportunity to take advantage of similarities between records. (See Figure 4-3.) Records
are added to a block until it reaches a minimum size in bytes, defined by the
io.seqfile.compress.blocksize property; the default is 1 million bytes. A sync marker
is written before the start of every block. The format of a block is a field indicating the
number of records in the block, followed by four compressed fields: the key lengths,
the keys, the value lengths, and the values.
Figure 4-2. The internal structure of a sequence file with no compression and record compression
MapFile
A MapFile is a sorted SequenceFile with an index to permit lookups by key. MapFile can
be thought of as a persistent form of java.util.Map (although it doesn't implement this
interface), which is able to grow beyond the size of a Map that is kept in memory.
Writing a MapFile
Writing a MapFile is similar to writing a SequenceFile: you create an instance of
MapFile.Writer, then call the append() method to add entries in order. (Attempting to
add entries out of order will result in an IOException.) Keys must be instances of
WritableComparable, and values must be Writable; contrast this with SequenceFile,
which can use any serialization framework for its entries.
The program in Example 4-16 creates a MapFile and writes some entries to it. It is very
similar to the program in Example 4-14 for creating a SequenceFile.
Example 4-16. Writing a MapFile
public class MapFileWriteDemo {

  private static final String[] DATA = {
    "One, two, buckle my shoe",
    "Three, four, shut the door",
    "Five, six, pick up sticks",
    "Seven, eight, lay them straight",
    "Nine, ten, a big fat hen"
  };

  public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    IntWritable key = new IntWritable();
    Text value = new Text();
    MapFile.Writer writer = null;
    try {
      writer = new MapFile.Writer(conf, fs, uri,
          key.getClass(), value.getClass());

      for (int i = 0; i < 1024; i++) {
        key.set(i + 1);
        value.set(DATA[i % DATA.length]);
        writer.append(key, value);
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}
Figure 4-3. The internal structure of a sequence file with block compression
Let's use this program to build a MapFile:
% hadoop MapFileWriteDemo numbers.map
If we look at the MapFile, we see it's actually a directory containing two files called
data and index:
% ls -l numbers.map
total 104
-rw-r--r-- 1 tom tom 47898 Jul 29 22:06 data
-rw-r--r-- 1 tom tom 251 Jul 29 22:06 index
Both files are SequenceFiles. The data file contains all of the entries, in order:
% hadoop fs -text numbers.map/data | head
1 One, two, buckle my shoe
2 Three, four, shut the door
3 Five, six, pick up sticks
4 Seven, eight, lay them straight
5 Nine, ten, a big fat hen
6 One, two, buckle my shoe
7 Three, four, shut the door
8 Five, six, pick up sticks
9 Seven, eight, lay them straight
10 Nine, ten, a big fat hen
The index file contains a fraction of the keys, and contains a mapping from the key to
that key's offset in the data file:
% hadoop fs -text numbers.map/index
1 128
129 6079
257 12054
385 18030
513 24002
641 29976
769 35947
897 41922
As we can see from the output, by default only every 128th key is included in the index,
although you can change this value either by setting the io.map.index.interval
property or by calling the setIndexInterval() method on the MapFile.Writer instance.
A reason to increase the index interval would be to decrease the amount of memory
that the MapFile needs to store the index. Conversely, you might decrease the interval
to improve the time for random selection (since fewer records need to be skipped on
average) at the expense of memory usage.
Since the index is only a partial index of keys, MapFile is not able to provide methods
to enumerate, or even count, all the keys it contains. The only way to perform these
operations is to read the whole file.
Reading a MapFile
Iteiating thiough the entiies in oiuei in a MapFile is similai to the pioceuuie loi a
SequenceFile: you cieate a MapFile.Reader, then call the next() methou until it ietuins
false, signilying that no entiy was ieau Lecause the enu ol the lile was ieacheu:
public boolean next(WritableComparable key, Writable val) throws IOException
A random access lookup can be performed by calling the get() method:
public Writable get(WritableComparable key, Writable val) throws IOException
The return value is used to determine whether an entry was found in the MapFile; if it's null,
then no value exists for the given key. If the key was found, then the value for that key is
read into val, as well as being returned from the method call.
It might be helpful to understand how this is implemented. Here is a snippet of code
that retrieves an entry for the MapFile we created in the previous section:
Text value = new Text();
reader.get(new IntWritable(496), value);
assertThat(value.toString(), is("One, two, buckle my shoe"));
For this operation, the MapFile.Reader reads the index file into memory (this is cached
so that subsequent random access calls will use the same in-memory index). The reader
then performs a binary search on the in-memory index to find the key in the index that
is less than or equal to the search key, 496. In this example, the index key found is 385,
with value 18030, which is the offset in the data file. Next, the reader seeks to this offset
in the data file and reads entries until the key is greater than or equal to the search key,
496. In this case, a match is found and the value is read from the data file. Overall, a
lookup takes a single disk seek and a scan through up to 128 entries on disk. For a
random-access read, this is actually very efficient.
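To make this mechanism concrete, here is a small pure-Java sketch (not Hadoop code; the class and method names are invented for illustration) that models a partial index over sorted keys: every 128th key is kept in memory with its position, a lookup binary-searches for the floor entry (via TreeMap.floorEntry), and then scans at most 128 entries:

```java
import java.util.Map;
import java.util.TreeMap;

public class PartialIndexDemo {
    static final int INTERVAL = 128;

    // Build the in-memory "index": every 128th key mapped to its position.
    static TreeMap<Integer, Integer> buildIndex(int[] sortedKeys) {
        TreeMap<Integer, Integer> index = new TreeMap<>();
        for (int pos = 0; pos < sortedKeys.length; pos += INTERVAL) {
            index.put(sortedKeys[pos], pos);
        }
        return index;
    }

    // Binary-search the index for the floor entry, then scan up to 128 entries.
    static int lookup(int[] sortedKeys, TreeMap<Integer, Integer> index, int key) {
        Map.Entry<Integer, Integer> floor = index.floorEntry(key);
        if (floor == null) {
            return -1; // key is smaller than the smallest indexed key
        }
        int limit = Math.min(floor.getValue() + INTERVAL, sortedKeys.length);
        for (int pos = floor.getValue(); pos < limit; pos++) {
            if (sortedKeys[pos] == key) {
                return pos; // found
            }
            if (sortedKeys[pos] > key) {
                return -1; // passed the search key: not present
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] keys = new int[1000];
        for (int i = 0; i < keys.length; i++) {
            keys[i] = i + 1; // keys 1..1000, in sorted order
        }
        TreeMap<Integer, Integer> index = buildIndex(keys);
        // The floor entry for 496 is key 385 at position 384; the scan then
        // finds 496 at position 495.
        System.out.println(lookup(keys, index, 496));
    }
}
```

The shape of the search mirrors MapFile's: one seek to the floor entry, then a bounded scan.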
The getClosest() method is like get(), except it returns the "closest" match to the
specified key rather than returning null on no match. More precisely, if the MapFile
contains the specified key, then that is the entry returned; otherwise, the key in the
MapFile that is immediately after (or before, according to a boolean argument) the
specified key is returned.
A very large MapFile's index can take up a lot of memory. Rather than reindexing to change
the index interval, it is possible to load only a fraction of the index keys into memory
when reading the MapFile by setting the io.map.index.skip property. This property is
normally 0, which means no index keys are skipped; a value of 1 means skip one key
for every key in the index (so every other key ends up in the index), 2 means skip two
keys for every key in the index (so one third of the keys end up in the index), and so
on. Larger skip values save memory but at the expense of lookup time, since more
entries have to be scanned on disk, on average.
MapFile variants
Hadoop comes with a few variants on the general key-value MapFile interface:
• SetFile is a specialization of MapFile for storing a set of Writable keys. The keys
must be added in sorted order.
• ArrayFile is a MapFile where the key is an integer representing the index of the
element in the array, and the value is a Writable value.
• BloomMapFile is a MapFile that offers a fast version of the get() method, especially
for sparsely populated files. The implementation uses a dynamic Bloom filter for
testing whether a given key is in the map. The test is very fast since it is in-memory,
but it has a nonzero probability of false positives, in which case the regular
get() method is called.
There are two tuning parameters: io.mapfile.bloom.size for the (approximate)
number of entries in the map (default 1,048,576), and io.map
file.bloom.error.rate for the desired maximum error rate (default 0.005, which
is 0.5%).
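The Bloom filter test itself can be sketched in a few lines of plain Java (this is a toy, not Hadoop's dynamic Bloom filter; the class name and hash choices are invented for illustration). The key property is the one BloomMapFile relies on: no false negatives, so a negative answer lets it skip the disk lookup entirely, while a rare false positive merely costs a regular get() call:

```java
import java.util.BitSet;

public class BloomSketch {
    private final BitSet bits = new BitSet(1 << 16);

    // Two cheap hash functions derived from hashCode() (toy quality only).
    private int[] hashes(String key) {
        int h = key.hashCode();
        return new int[] { h & 0xffff, (h * 31 + 17) & 0xffff };
    }

    void add(String key) {
        for (int h : hashes(key)) {
            bits.set(h);
        }
    }

    // May return true for a key never added (false positive), but never
    // returns false for a key that was added (no false negatives).
    boolean mightContain(String key) {
        for (int h : hashes(key)) {
            if (!bits.get(h)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomSketch filter = new BloomSketch();
        filter.add("1950");
        System.out.println(filter.mightContain("1950")); // always true
    }
}
```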
Converting a SequenceFile to a MapFile
One way of looking at a MapFile is as an indexed and sorted SequenceFile. So it's quite
natural to want to be able to convert a SequenceFile into a MapFile. We covered how
to sort a SequenceFile in "Sorting and merging SequenceFiles" on page 138, so here we
look at how to create an index for a SequenceFile. The program in Example 4-17 hinges
around the static utility method fix() on MapFile, which re-creates the index for a MapFile.
Example 4-17. Re-creating the index for a MapFile
public class MapFileFixer {

  public static void main(String[] args) throws Exception {
    String mapUri = args[0];
    
    Configuration conf = new Configuration();
    
    FileSystem fs = FileSystem.get(URI.create(mapUri), conf);
    Path map = new Path(mapUri);
    Path mapData = new Path(map, MapFile.DATA_FILE_NAME);
    
    // Get key and value types from data sequence file
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, mapData, conf);
    Class keyClass = reader.getKeyClass();
    Class valueClass = reader.getValueClass();
    reader.close();
    
    // Create the map file index file
    long entries = MapFile.fix(fs, map, keyClass, valueClass, false, conf);
    System.out.printf("Created MapFile %s with %d entries\n", map, entries);
  }
}
The fix() method is usually used for re-creating corrupted indexes, but since it creates
a new index from scratch, it's exactly what we need here. The recipe is as follows:
1. Sort the sequence file numbers.seq into a new directory called number.map that will
become the MapFile (if the sequence file is already sorted, then you can skip this
step. Instead, copy it to a file number.map/data, then go to step 3):
% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort -r 1 \
-inFormat org.apache.hadoop.mapred.SequenceFileInputFormat \
-outFormat org.apache.hadoop.mapred.SequenceFileOutputFormat \
-outKey org.apache.hadoop.io.IntWritable \
-outValue org.apache.hadoop.io.Text \
numbers.seq numbers.map
2. Rename the MapReduce output to be the data file:
% hadoop fs -mv numbers.map/part-00000 numbers.map/data
3. Create the index file:
% hadoop MapFileFixer numbers.map
Created MapFile numbers.map with 100 entries
The MapFile numbers.map now exists and can be used.
Developing a MapReduce Application
In Chapter 2, we introduced the MapReduce model. In this chapter, we look at the
practical aspects of developing a MapReduce application in Hadoop.
Writing a program in MapReduce has a certain flow to it. You start by writing your
map and reduce functions, ideally with unit tests to make sure they do what you expect.
Then you write a driver program to run a job, which can run from your IDE using a
small subset of the data to check that it is working. If it fails, then you can use your
IDE's debugger to find the source of the problem. With this information, you can
expand your unit tests to cover this case and improve your mapper or reducer as
appropriate to handle such input correctly.
When the program runs as expected against the small dataset, you are ready to unleash
it on a cluster. Running against the full dataset is likely to expose some more issues,
which you can fix as before, by expanding your tests and mapper or reducer to handle
the new cases. Debugging failing programs in the cluster is a challenge, so we look at
some common techniques to make it easier.
After the program is working, you may wish to do some tuning, first by running through
some standard checks for making MapReduce programs faster and then by doing task
profiling. Profiling distributed programs is not trivial, but Hadoop has hooks to aid the
process.
Before we start writing a MapReduce program, we need to set up and configure the
development environment. And to do that, we need to learn a bit about how Hadoop
does configuration.
The Configuration API
Components in Hadoop are configured using Hadoop's own configuration API. An
instance of the Configuration class (found in the org.apache.hadoop.conf package)
represents a collection of configuration properties and their values. Each property is
named by a String, and the type of a value may be one of several types, including Java
primitives such as boolean, int, long, and float, and other useful types such as String, Class,
java.io.File, and collections of Strings.
Configurations read their properties from resources: XML files with a simple structure
for defining name-value pairs. See Example 5-1.
Example 5-1. A simple configuration file, configuration-1.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>color</name>
    <value>yellow</value>
    <description>Color</description>
  </property>

  <property>
    <name>size</name>
    <value>10</value>
    <description>Size</description>
  </property>

  <property>
    <name>weight</name>
    <value>heavy</value>
    <final>true</final>
    <description>Weight</description>
  </property>

  <property>
    <name>size-weight</name>
    <value>${size},${weight}</value>
    <description>Size and weight</description>
  </property>
</configuration>
Assuming this configuration file is in a file called configuration-1.xml, we can access its
properties using a piece of code like this:
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");

assertThat(conf.get("color"), is("yellow"));
assertThat(conf.getInt("size", 0), is(10));
assertThat(conf.get("breadth", "wide"), is("wide"));
There are a couple of things to note: type information is not stored in the XML file;
instead, properties can be interpreted as a given type when they are read. Also, the
get() methods allow you to specify a default value, which is used if the property is not
defined in the XML file, as in the case of breadth here.
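This string-storage, typed-read behavior can be mimicked in a few lines of plain Java (a toy model, not Hadoop's Configuration class; the class name is invented): values are held as strings and converted to the requested type only when read, with the caller's default returned when the name is absent:

```java
import java.util.HashMap;
import java.util.Map;

public class ConfDemo {
    private final Map<String, String> props = new HashMap<>();

    void set(String name, String value) {
        props.put(name, value); // everything is stored as a string
    }

    String get(String name, String defaultValue) {
        return props.getOrDefault(name, defaultValue);
    }

    int getInt(String name, int defaultValue) {
        String v = props.get(name); // interpreted as an int only on read
        return v == null ? defaultValue : Integer.parseInt(v);
    }

    public static void main(String[] args) {
        ConfDemo conf = new ConfDemo();
        conf.set("color", "yellow");
        conf.set("size", "10");
        System.out.println(conf.get("color", "red"));    // yellow
        System.out.println(conf.getInt("size", 0));      // 10
        System.out.println(conf.get("breadth", "wide")); // default used: wide
    }
}
```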
Combining Resources
Things get interesting when more than one resource is used to define a configuration.
This is used in Hadoop to separate out the default properties for the system, defined
internally in a file called core-default.xml, from the site-specific overrides in core-site.xml.
The file in Example 5-2 defines the size and weight properties.
Example 5-2. A second configuration file, configuration-2.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>size</name>
    <value>12</value>
  </property>

  <property>
    <name>weight</name>
    <value>light</value>
  </property>
</configuration>
Resources are added to a Configuration in order:
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
conf.addResource("configuration-2.xml");
Properties defined in resources that are added later override the earlier definitions. So
the size property takes its value from the second configuration file, configuration-2.xml:
assertThat(conf.getInt("size", 0), is(12));
However, properties that are marked as final cannot be overridden in later definitions.
The weight property is final in the first configuration file, so the attempt to override it
in the second fails, and it takes the value from the first:
assertThat(conf.get("weight"), is("heavy"));
Attempting to override final properties usually indicates a configuration error, so this
results in a warning message being logged to aid diagnosis. Administrators mark properties
as final in the daemon's site files when they don't want users to change them in their
client-side configuration files or job submission parameters.
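The override-except-final rule can be sketched in plain Java (again a toy model, not Hadoop's implementation; the class name is invented): each added resource overwrites earlier values unless an earlier resource marked the property final:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class LayeredConfDemo {
    final Map<String, String> props = new HashMap<>();
    private final Set<String> finals = new HashSet<>();

    // Later resources override earlier ones, except for final properties.
    void addResource(Map<String, String> resource, Set<String> finalNames) {
        for (Map.Entry<String, String> e : resource.entrySet()) {
            if (!finals.contains(e.getKey())) {
                props.put(e.getKey(), e.getValue());
            }
            // else: the attempted override of a final property is ignored
            // (Hadoop would log a warning at this point)
        }
        finals.addAll(finalNames);
    }

    public static void main(String[] args) {
        LayeredConfDemo conf = new LayeredConfDemo();
        // First resource: size=10, weight=heavy (weight marked final).
        conf.addResource(Map.of("size", "10", "weight", "heavy"), Set.of("weight"));
        // Second resource tries to override both.
        conf.addResource(Map.of("size", "12", "weight", "light"), Set.of());
        System.out.println(conf.props.get("size"));   // 12: later resource wins
        System.out.println(conf.props.get("weight")); // heavy: final sticks
    }
}
```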
Variable Expansion
Configuration properties can be defined in terms of other properties, or system properties.
For example, the property size-weight in the first configuration file is defined
as ${size},${weight}, and these properties are expanded using the values found in the
configuration:
assertThat(conf.get("size-weight"), is("12,heavy"));
System properties take priority over properties defined in resource files:
System.setProperty("size", "14");
assertThat(conf.get("size-weight"), is("14,heavy"));
This feature is useful for overriding properties on the command line by using
-Dproperty=value JVM arguments.
Note that while configuration properties can be defined in terms of system properties,
unless system properties are redefined using configuration properties, they are not
accessible through the configuration API. Hence:
System.setProperty("length", "2");
assertThat(conf.get("length"), is((String) null));
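The expansion rules just described (resolve each ${name} from system properties first, falling back to the configuration's own value) can be sketched in plain Java; this is a toy model, not Hadoop's implementation:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExpansionDemo {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    // Replace each ${name}: a system property wins; otherwise the
    // configuration's own value for that name is used.
    static String expand(String value, Map<String, String> conf) {
        Matcher m = VAR.matcher(value);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String sub = System.getProperty(name);
            if (sub == null) {
                sub = conf.get(name);
            }
            m.appendReplacement(sb, Matcher.quoteReplacement(sub));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of("size", "12", "weight", "heavy");
        System.out.println(expand("${size},${weight}", conf)); // 12,heavy
        System.setProperty("size", "14");
        System.out.println(expand("${size},${weight}", conf)); // 14,heavy
    }
}
```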
Configuring the Development Environment
The first step is to download the version of Hadoop that you plan to use and unpack
it on your development machine (this is described in Appendix A). Then, in your
favorite IDE, create a new project and add all the JAR files from the top level of the
unpacked distribution and from the lib directory to the classpath. You will then be able
to compile Java Hadoop programs and run them in local (standalone) mode within the
IDE.
For Eclipse users, there is a plug-in available for browsing HDFS and
launching MapReduce programs. Instructions are available on the Hadoop
wiki at http://wiki.apache.org/hadoop/EclipsePlugIn.
Alternatively, Karmasphere provides Eclipse and NetBeans plug-ins for
developing and running MapReduce jobs and browsing Hadoop clusters.
Managing Configuration
When developing Hadoop applications, it is common to switch between running the
application locally and running it on a cluster. In fact, you may have several clusters
you work with, or you may have a local "pseudo-distributed" cluster that you like to
test on (a pseudo-distributed cluster is one whose daemons all run on the local machine;
setting up this mode is covered in Appendix A, too).
One way to accommodate these variations is to have Hadoop configuration files
containing the connection settings for each cluster you run against, and to specify which one
you are using when you run Hadoop applications or tools. As a matter of best practice,
it's recommended to keep these files outside Hadoop's installation directory tree, as
this makes it easy to switch between Hadoop versions without duplicating or losing
settings.
For the purposes of this book, we assume the existence of a directory called conf that
contains three configuration files: hadoop-local.xml, hadoop-localhost.xml, and
hadoop-cluster.xml (these are available in the example code for this book). Note that
there is nothing special about the names of these files; they are just convenient ways
to package up some configuration settings. (Compare this to Table A-1 in Appendix A,
which sets out the equivalent server-side configurations.)
The hadoop-local.xml file contains the default Hadoop configuration for the default
filesystem and the jobtracker:
<?xml version="1.0"?>
<configuration>

  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>

</configuration>
The settings in hadoop-localhost.xml point to a namenode and a jobtracker both running
on localhost:
<?xml version="1.0"?>
<configuration>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost/</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>

</configuration>
Finally, hadoop-cluster.xml contains details of the cluster's namenode and jobtracker
addresses. In practice, you would name the file after the name of the cluster, rather
than "cluster" as we have here:
<?xml version="1.0"?>
<configuration>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode/</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker:8021</value>
  </property>

</configuration>
You can add other configuration properties to these files as needed. For example, if you
wanted to set your Hadoop username for a particular cluster, you could do it in the
appropriate file.
Setting User Identity
The user identity that Hadoop uses for permissions in HDFS is determined by running
the whoami command on the client system. Similarly, the group names are derived from
the output of running groups.
If, however, your Hadoop user identity is different from the name of your user account
on your client machine, then you can explicitly set your Hadoop username and group
names by setting the hadoop.job.ugi property. The username and group names are
specified as a comma-separated list of strings (e.g., preston,directors,inventors would
set the username to preston and the group names to directors and inventors).
You can set the user identity that the HDFS web interface runs as by setting
dfs.web.ugi using the same syntax. By default, it is webuser,webgroup, which is not a
superuser, so system files are not accessible through the web interface.
Notice that, by default, there is no authentication with this system. See "Security"
on page 323 for how to use Kerberos authentication with Hadoop.
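As a sketch, such a property would be set in the relevant configuration file with the same comma-separated syntax (the values shown are the defaults quoted above):

```xml
<property>
  <name>dfs.web.ugi</name>
  <value>webuser,webgroup</value>
</property>
```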
With this setup, it is easy to use any configuration with the -conf command-line switch.
For example, the following command shows a directory listing on the HDFS server
running in pseudo-distributed mode on localhost:
% hadoop fs -conf conf/hadoop-localhost.xml -ls .
Found 2 items
drwxr-xr-x - tom supergroup 0 2009-04-08 10:32 /user/tom/input
drwxr-xr-x - tom supergroup 0 2009-04-08 13:09 /user/tom/output
If you omit the -conf option, then you pick up the Hadoop configuration in the conf
subdirectory under $HADOOP_INSTALL. Depending on how you set this up, this may be
for a standalone setup or a pseudo-distributed cluster.
Tools that come with Hadoop support the -conf option, but it's also straightforward
to make your own programs (such as programs that run MapReduce jobs) support it, too,
using the Tool interface.
GenericOptionsParser, Tool, and ToolRunner
Hadoop comes with a few helper classes for making it easier to run jobs from the
command line. GenericOptionsParser is a class that interprets common Hadoop
command-line options and sets them on a Configuration object for your application to
use as desired. You don't usually use GenericOptionsParser directly, as it's more
convenient to implement the Tool interface and run your application with the
ToolRunner, which uses GenericOptionsParser internally:
public interface Tool extends Configurable {
  int run(String [] args) throws Exception;
}
Example 5-3 shows a very simple implementation of Tool, for printing the keys and
values of all the properties in the Tool's Configuration object.
Example 5-3. An example Tool implementation for printing the properties in a Configuration
public class ConfigurationPrinter extends Configured implements Tool {

  static {
    Configuration.addDefaultResource("hdfs-default.xml");
    Configuration.addDefaultResource("hdfs-site.xml");
    Configuration.addDefaultResource("mapred-default.xml");
    Configuration.addDefaultResource("mapred-site.xml");
  }

  @Override
  public int run(String[] args) throws Exception {
    Configuration conf = getConf();
    for (Entry<String, String> entry: conf) {
      System.out.printf("%s=%s\n", entry.getKey(), entry.getValue());
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new ConfigurationPrinter(), args);
    System.exit(exitCode);
  }
}
We make ConfigurationPrinter a subclass of Configured, which is an implementation
of the Configurable interface. All implementations of Tool need to implement
Configurable (since Tool extends it), and subclassing Configured is often the easiest way
to achieve this. The run() method obtains the Configuration using Configurable's
getConf() method and then iterates over it, printing each property to standard output.
The static block makes sure that the HDFS and MapReduce configurations are picked
up in addition to the core ones (which Configuration knows about already).
ConfigurationPrinter's main() method does not invoke its own run() method directly.
Instead, we call ToolRunner's static run() method, which takes care of creating a
Configuration object for the Tool before calling its run() method. ToolRunner also uses
a GenericOptionsParser to pick up any standard options specified on the command line
and set them on the Configuration instance. We can see the effect of picking up the
properties specified in conf/hadoop-localhost.xml by running the following command:
% hadoop ConfigurationPrinter -conf conf/hadoop-localhost.xml \
| grep mapred.job.tracker=
Which Properties Can I Set?
ConfigurationPrinter is a useful tool for telling you what a property is set to in your
environment.
You can also see the default settings for all the public properties in Hadoop by looking
in the docs directory of your Hadoop installation for HTML files called core-default.html,
hdfs-default.html, and mapred-default.html. Each property has a description
that explains what it is for and what values it can be set to.
Be aware that some properties have no effect when set in the client configuration. For
example, if in your job submission you set mapred.tasktracker.map.tasks.maximum with
the expectation that it would change the number of task slots for the tasktrackers running
your job, then you would be disappointed, since this property is honored only if
set in the tasktracker's mapred-site.xml file. In general, you can tell the component
where a property should be set by its name, so the fact that mapred.task
tracker.map.tasks.maximum starts with mapred.tasktracker gives you a clue that it can
be set only for the tasktracker daemon. This is not a hard and fast rule, however, so in
some cases you may need to resort to trial and error, or even to reading the source.
We discuss many of Hadoop's most important configuration properties throughout
this book. You can find a configuration property reference on the book's website.
GenericOptionsParser also allows you to set individual properties. For example:
% hadoop ConfigurationPrinter -D color=yellow | grep color
The -D option is used to set the configuration property with key color to the value
yellow. Options specified with -D take priority over properties from the configuration
files. This is very useful: you can put defaults into configuration files and then override
them with the -D option as needed. A common example of this is setting the number
of reducers for a MapReduce job via -D mapred.reduce.tasks=n. This will override the
number of reducers set on the cluster or set in any client-side configuration files.
The other options that GenericOptionsParser and ToolRunner support are listed in
Table 5-1. You can find more on Hadoop's configuration API in "The Configuration
API" on page 146.
Do not confuse setting Hadoop properties using the -D
property=value option to GenericOptionsParser (and ToolRunner) with
setting JVM system properties using the -Dproperty=value option to the
java command. The syntax for JVM system properties does not allow
any whitespace between the D and the property name, whereas
GenericOptionsParser requires them to be separated by whitespace.
JVM system properties are retrieved from the java.lang.System class,
whereas Hadoop properties are accessible only from a Configuration
object. So, the following command will print nothing, since the
System class is not used by ConfigurationPrinter:
% hadoop -Dcolor=yellow ConfigurationPrinter | grep color
If you want to be able to set configuration through system properties,
then you need to mirror the system properties of interest in the
configuration file. See "Variable Expansion" on page 148 for further
discussion.
Table 5-1. GenericOptionsParser and ToolRunner options
Option | Description
-D property=value | Sets the given Hadoop configuration property to the given value. Overrides any default or site properties in the configuration, and any properties set via the -conf option.
-conf filename ... | Adds the given files to the list of resources in the configuration. This is a convenient way to set site properties or to set a number of properties at once.
-fs uri | Sets the default filesystem to the given URI. Shortcut for -D fs.default.name=uri.
-jt host:port | Sets the jobtracker to the given host and port. Shortcut for -D mapred.job.tracker=host:port.
-files file1,file2,... | Copies the specified files from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by the jobtracker (usually HDFS) and makes them available to MapReduce programs in the task's working directory. (See "Distributed Cache" on page 288 for more on the distributed cache mechanism for copying files to tasktracker machines.)
-archives archive1,archive2,... | Copies the specified archives from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by the jobtracker (usually HDFS), unarchives them, and makes them available to MapReduce programs in the task's working directory.
-libjars jar1,jar2,... | Copies the specified JAR files from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by the jobtracker (usually HDFS), and adds them to the MapReduce task's classpath. This option is a useful way of shipping JAR files that a job is dependent on.
Writing a Unit Test
The map and reduce functions in MapReduce are easy to test in isolation, which is a
consequence of their functional style. For known inputs, they produce known outputs.
However, since outputs are written to a Context (or an OutputCollector in the old API),
rather than simply being returned from the method call, the Context needs to be
replaced with a mock so that its outputs can be verified. There are several Java mock
object frameworks that can help build mocks; here we use Mockito, which is noted for
its clean syntax, although any mock framework should work just as well.
All of the tests described here can be run from within an IDE.
The test for the mapper is shown in Example 5-4.
Example 5-4. Unit test for MaxTemperatureMapper
import static org.mockito.Mockito.*;
import java.io.IOException;
import org.apache.hadoop.io.*;
import org.junit.*;

public class MaxTemperatureMapperTest {

  @Test
  public void processesValidRecord() throws IOException, InterruptedException {
    MaxTemperatureMapper mapper = new MaxTemperatureMapper();

    Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
                                  // Year ^^^^
        "99999V0203201N00261220001CN9999999N9-00111+99999999999");
                              // Temperature ^^^^^
    MaxTemperatureMapper.Context context =
        mock(MaxTemperatureMapper.Context.class);

    mapper.map(null, value, context);

    verify(context).write(new Text("1950"), new IntWritable(-11));
  }
}

1. See also the MRUnit project (http://incubator.apache.org/mrunit/), which aims to make unit testing
MapReduce programs easier.
The test is very simple: it passes a weather record as input to the mapper, then checks
that the output is the year and temperature reading. The input key is ignored by the mapper,
so we can pass in anything, including null, as we do here. To create a mock Context,
we call Mockito's mock() method (a static import), passing the class of the type we want
to mock. Then we invoke the mapper's map() method, which executes the code being
tested. Finally, we verify that the mock object was called with the correct method and
arguments, using Mockito's verify() method (again, statically imported). Here we
verify that Context's write() method was called with a Text object representing the year
(1950) and an IntWritable representing the temperature (-1.1°C).
Proceeding in a test-driven fashion, we create a Mapper implementation that passes the
test (see Example 5-5). Since we will be evolving the classes in this chapter, each is put
in a different package indicating its version, for ease of exposition. For example, v1.Max
TemperatureMapper is version 1 of MaxTemperatureMapper. In reality, of course, you would
evolve classes without repackaging them.
Example 5-5. First version of a Mapper that passes MaxTemperatureMapperTest
public class MaxTemperatureMapper
  extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature = Integer.parseInt(line.substring(87, 92));
    context.write(new Text(year), new IntWritable(airTemperature));
  }
}
This is a very simple implementation that pulls the year and temperature fields from
the line and writes them to the Context. Let's add a test for missing values, which in
the raw data are represented by a temperature of +9999:
@Test
public void ignoresMissingTemperatureRecord() throws IOException,
    InterruptedException {
  MaxTemperatureMapper mapper = new MaxTemperatureMapper();

  Text value = new Text("0043011990999991950051518004+68750+023550FM-12+0382" +
                                // Year ^^^^
      "99999V0203201N00261220001CN9999999N9+99991+99999999999");
                            // Temperature ^^^^^
  MaxTemperatureMapper.Context context =
      mock(MaxTemperatureMapper.Context.class);

  mapper.map(null, value, context);

  verify(context, never()).write(any(Text.class), any(IntWritable.class));
}
Since records with missing temperatures should be filtered out, this test uses Mockito
to verify that the write() method on the Context is never called for any Text key or
IntWritable value.
The existing test fails with a NumberFormatException, as parseInt() cannot parse integers
with a leading plus sign, so we fix up the implementation (version 2) to handle missing
values:
public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {

  String line = value.toString();
  String year = line.substring(15, 19);
  String temp = line.substring(87, 92);
  if (!missing(temp)) {
    int airTemperature = Integer.parseInt(temp);
    context.write(new Text(year), new IntWritable(airTemperature));
  }
}

private boolean missing(String temp) {
  return temp.equals("+9999");
}
With the test for the mapper passing, we move on to writing the reducer.
The reducer has to find the maximum value for a given key. Here's a simple test for
this feature:
@Test
public void returnsMaximumIntegerInValues() throws IOException,
    InterruptedException {
  MaxTemperatureReducer reducer = new MaxTemperatureReducer();

  Text key = new Text("1950");
  List<IntWritable> values = Arrays.asList(
      new IntWritable(10), new IntWritable(5));
  MaxTemperatureReducer.Context context =
      mock(MaxTemperatureReducer.Context.class);

  reducer.reduce(key, values, context);

  verify(context).write(key, new IntWritable(10));
}
We construct a list of some IntWritable values and then verify that
MaxTemperatureReducer picks the largest. The code in Example 5-6 is an implementation
of MaxTemperatureReducer that passes the test. Notice that we haven't tested the
case of an empty values iterator, but arguably we don't need to, since MapReduce
would never call the reducer in this case, as every key produced by a mapper has a value.
Example 5-6. Reducer for maximum temperature example
public class MaxTemperatureReducer
  extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values,
      Context context)
      throws IOException, InterruptedException {

    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}
Running Locally on Test Data
Now that we've got the mapper and reducer working on controlled inputs, the next
step is to write a job driver and run it on some test data on a development machine.
Running a Job in a Local Job Runner
Using the Tool interface introduced earlier in the chapter, it's easy to write a driver to
run our MapReduce job for finding the maximum temperature by year (see
MaxTemperatureDriver in Example 5-7).
Example 5-7. Application to find the maximum temperature
public class MaxTemperatureDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());
      ToolRunner.printGenericCommandUsage(System.err);
      return -1;
    }

    Job job = new Job(getConf(), "Max temperature");
    job.setJarByClass(getClass());

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args);
    System.exit(exitCode);
  }
}
MaxTemperatureDriver implements the Tool interface, so we get the benefit of being able
to set the options that GenericOptionsParser supports. The run() method constructs a
Job object based on the tool's configuration, which it uses to launch a job. Among the
possible job configuration parameters, we set the input and output file paths; the mapper,
reducer, and combiner classes; and the output types (the input types are determined
by the input format, which defaults to TextInputFormat and has LongWritable keys and
Text values). It's also a good idea to set a name for the job (Max temperature), so that
you can pick it out in the job list during execution and after it has completed. By default,
the name is the name of the JAR file, which is normally not particularly descriptive.
Now we can run this application against some local files. Hadoop comes with a local
job runner, a cut-down version of the MapReduce execution engine for running MapReduce
jobs in a single JVM. It's designed for testing and is very convenient for use in
an IDE, since you can run it in a debugger to step through the code in your mapper and
reducer.
The local job runner is only designed for simple testing of MapReduce
programs, so inevitably it differs from the full MapReduce implementation.
The biggest difference is that it can't run more than one reducer.
(It can support the zero-reducer case, too.) This is normally not a problem,
as most applications can work with one reducer, although on a
cluster you would choose a larger number to take advantage of parallelism.
The thing to watch out for is that even if you set the number of
reducers to a value over one, the local runner will silently ignore the
setting and use a single reducer.
This limitation may be removed in a future version of Hadoop.
The local job runner is enabled by a configuration setting. Normally, mapred.job.tracker is a host:port pair to specify the address of the jobtracker, but when it has the special value of local, the job is run in-process without accessing an external jobtracker.
From the command line, we can run the driver by typing:
% hadoop v2.MaxTemperatureDriver -conf conf/hadoop-local.xml \
input/ncdc/micro output
Equivalently, we could use the -fs and -jt options provided by GenericOptionsParser:
% hadoop v2.MaxTemperatureDriver -fs file:/// -jt local input/ncdc/micro output
This command executes MaxTemperatureDriver using input from the local input/ncdc/micro directory, producing output in the local output directory. Note that although we've set -fs so we use the local filesystem (file:///), the local job runner will actually work fine against any filesystem, including HDFS (and it can be handy to do this if you have a few files that are on HDFS).
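The conf/hadoop-local.xml file used in the command above isn't listed in this section; a minimal sketch of what it would contain, assuming the pre-YARN property names used throughout this chapter, is:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Use the local filesystem rather than HDFS -->
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
  <!-- Run jobs in-process with the local job runner -->
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>
</configuration>
```

This has the same effect as passing -fs file:/// -jt local on the command line.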
When we run the program, it fails and prints the following exception:
java.lang.NumberFormatException: For input string: "+0000"
Fixing the mapper
This exception shows that the map method still can't parse positive temperatures. (If the stack trace hadn't given us enough information to diagnose the fault, we could run the test in a local debugger, since it runs in a single JVM.) Earlier, we made it handle the special case of a missing temperature, +9999, but not the general case of any positive temperature. With more logic going into the mapper, it makes sense to factor out a parser class to encapsulate the parsing logic; see Example 5-8 (now on version 3).
Example 5-8. A class for parsing weather records in NCDC format
public class NcdcRecordParser {
  
  private static final int MISSING_TEMPERATURE = 9999;
  
  private String year;
  private int airTemperature;
  private String quality;
  
  public void parse(String record) {
    year = record.substring(15, 19);
    String airTemperatureString;
    // Remove leading plus sign as parseInt doesn't like them
    if (record.charAt(87) == '+') {
      airTemperatureString = record.substring(88, 92);
    } else {
      airTemperatureString = record.substring(87, 92);
    }
    airTemperature = Integer.parseInt(airTemperatureString);
    quality = record.substring(92, 93);
  }
  
  public void parse(Text record) {
    parse(record.toString());
  }
  
  public boolean isValidTemperature() {
    return airTemperature != MISSING_TEMPERATURE && quality.matches("[01459]");
  }
  
  public String getYear() {
    return year;
  }
  
  public int getAirTemperature() {
    return airTemperature;
  }
}
The resulting mapper is much simpler (see Example 5-9). It just calls the parser's parse() method, which parses the fields of interest from a line of input, checks whether a valid temperature was found using the isValidTemperature() query method, and if it was, retrieves the year and the temperature using the getter methods on the parser. Notice that we also check the quality status field as well as missing temperatures in isValidTemperature() to filter out poor temperature readings.
Another benefit of creating a parser class is that it makes it easy to write related mappers for similar jobs without duplicating code. It also gives us the opportunity to write unit tests directly against the parser, for more targeted testing.
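Since the core of the parser is fixed-width string slicing with no Hadoop dependencies, that logic can be exercised in plain Java. Here is a hedged sketch (the column offsets match Example 5-8; the helper name and the synthetic record are ours, not the book's):

```java
public class ParserSketch {

    // Extract the air temperature from a fixed-width NCDC line,
    // stripping a leading '+' that Integer.parseInt rejects.
    static int parseTemperature(String record) {
        String s = (record.charAt(87) == '+')
                ? record.substring(88, 92)
                : record.substring(87, 92);
        return Integer.parseInt(s);
    }

    public static void main(String[] args) {
        // Synthetic 93-character line: only the year (columns 15-18)
        // and the signed temperature (columns 87-91) are filled in.
        StringBuilder sb = new StringBuilder("0".repeat(93));
        sb.replace(15, 19, "1950");
        sb.replace(87, 92, "+0111");
        String record = sb.toString();
        System.out.println(record.substring(15, 19));  // prints 1950
        System.out.println(parseTemperature(record));  // prints 111 (11.1 degrees)
    }
}
```

A negative temperature such as -0011 parses directly, since parseInt accepts a leading minus; only the plus sign needs stripping.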
Example 5-9. A Mapper that uses a utility class to parse records
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  
  private NcdcRecordParser parser = new NcdcRecordParser();
  
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    
    parser.parse(value);
    if (parser.isValidTemperature()) {
      context.write(new Text(parser.getYear()),
          new IntWritable(parser.getAirTemperature()));
    }
  }
}
With these changes, the test passes.
Testing the Driver
Apart from the flexible configuration options offered by making your application implement Tool, you also make it more testable because it allows you to inject an arbitrary Configuration. You can take advantage of this to write a test that uses a local job runner to run a job against known input data, which checks that the output is as expected.

There are two approaches to doing this. The first is to use the local job runner and run the job against a test file on the local filesystem. The code in Example 5-10 gives an idea of how to do this.
Example 5-10. A test for MaxTemperatureDriver that uses a local, in-process job runner
@Test
public void test() throws Exception {
  Configuration conf = new Configuration();
  conf.set("fs.default.name", "file:///");
  conf.set("mapred.job.tracker", "local");
  
  Path input = new Path("input/ncdc/micro");
  Path output = new Path("output");
  
  FileSystem fs = FileSystem.getLocal(conf);
  fs.delete(output, true); // delete old output
  
  MaxTemperatureDriver driver = new MaxTemperatureDriver();
  driver.setConf(conf);
  
  int exitCode = driver.run(new String[] {
      input.toString(), output.toString() });
  assertThat(exitCode, is(0));
  
  checkOutput(conf, output);
}
The test explicitly sets fs.default.name and mapred.job.tracker so it uses the local filesystem and the local job runner. It then runs the MaxTemperatureDriver via its Tool interface against a small amount of known data. At the end of the test, the checkOutput() method is called to compare the actual output with the expected output, line by line.
The second way of testing the driver is to run it using a "mini-" cluster. Hadoop has a pair of testing classes, called MiniDFSCluster and MiniMRCluster, which provide a programmatic way of creating in-process clusters. Unlike the local job runner, these allow testing against the full HDFS and MapReduce machinery. Bear in mind, too, that tasktrackers in a mini-cluster launch separate JVMs to run tasks in, which can make debugging more difficult.
Mini-clusters are used extensively in Hadoop's own automated test suite, but they can be used for testing user code, too. Hadoop's ClusterMapReduceTestCase abstract class provides a useful base for writing such a test, handles the details of starting and stopping the in-process HDFS and MapReduce clusters in its setUp() and tearDown() methods, and generates a suitable configuration object that is set up to work with them. Subclasses need only populate data in HDFS (perhaps by copying from a local file), run a MapReduce job, then confirm the output is as expected. Refer to the MaxTemperatureDriverMiniTest class in the example code that comes with this book for the listing.

Tests like this serve as regression tests, and are a useful repository of input edge cases and their expected results. As you encounter more test cases, you can simply add them to the input file and update the file of expected output accordingly.
Running on a Cluster
Now that we are happy with the program running on a small test dataset, we are ready to try it on the full dataset on a Hadoop cluster. Chapter 9 covers how to set up a fully distributed cluster, although you can also work through this section on a pseudo-distributed cluster.

We don't need to make any modifications to the program to run on a cluster rather than on a single machine, but we do need to package the program as a JAR file to send to the cluster. This is conveniently achieved using Ant, using a task such as this (you can find the complete build file in the example code):
<jar destfile="hadoop-examples.jar" basedir="${classes.dir}"/>
If you have a single job per JAR, then you can specify the main class to run in the JAR file's manifest. If the main class is not in the manifest, then it must be specified on the command line (as you will see shortly). Also, any dependent JAR files should be packaged in a lib subdirectory in the JAR file. (This is analogous to a Java Web application archive, or WAR file, except in that case the JAR files go in a WEB-INF/lib subdirectory in the WAR file.)
Launching a Job
To launch the job, we need to run the driver, specifying the cluster that we want to run the job on with the -conf option (we could equally have used the -fs and -jt options):
% hadoop jar hadoop-examples.jar v3.MaxTemperatureDriver -conf conf/hadoop-cluster.xml \
input/ncdc/all max-temp
The waitForCompletion() method on Job launches the job and polls for progress, writing a line summarizing the map and reduce's progress whenever either changes. Here's the output (some lines have been removed for clarity):
09/04/11 08:15:52 INFO mapred.FileInputFormat: Total input paths to process : 101
09/04/11 08:15:53 INFO mapred.JobClient: Running job: job_200904110811_0002
09/04/11 08:15:54 INFO mapred.JobClient: map 0% reduce 0%
09/04/11 08:16:06 INFO mapred.JobClient: map 28% reduce 0%
09/04/11 08:16:07 INFO mapred.JobClient: map 30% reduce 0%
09/04/11 08:21:36 INFO mapred.JobClient: map 100% reduce 100%
09/04/11 08:21:38 INFO mapred.JobClient: Job complete: job_200904110811_0002
09/04/11 08:21:38 INFO mapred.JobClient: Counters: 19
09/04/11 08:21:38 INFO mapred.JobClient: Job Counters
09/04/11 08:21:38 INFO mapred.JobClient: Launched reduce tasks=32
09/04/11 08:21:38 INFO mapred.JobClient: Rack-local map tasks=82
09/04/11 08:21:38 INFO mapred.JobClient: Launched map tasks=127
09/04/11 08:21:38 INFO mapred.JobClient: Data-local map tasks=45
09/04/11 08:21:38 INFO mapred.JobClient: FileSystemCounters
09/04/11 08:21:38 INFO mapred.JobClient: FILE_BYTES_READ=12667214
09/04/11 08:21:38 INFO mapred.JobClient: HDFS_BYTES_READ=33485841275
09/04/11 08:21:38 INFO mapred.JobClient: FILE_BYTES_WRITTEN=989397
09/04/11 08:21:38 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=904
09/04/11 08:21:38 INFO mapred.JobClient: Map-Reduce Framework
09/04/11 08:21:38 INFO mapred.JobClient: Reduce input groups=100
09/04/11 08:21:38 INFO mapred.JobClient: Combine output records=4489
09/04/11 08:21:38 INFO mapred.JobClient: Map input records=1209901509
09/04/11 08:21:38 INFO mapred.JobClient: Reduce shuffle bytes=19140
09/04/11 08:21:38 INFO mapred.JobClient: Reduce output records=100
09/04/11 08:21:38 INFO mapred.JobClient: Spilled Records=9481
09/04/11 08:21:38 INFO mapred.JobClient: Map output bytes=10282306995
09/04/11 08:21:38 INFO mapred.JobClient: Map input bytes=274600205558
09/04/11 08:21:38 INFO mapred.JobClient: Combine input records=1142482941
09/04/11 08:21:38 INFO mapred.JobClient: Map output records=1142478555
09/04/11 08:21:38 INFO mapred.JobClient: Reduce input records=103
The output includes more useful information. Before the job starts, its ID is printed: this is needed whenever you want to refer to the job, in logfiles for example, or when interrogating it via the hadoop job command. When the job is complete, its statistics (known as counters) are printed out. These are very useful for confirming that the job did what you expected. For example, for this job we can see that around 275 GB of input data was analyzed ("Map input bytes"), read from around 34 GB of compressed files on HDFS ("HDFS_BYTES_READ"). The input was broken into 101 gzipped files of reasonable size, so there was no problem with not being able to split them.
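The figures quoted above can be checked directly against the counter values in the listing; a quick sanity calculation in plain Java (counter values copied from the job output):

```java
public class CounterMath {
    public static void main(String[] args) {
        long mapInputBytes = 274_600_205_558L;  // "Map input bytes"
        long hdfsBytesRead = 33_485_841_275L;   // "HDFS_BYTES_READ"

        // Decimal gigabytes, as quoted in the text
        System.out.printf("input analyzed: %.0f GB%n", mapInputBytes / 1e9);
        System.out.printf("compressed on HDFS: %.1f GB%n", hdfsBytesRead / 1e9);
        // Compression ratio of the gzipped input, roughly 8:1
        System.out.printf("compression ratio: %.1f:1%n",
                (double) mapInputBytes / hdfsBytesRead);
    }
}
```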
Job, Task, and Task Attempt IDs
The format of a job ID is composed of the time that the jobtracker (not the job) started and an incrementing counter maintained by the jobtracker to uniquely identify the job to that instance of the jobtracker. So the job with this ID:

job_200904110811_0002

is the second (0002; job IDs are 1-based) job run by the jobtracker that started at 08:11 on April 11, 2009. The counter is formatted with leading zeros to make job IDs sort nicely, in directory listings for example. However, when the counter reaches 10000 it is not reset, resulting in longer job IDs (which don't sort so well).
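The sorting problem is easy to demonstrate. In this sketch the formatting helper is ours, but the zero-padding matches the ID layout shown in this section:

```java
import java.util.Arrays;

public class JobIdSort {

    // Build a job ID: jobtracker start timestamp plus a counter
    // zero-padded to four digits (longer counters are not truncated).
    static String jobId(String jtStart, int counter) {
        return String.format("job_%s_%04d", jtStart, counter);
    }

    public static void main(String[] args) {
        String[] ids = {
            jobId("200904110811", 9999),
            jobId("200904110811", 2),
            jobId("200904110811", 10000),
            jobId("200904110811", 10),
        };
        Arrays.sort(ids); // lexicographic, as in a directory listing
        for (String id : ids) {
            System.out.println(id);
        }
        // The five-digit ID sorts *before* ..._9999, which is why IDs
        // past 10000 no longer sort in submission order.
    }
}
```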
Tasks belong to a job, and their IDs are formed by replacing the job prefix of a job ID with a task prefix, and adding a suffix to identify the task within the job. For example:

task_200904110811_0002_m_000003

is the fourth (000003; task IDs are 0-based) map (m) task of the job with ID job_200904110811_0002. The task IDs are created for a job when it is initialized, so they do not necessarily dictate the order that the tasks will be executed in.
Tasks may be executed more than once, due to failure (see "Task Failure" on page 200) or speculative execution (see "Speculative Execution" on page 213), so to identify different instances of a task execution, task attempts are given unique IDs on the jobtracker. For example:

attempt_200904110811_0002_m_000003_0

is the first (0; attempt IDs are 0-based) attempt at running task task_200904110811_0002_m_000003. Task attempts are allocated during the job run as needed, so their ordering represents the order that they were created for tasktrackers to run.
The final count in the task attempt ID is incremented by 1,000 if the job is restarted after the jobtracker is restarted and recovers its running jobs (although this behavior is disabled by default; see "Jobtracker Failure" on page 202).
The MapReduce Web UI
Hadoop comes with a web UI for viewing information about your jobs. It is useful for following a job's progress while it is running, as well as finding job statistics and logs after the job has completed. You can find the UI at http://jobtracker-host:50030/.
The jobtracker page
A screenshot of the home page is shown in Figure 5-1. The first section of the page gives details of the Hadoop installation, such as the version number and when it was compiled, and the current state of the jobtracker (in this case, running), and when it was started.
Next is a summary of the cluster, which has measures of cluster capacity and utilization. This shows the number of maps and reduces currently running on the cluster, the total number of job submissions, the number of tasktracker nodes currently available, and the cluster's capacity: in terms of the number of map and reduce slots available across the cluster ("Map Task Capacity" and "Reduce Task Capacity"), and the number of available slots per node, on average. The number of tasktrackers that have been blacklisted by the jobtracker is listed as well (blacklisting is discussed in "Tasktracker Failure" on page 201).
Below the summary, there is a section about the job scheduler that is running (here the default). You can click through to see job queues.
Further down, we see sections for running, (successfully) completed, and failed jobs. Each of these sections has a table of jobs, with a row per job that shows the job's ID, owner, name (as set in the Job constructor or setJobName() method, both of which internally set the mapred.job.name property), and progress information.

Finally, at the foot of the page, there are links to the jobtracker's logs, and the jobtracker's history: information on all the jobs that the jobtracker has run. The main view displays only 100 jobs (configurable via the mapred.jobtracker.completeuserjobs.maximum property), before consigning them to the history page. Note also that the job history is persistent, so you can find jobs here from previous runs of the jobtracker.
Figure 5-1. Screenshot of the jobtracker page
Job History
Job history refers to the events and configuration for a completed job. It is retained whether the job was successful or not, in an attempt to provide interesting information for the user running a job.

Job history files are stored on the local filesystem of the jobtracker in a history subdirectory of the logs directory. It is possible to set the location to an arbitrary Hadoop filesystem via the hadoop.job.history.location property. The jobtracker's history files are kept for 30 days before being deleted by the system.

A second copy is also stored for the user in the _logs/history subdirectory of the job's output directory. This location may be overridden by setting hadoop.job.history.user.location. By setting it to the special value none, no user job history is saved, although job history is still saved centrally. A user's job history files are never deleted by the system.
The history log includes job, task, and attempt events, all of which are stored in a plain-text file. The history for a particular job may be viewed through the web UI, or via the command line, using hadoop job -history (which you point at the job's output directory).
The job page
Clicking on a job ID brings you to a page for the job, illustrated in Figure 5-2. At the top of the page is a summary of the job, with basic information such as job owner and name, and how long the job has been running for. The job file is the consolidated configuration file for the job, containing all the properties and their values that were in effect during the job run. If you are unsure of what a particular property was set to, you can click through to inspect the file.
While the job is running, you can monitor its progress on this page, which periodically updates itself. Below the summary is a table that shows the map progress and the reduce progress. "Num Tasks" shows the total number of map and reduce tasks for this job (a row for each). The other columns then show the state of these tasks: "Pending" (waiting to run), "Running," "Complete" (successfully run), and "Killed" (tasks that have failed; this column would be more accurately labeled "Failed"). The final column shows the total number of failed and killed task attempts for all the map or reduce tasks for the job (task attempts may be marked as killed if they are a speculative execution duplicate, if the tasktracker they are running on dies, or if they are killed by a user). See "Task Failure" on page 200 for background on task failure.
Further down the page, you can find completion graphs for each task that show their progress graphically. The reduce completion graph is divided into the three phases of the reduce task: copy (when the map outputs are being transferred to the reduce's tasktracker), sort (when the reduce inputs are being merged), and reduce (when the reduce function is being run to produce the final output). The phases are described in more detail in "Shuffle and Sort" on page 205.
In the middle of the page is a table of job counters. These are dynamically updated during the job run, and provide another useful window into the job's progress and general health. There is more information about what these counters mean in "Built-in Counters" on page 257.
Retrieving the Results
Once the job is finished, there are various ways to retrieve the results. Each reducer produces one output file, so there are 30 part files named part-r-00000 to part-r-00029 in the max-temp directory.
As their names suggest, a good way to think of these "part" files is as parts of the max-temp "file."

If the output is large (which it isn't in this case), then it is important to have multiple parts so that more than one reducer can work in parallel. Usually, if a file is in this partitioned form, it can still be used easily enough: as the input to another MapReduce job, for example. In some cases, you can exploit the structure of multiple partitions to do a map-side join, for example ("Map-Side Joins" on page 282), or a MapFile lookup ("An application: Partitioned MapFile lookups" on page 269).
This job produces a very small amount of output, so it is convenient to copy it from HDFS to our development machine. The -getmerge option to the hadoop fs command is useful here, as it gets all the files in the directory specified in the source pattern and merges them into a single file on the local filesystem:
% hadoop fs -getmerge max-temp max-temp-local
% sort max-temp-local | tail
1991 607
1992 605
1993 567
1994 568
1995 567
1996 561
1997 565
1998 568
1999 568
2000 558
We sorted the output, as the reduce output partitions are unordered (owing to the hash partition function). Doing a bit of postprocessing of data from MapReduce is very common, as is feeding it into analysis tools, such as R, a spreadsheet, or even a relational database.
Figure 5-2. Screenshot of the job page
Another way of retrieving the output if it is small is to use the -cat option to print the output files to the console:
% hadoop fs -cat max-temp/*
On closer inspection, we see that some of the results don't look plausible. For instance, the maximum temperature for 1951 (not shown here) is 590°C! How do we find out what's causing this? Is it corrupt input data or a bug in the program?
Debugging a Job
The time-honored way of debugging programs is via print statements, and this is certainly possible in Hadoop. However, there are complications to consider: with programs running on tens, hundreds, or thousands of nodes, how do we find and examine the output of the debug statements, which may be scattered across these nodes? For this particular case, where we are looking for (what we think is) an unusual case, we can use a debug statement to log to standard error, in conjunction with a message to update the task's status message to prompt us to look in the error log. The web UI makes this easy, as we will see.
We also create a custom counter to count the total number of records with implausible temperatures in the whole dataset. This gives us valuable information about how to deal with the condition: if it turns out to be a common occurrence, then we might need to learn more about the condition and how to extract the temperature in these cases, rather than simply dropping the record. In fact, when trying to debug a job, you should always ask yourself if you can use a counter to get the information you need to find out what's happening. Even if you need to use logging or a status message, it may be useful to use a counter to gauge the extent of the problem. (There is more on counters in "Counters" on page 257.)
If the amount of log data you produce in the course of debugging is large, then you've got a couple of options. The first is to write the information to the map's output, rather than to standard error, for analysis and aggregation by the reduce. This approach usually necessitates structural changes to your program, so start with the other techniques first. Alternatively, you can write a program (in MapReduce of course) to analyze the logs produced by your job.
We add our debugging to the mapper (version 4), as opposed to the reducer, as we want to find out what the source data causing the anomalous output looks like:
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  
  enum Temperature {
    OVER_100
  }
  
  private NcdcRecordParser parser = new NcdcRecordParser();
  
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    
    parser.parse(value);
    if (parser.isValidTemperature()) {
      int airTemperature = parser.getAirTemperature();
      if (airTemperature > 1000) {
        System.err.println("Temperature over 100 degrees for input: " + value);
        context.setStatus("Detected possibly corrupt record: see logs.");
        context.getCounter(Temperature.OVER_100).increment(1);
      }
      context.write(new Text(parser.getYear()), new IntWritable(airTemperature));
    }
  }
}
If the temperature is over 100°C (represented by 1000, since temperatures are in tenths of a degree), we print a line to standard error with the suspect line, as well as updating the map's status message using the setStatus() method on Context, directing us to look in the log. We also increment a counter, which in Java is represented by a field of an enum type. In this program, we have defined a single field, OVER_100, as a way to count the number of records with a temperature of over 100°C.
With this modification, we recompile the code, re-create the JAR file, then rerun the job, and while it's running go to the tasks page.
The tasks page
The job page has a number of links for looking at the tasks in a job in more detail. For example, by clicking on the "map" link, you are brought to a page that lists information for all of the map tasks on one page. You can also see just the completed tasks. The screenshot in Figure 5-3 shows a portion of this page for the job run with our debugging statements. Each row in the table is a task, and it provides such information as the start and end times for each task, any errors reported back from the tasktracker, and a link to view the counters for an individual task.
The "Status" column can be helpful for debugging, since it shows a task's latest status message. Before a task starts, it shows its status as "initializing," then once it starts reading records it shows the split information for the split it is reading as a filename with a byte offset and length. You can see the status we set for debugging for task task_200904110811_0003_m_000044, so let's click through to the logs page to find the associated debug message. (Notice, too, that there is an extra counter for this task, since our user counter has a nonzero count for this task.)
The task details page
From the tasks page, you can click on any task to get more information about it. The task details page, shown in Figure 5-4, shows each task attempt. In this case, there was one task attempt, which completed successfully. The table provides further useful data, such as the node the task attempt ran on, and links to task logfiles and counters.
The "Actions" column contains links for killing a task attempt. By default, this is disabled, making the web UI a read-only interface. Set webinterface.private.actions to true to enable the actions links.
Figure 5-3. Screenshot of the tasks page
Figure 5-4. Screenshot of the task details page
By setting webinterface.private.actions to true, you also allow anyone with access to the HDFS web interface to delete files. The dfs.web.ugi property determines the user that the HDFS web UI runs as, thus controlling which files may be viewed and deleted.
For map tasks, there is also a section showing which nodes the input split was located on.
By following one of the links to the logfiles for the successful task attempt (you can see the last 4 KB or 8 KB of each logfile, or the entire file), we can find the suspect input record that we logged (the line is wrapped and truncated to fit on the page):
Temperature over 100 degrees for input:
0335999999433181957042302005+37950+139117SAO +0004RJSN V020113590031500703569999994
33201957010100005+35317+139650SAO +000899999V02002359002650076249N004000599+0067...
This record seems to be in a different format from the others. For one thing, there are spaces in the line, which are not described in the specification.
When the job has finished, we can look at the value of the counter we defined to see how many records over 100°C there are in the whole dataset. Counters are accessible via the web UI or the command line:

% hadoop job -counter job_200904110811_0003 'v4.MaxTemperatureMapper$Temperature' \
  OVER_100
3
The -counter option takes the job ID, counter group name (which is the fully qualified classname here), and the counter name (the enum name). There are only three malformed records in the entire dataset of over a billion records. Throwing out bad records is standard for many big data problems, although we need to be careful in this case, since we are looking for an extreme value, the maximum temperature, rather than an aggregate measure. Still, throwing away three records is probably not going to change the result.
Handling malformed data
Capturing input data that causes a problem is valuable, as we can use it in a test to check that the mapper does the right thing:
@Test
public void parsesMalformedTemperature() throws IOException,
    InterruptedException {
  MaxTemperatureMapper mapper = new MaxTemperatureMapper();
  Text value = new Text("0335999999433181957042302005+37950+139117SAO  +0004" +
                            // Year ^^^^
      "RJSN V02011359003150070356999999433201957010100005+353");
                            // Temperature ^^^^^
  MaxTemperatureMapper.Context context =
      mock(MaxTemperatureMapper.Context.class);
  Counter counter = mock(Counter.class);
  when(context.getCounter(MaxTemperatureMapper.Temperature.MALFORMED))
      .thenReturn(counter);
  
  mapper.map(null, value, context);
  
  verify(context, never()).write(any(Text.class), any(IntWritable.class));
  verify(counter).increment(1);
}
The record that was causing the problem is of a different format from the other lines we've seen. Example 5-11 shows a modified program (version 5) using a parser that ignores each line with a temperature field that does not have a leading sign (plus or minus). We've also introduced a counter to measure the number of records that we are ignoring for this reason.
Example 5-11. Mapper for the maximum temperature example
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  
  enum Temperature {
    MALFORMED
  }
  
  private NcdcRecordParser parser = new NcdcRecordParser();
  
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    
    parser.parse(value);
    if (parser.isValidTemperature()) {
      int airTemperature = parser.getAirTemperature();
      context.write(new Text(parser.getYear()), new IntWritable(airTemperature));
    } else if (parser.isMalformedTemperature()) {
      System.err.println("Ignoring possibly corrupt input: " + value);
      context.getCounter(Temperature.MALFORMED).increment(1);
    }
  }
}
Hadoop Logs
Hadoop produces logs in various places, for various audiences. These are summarized in Table 5-2.
Table 5-2. Types of Hadoop logs

Logs: System daemon logs
Primary audience: Administrators
Description: Each Hadoop daemon produces a logfile (using log4j) and another file that combines standard out and error. Written in the directory defined by the HADOOP_LOG_DIR environment variable.
Further information: "System logfiles" on page 307 and "Logging" on page 349.

Logs: HDFS audit logs
Primary audience: Administrators
Description: A log of all HDFS requests, turned off by default. Written to the namenode's log, although this is configurable.
Further information: "Audit Logging" on page 344.

Logs: MapReduce job history logs
Primary audience: Users
Description: A log of the events (such as task completion) that occur in the course of running a job. Saved centrally on the jobtracker, and in the job's output directory in a _logs/history subdirectory.
Further information: "Job History" on page 166.

Logs: MapReduce task logs
Primary audience: Users
Description: Each tasktracker child process produces a logfile using log4j (called syslog), a file for data sent to standard out (stdout), and a file for standard error (stderr). Written in the userlogs subdirectory of the directory defined by the HADOOP_LOG_DIR environment variable.
Further information: This section.
As we have seen in the previous section, MapReduce task logs are accessible through the web UI, which is the most convenient way to view them. You can also find the logfiles on the local filesystem of the tasktracker that ran the task attempt, in a directory named by the task attempt. If task JVM reuse is enabled ("Task JVM Reuse" on page 216), then each logfile accumulates the logs for the entire JVM run, so multiple task attempts will be found in each logfile. The web UI hides this by showing only the portion that is relevant for the task attempt being viewed.
It is straightforward to write to these logfiles. Anything written to standard output, or standard error, is directed to the relevant logfile. (Of course, in Streaming, standard output is used for the map or reduce output, so it will not show up in the standard output log.)
In Java, you can write to the task's syslog file if you wish by using the Apache Commons Logging API. This is shown in Example 5-12.
Example 5-12. An identity mapper that writes to standard output and also uses the Apache Commons Logging API
import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.mapreduce.Mapper;

public class LoggingIdentityMapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
    extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  
  private static final Log LOG = LogFactory.getLog(LoggingIdentityMapper.class);
  
  @Override
  @SuppressWarnings("unchecked")
  public void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    // Log to stdout file
    System.out.println("Map key: " + key);
    
    // Log to syslog file
    LOG.info("Map key: " + key);
    if (LOG.isDebugEnabled()) {
      LOG.debug("Map value: " + value);
    }
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}
The default log level is INFO, so DEBUG-level messages do not appear in the syslog task log file. However, sometimes you want to see these messages; to do this, set mapred.map.child.log.level or mapred.reduce.child.log.level, as appropriate (from 0.22). For example, in this case we could set it for the mapper to see the map values in the log as follows:
% hadoop jar hadoop-examples.jar LoggingDriver -conf conf/hadoop-cluster.xml \
-D mapred.map.child.log.level=DEBUG input/ncdc/sample.txt logging-out
There are some controls for managing the retention and size of task logs. By default, logs are deleted after a minimum of 24 hours (set using the mapred.userlog.retain.hours property). You can also set a cap on the maximum size of each logfile using the mapred.userlog.limit.kb property, which is 0 by default, meaning there is no cap.
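The property names above are as described in the text; as an illustration, they might be set cluster-wide in mapred-site.xml like this (the values shown are examples, not the defaults):

```xml
<!-- Illustrative values only; the property names are described in the text -->
<property>
  <name>mapred.userlog.retain.hours</name>
  <value>48</value> <!-- keep task logs for at least 48 hours -->
</property>
<property>
  <name>mapred.userlog.limit.kb</name>
  <value>1024</value> <!-- cap each task logfile at 1 MB; 0 means no cap -->
</property>
```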
Sometimes you may need to debug a problem that you suspect is occurring in the JVM running a Hadoop command, rather than on the cluster. You can send DEBUG-level logs to the console by using an invocation like this:
% HADOOP_ROOT_LOGGER=DEBUG,console hadoop fs -text /foo/bar
Remote Debugging
When a task fails and there is not enough information logged to diagnose the error, you may want to resort to running a debugger for that task. This is hard to arrange when running the job on a cluster, as you don't know which node is going to process which part of the input, so you can't set up your debugger ahead of the failure. However, there are a few other options available:
Reproduce the failure locally
Often the failing task fails consistently on a particular input. You can try to reproduce the problem locally by downloading the file that the task is failing on and running the job locally, possibly using a debugger such as Java's VisualVM.
Use JVM debugging options
A common cause of failure is a Java out-of-memory error in the task JVM. You can set mapred.child.java.opts to include -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/dumps to produce a heap dump that can be examined afterward with tools like jhat or the Eclipse Memory Analyzer. Note that the JVM options should be added to the existing memory settings specified by mapred.child.java.opts. These are explained in more detail in "Memory" on page 305.
Use task profiling
Java profilers give a lot of insight into the JVM, and Hadoop provides a mechanism to profile a subset of the tasks in a job. See "Profiling Tasks" on page 177.
Use IsolationRunner
Older versions of Hadoop provided a special task runner called IsolationRunner that could rerun failed tasks in situ on the cluster. Unfortunately, it is no longer available in recent versions, but you can track its replacement at https://issues
In some cases it's useful to keep the intermediate files for a failed task attempt for later inspection, particularly if supplementary dump or profile files are created in the task's working directory. You can set keep.failed.task.files to true to keep a failed task's files.
You can keep the intermediate files for successful tasks, too, which may be handy if you want to examine a task that isn't failing. In this case, set the property keep.task.files.pattern to a regular expression that matches the IDs of the tasks you want to keep.
To examine the intermediate files, log into the node that the task failed on and look for the directory for that task attempt. It will be under one of the local MapReduce directories, as set by the mapred.local.dir property (covered in more detail in "Important Hadoop Daemon Properties" on page 309). If this property is a comma-separated list of directories (to spread load across the physical disks on a machine), then you may need to look in all of the directories before you find the directory for that particular task attempt. The task attempt directory is in the following location:

Tuning a Job
After a job is working, the question many developers ask is, "Can I make it run faster?" There are a few Hadoop-specific "usual suspects" that are worth checking to see if they are responsible for a performance problem. You should run through the checklist in Table 5-3 before you start trying to profile or optimize at the task level.
Table 5-3. Tuning checklist

Number of mappers: How long are your mappers running for? If they are only running for a few seconds on average, you should see if there's a way to have fewer mappers and make them all run longer (a minute or so, as a rule of thumb). The extent to which this is possible depends on the input format you are using. (See "Small files and CombineFileInputFormat" on page 237.)

Number of reducers: For maximum performance, the number of reducers should be slightly less than the number of reduce slots in the cluster. This allows the reducers to finish in one wave and fully utilizes the cluster during the reduce phase. (See "Choosing the Number of Reducers" on page 229.)

Combiners: Can your job take advantage of a combiner to reduce the amount of data passing through the shuffle? (See "Combiner Functions" on page 34.)

Intermediate compression: Job execution time can almost always benefit from enabling map output compression. (See "Compressing map output" on page 94.)

Custom serialization: If you are using your own custom Writable objects or custom comparators, make sure you have implemented RawComparator. (See "Implementing a RawComparator for speed" on page 108.)

Shuffle tweaks: The MapReduce shuffle exposes around a dozen tuning parameters for memory management, which may help you eke out the last bit of performance. (See "Configuration Tuning" on page 209.)
Profiling Tasks
Like debugging, profiling a job running on a distributed system like MapReduce presents some challenges. Hadoop allows you to profile a fraction of the tasks in a job and, as each task completes, pulls down the profile information to your machine for later analysis with standard profiling tools.
Of course, it's possible, and somewhat easier, to profile a job running in the local job runner. And provided you can run with enough input data to exercise the map and reduce tasks, this can be a valuable way of improving the performance of your mappers and reducers. There are a couple of caveats, however. The local job runner is a very different environment from a cluster, and the data flow patterns are very different. Optimizing the CPU performance of your code may be pointless if your MapReduce job is I/O-bound (as many jobs are). To be sure that any tuning is effective, you should compare the new execution time with the old running on a real cluster. Even this is easier said than done, since job execution times can vary due to resource contention with other jobs and the decisions the scheduler makes regarding task placement. To get a good idea of job execution time under these circumstances, perform a series of runs (with and without the change) and check whether any improvement is statistically significant.
It's unfortunately true that some problems (such as excessive memory use) can be reproduced only on the cluster, and in these cases the ability to profile in situ is indispensable.
The HPROF profiler
There are a number of configuration properties to control profiling, which are also exposed via convenience methods on JobConf. The following modification to MaxTemperatureDriver (version 6) will enable remote HPROF profiling. HPROF is a profiling tool that comes with the JDK that, although basic, can give valuable information about a program's CPU and heap usage:

Configuration conf = getConf();
conf.setBoolean("mapred.task.profile", true);
conf.set("mapred.task.profile.params", "-agentlib:hprof=cpu=samples," +
    "heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s");
conf.set("mapred.task.profile.maps", "0-2");
conf.set("mapred.task.profile.reduces", ""); // no reduces
Job job = new Job(conf, "Max temperature");
The first line enables profiling, which by default is turned off. (Instead of using mapred.task.profile, you can also use the JobContext.TASK_PROFILE constant in the new API.)
Next we set the profile parameters, which are the extra command-line arguments to pass to the task's JVM. (When profiling is enabled, a new JVM is allocated for each task, even if JVM reuse is turned on; see "Task JVM Reuse" on page 216.) The default parameters specify the HPROF profiler; here we set an extra HPROF option, depth=6, to give more stack trace depth than the HPROF default. (Using JobContext.TASK_PROFILE_PARAMS is equivalent to setting the mapred.task.profile.params property.)
Finally, we specify which tasks we want to profile. We normally only want profile information from a few tasks, so we use the properties mapred.task.profile.maps and mapred.task.profile.reduces to specify the range of (map or reduce) task IDs that we want profile information for. We've set the maps property to 0-2 (which is actually the default), which means map tasks with IDs 0, 1, and 2 are profiled. A set of ranges is permitted, using a notation that allows open ranges. For example, 0-1,4,6- would specify all tasks except those with IDs 2, 3, and 5. The tasks to profile can also be controlled using the JobContext.NUM_MAP_PROFILES constant for map tasks, and JobContext.NUM_REDUCE_PROFILES for reduce tasks.
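To make the range notation concrete, here is a small, self-contained Java sketch (not Hadoop's actual parser) that expands a range specification such as 0-1,4,6- against a known number of task IDs:

```java
import java.util.ArrayList;
import java.util.List;

public class ProfileRanges {

    // Expand a range spec such as "0-1,4,6-" into the matching task IDs,
    // given the total number of tasks. A missing bound means the range is open.
    static List<Integer> expand(String spec, int numTasks) {
        List<Integer> ids = new ArrayList<>();
        for (int id = 0; id < numTasks; id++) {
            for (String part : spec.split(",")) {
                int dash = part.indexOf('-');
                boolean match;
                if (dash < 0) {                          // a single ID, e.g. "4"
                    match = id == Integer.parseInt(part);
                } else {
                    String lo = part.substring(0, dash);
                    String hi = part.substring(dash + 1);
                    int from = lo.isEmpty() ? 0 : Integer.parseInt(lo);
                    int to = hi.isEmpty() ? numTasks - 1 : Integer.parseInt(hi);
                    match = id >= from && id <= to;
                }
                if (match) {
                    ids.add(id);
                    break;                               // avoid duplicates
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // With 8 map tasks, "0-1,4,6-" selects every task except IDs 2, 3, and 5
        System.out.println(expand("0-1,4,6-", 8));       // [0, 1, 4, 6, 7]
    }
}
```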
When we run a job with the modified driver, the profile output turns up at the end of the job in the directory we launched the job from. Since we are only profiling a few tasks, we can run the job on a subset of the dataset.
Here's a snippet of one of the mapper's profile files, which shows the CPU sampling information:

2. HPROF uses bytecode insertion to profile your code, so you do not need to recompile your application with special options to use it. For more information on HPROF, see "HPROF: A Heap/CPU Profiling Tool in J2SE 5.0," by Kelly O'Hair at http://java.sun.com/developer/technicalArticles/Programming/
CPU SAMPLES BEGIN (total = 1002) Sat Apr 11 11:17:52 2009
rank self accum count trace method
1 3.49% 3.49% 35 307969 java.lang.Object.<init>
2 3.39% 6.89% 34 307954 java.lang.Object.<init>
3 3.19% 10.08% 32 307945 java.util.regex.Matcher.<init>
4 3.19% 13.27% 32 307963 java.lang.Object.<init>
5 3.19% 16.47% 32 307973 java.lang.Object.<init>
Cross-referencing the trace number 307973 gives us the stacktrace from the same file:
TRACE 307973: (thread=200001)
So it looks like the mapper is spending 3% of its time constructing IntWritable objects. This observation suggests that it might be worth reusing the Writable instances being output (version 7, see Example 5-13).
Example 5-13. Reusing the Text and IntWritable output objects

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  enum Temperature {
    MALFORMED
  }

  private NcdcRecordParser parser = new NcdcRecordParser();
  private Text year = new Text();
  private IntWritable temp = new IntWritable();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    parser.parse(value);
    if (parser.isValidTemperature()) {
      year.set(parser.getYear());
      temp.set(parser.getAirTemperature());
      context.write(year, temp);
    } else if (parser.isMalformedTemperature()) {
      System.err.println("Ignoring possibly corrupt input: " + value);
      context.getCounter(Temperature.MALFORMED).increment(1);
    }
  }
}
However, we know if this is significant only if we can measure an improvement when running the job over the whole dataset. Running each variant five times on an otherwise quiet 11-node cluster showed no statistically significant difference in job execution time. Of course, this result holds only for this particular combination of code, data, and hardware, so you should perform similar benchmarks to see whether such a change is significant for your setup.
Other profilers
The mechanism for retrieving profile output is HPROF-specific, so if you use another profiler, you will need to manually retrieve the profiler's output from tasktrackers for analysis.
If the profiler is not installed on all the tasktracker machines, consider using the Distributed Cache ("Distributed Cache" on page 288) for making the profiler binary available on the required machines.
MapReduce Workflows
So far in this chapter, you have seen the mechanics of writing a program using MapReduce. We haven't yet considered how to turn a data processing problem into the MapReduce model.
The data processing you have seen so far in this book is to solve a fairly simple problem (finding the maximum recorded temperature for given years). When the processing gets more complex, this complexity is generally manifested by having more MapReduce jobs, rather than having more complex map and reduce functions. In other words, as a rule of thumb, think about adding more jobs, rather than adding complexity to jobs.
For more complex problems, it is worth considering a higher-level language than MapReduce, such as Pig, Hive, Cascading, Cascalog, or Crunch. One immediate benefit is that it frees you up from having to do the translation into MapReduce jobs, allowing you to concentrate on the analysis you are performing.
Finally, the book Data-Intensive Text Processing with MapReduce by Jimmy Lin and Chris Dyer (Morgan & Claypool Publishers, 2010, http://mapreduce.me/) is a great resource for learning more about MapReduce algorithm design, and is highly recommended.
Decomposing a Problem into MapReduce Jobs
Let's look at an example of a more complex problem that we want to translate into a MapReduce workflow.
Imagine that we want to find the mean maximum recorded temperature for every day of the year and every weather station. In concrete terms, to calculate the mean maximum daily temperature recorded by station 029070-99999, say, on January 1, we take the mean of the maximum daily temperatures for this station for January 1, 1901; January 1, 1902; and so on, up to January 1, 2000.
How can we compute this using MapReduce? The computation decomposes most naturally into two stages:
1. Compute the maximum daily temperature for every station-date pair.
The MapReduce program in this case is a variant of the maximum temperature program, except that the keys in this case are a composite station-date pair, rather than just the year.
2. Compute the mean of the maximum daily temperatures for every station-day-month key.
The mapper takes the output from the previous job (station-date, maximum temperature) records and projects it into (station-day-month, maximum temperature) records by dropping the year component. The reduce function then takes the mean of the maximum temperatures for each station-day-month key.
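The two stages can be sketched in plain Java over in-memory records (a toy model of the data flow, not Hadoop code; each record here is an array of station, date, and temperature):

```java
import java.util.HashMap;
import java.util.Map;

public class MeanMaxDailyTemp {

    // Stage 1: maximum temperature for every (station, date) pair
    static Map<String, Integer> stage1(String[][] readings) {
        Map<String, Integer> maxByStationDate = new HashMap<>();
        for (String[] r : readings) {
            String key = r[0] + "\t" + r[1];             // composite station-date key
            maxByStationDate.merge(key, Integer.parseInt(r[2]), Math::max);
        }
        return maxByStationDate;
    }

    // Stage 2: drop the year, then average the daily maxima per (station, day-month)
    static Map<String, Double> stage2(Map<String, Integer> maxByStationDate) {
        Map<String, int[]> sumCount = new HashMap<>();
        for (Map.Entry<String, Integer> e : maxByStationDate.entrySet()) {
            String[] parts = e.getKey().split("\t");
            String dayMonth = parts[1].substring(4);     // 19010101 -> 0101
            int[] sc = sumCount.computeIfAbsent(parts[0] + "\t" + dayMonth,
                                                k -> new int[2]);
            sc[0] += e.getValue();                       // running sum of daily maxima
            sc[1]++;                                     // number of years seen
        }
        Map<String, Double> mean = new HashMap<>();
        sumCount.forEach((k, sc) -> mean.put(k, (double) sc[0] / sc[1]));
        return mean;
    }

    public static void main(String[] args) {
        String[][] readings = {
            {"029070-99999", "19010101", "-10"}, {"029070-99999", "19010101", "0"},
            {"029070-99999", "19020101", "-94"},
        };
        // Daily maxima are 0 (1901) and -94 (1902); their mean is -47.0
        System.out.println(stage2(stage1(readings)));
    }
}
```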
The output from the first stage looks like this for the station we are interested in (the mean_max_daily_temp.sh script in the examples provides an implementation in Hadoop Streaming):
029070-99999 19010101 0
029070-99999 19020101 -94
The first two fields form the key, and the final column is the maximum temperature from all the readings for the given station and date. The second stage averages these daily maxima over years to yield:

029070-99999 0101 -68

which is interpreted as saying that the mean maximum daily temperature on January 1 for station 029070-99999 over the century is −6.8°C.
It's possible to do this computation in one MapReduce stage, but it takes more work on the part of the programmer.
The arguments for having more (but simpler) MapReduce stages are that doing so leads to more composable and more maintainable mappers and reducers. The case studies in Chapter 16 cover a wide range of real-world problems that were solved using MapReduce, and in each case, the data processing task is implemented using two or more MapReduce jobs. The details in that chapter are invaluable for getting a better idea of how to decompose a processing problem into a MapReduce workflow.
It's possible to make map and reduce functions even more composable than we have done. A mapper commonly performs input format parsing, projection (selecting the relevant fields), and filtering (removing records that are not of interest). In the mappers you have seen so far, we have implemented all of these functions in a single mapper. However, there is a case for splitting these into distinct mappers and chaining them into a single mapper using the ChainMapper library class that comes with Hadoop. Combined with a ChainReducer, you can run a chain of mappers, followed by a reducer and another chain of mappers, in a single MapReduce job.

3. It's an interesting exercise to do this. Hint: use "Secondary Sort" on page 276.
When there is more than one job in a MapReduce workflow, the question arises: how do you manage the jobs so they are executed in order? There are several approaches, and the main consideration is whether you have a linear chain of jobs or a more complex directed acyclic graph (DAG) of jobs.
For a linear chain, the simplest approach is to run each job one after another, waiting until a job completes successfully before running the next:
If a job fails, the runJob() method will throw an IOException, so later jobs in the pipeline don't get executed. Depending on your application, you might want to catch the exception and clean up any intermediate data that was produced by any previous jobs.
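The fail-fast behavior of a linear chain can be illustrated with a plain-Java sketch (the Step interface here is a hypothetical stand-in for a job submission call, not the Hadoop API):

```java
import java.io.IOException;
import java.util.List;

public class LinearChain {

    interface Step {
        void runJob() throws IOException;   // stand-in for a blocking job run
    }

    // Run each job in order; an IOException from one job stops the pipeline,
    // so later jobs never execute. Returns the number of jobs that completed.
    static int run(List<Step> pipeline) {
        int completed = 0;
        try {
            for (Step step : pipeline) {
                step.runJob();
                completed++;
            }
        } catch (IOException e) {
            // Depending on the application, clean up intermediate data here
            System.err.println("Pipeline stopped: " + e.getMessage());
        }
        return completed;
    }

    public static void main(String[] args) {
        List<Step> pipeline = List.of(
            () -> {},                                        // job 1 succeeds
            () -> { throw new IOException("job 2 failed"); },
            () -> {}                                         // job 3 never runs
        );
        System.out.println(run(pipeline));                   // prints 1
    }
}
```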
For anything more complex than a linear chain, there are libraries that can help orchestrate your workflow (although they are suited to linear chains, or even one-off jobs, too). The simplest is in the org.apache.hadoop.mapreduce.jobcontrol package: the JobControl class. (There is an equivalent class in the org.apache.hadoop.mapred.jobcontrol package, too.) An instance of JobControl represents a graph of jobs to be run. You add the job configurations, then tell the JobControl instance the dependencies between jobs. You run the JobControl in a thread, and it runs the jobs in dependency order. You can poll for progress, and when the jobs have finished, you can query for all the jobs' statuses and the associated errors for any failures. If a job fails, JobControl won't run its dependencies.
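The idea of dependency-ordered execution can be pictured with a toy scheduler in plain Java (an illustration of the concept only, not the Hadoop JobControl class):

```java
import java.util.*;

public class ToyJobControl {

    // jobs maps a job name to the names of the jobs it depends on.
    // Returns an execution order in which every job runs after its
    // dependencies. Assumes the dependency graph is acyclic.
    static List<String> dependencyOrder(Map<String, List<String>> jobs) {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        while (done.size() < jobs.size()) {
            for (Map.Entry<String, List<String>> e : jobs.entrySet()) {
                // A job is runnable once all of its dependencies have finished
                if (!done.contains(e.getKey()) && done.containsAll(e.getValue())) {
                    order.add(e.getKey());
                    done.add(e.getKey());
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> jobs = new LinkedHashMap<>();
        jobs.put("join", List.of("parse-logs", "load-users"));  // runs last
        jobs.put("parse-logs", List.of());
        jobs.put("load-users", List.of());
        System.out.println(dependencyOrder(jobs));  // [parse-logs, load-users, join]
    }
}
```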
Apache Oozie
If you need to run a complex workflow, or one on a tight production schedule, or you have a large number of connected workflows with data dependencies between them, then a more sophisticated approach is required. Apache Oozie (http://incubator.apache.org/oozie/) fits the bill in any or all of these cases. It has been designed to manage the executions of thousands of dependent workflows, each composed of possibly thousands of constituent actions at the level of an individual MapReduce job.
Oozie has two main parts: a workflow engine that stores and runs workflows composed of Hadoop jobs, and a coordinator engine that runs workflow jobs based on predefined schedules and data availability. The latter property is especially powerful since it allows a workflow job to wait until its input data has been produced by a dependent workflow; also, it makes rerunning failed workflows more tractable, since no time is wasted running successful parts of a workflow. Anyone who has managed a complex batch system knows how difficult it can be to catch up from jobs missed due to downtime or failure, and will appreciate this feature.
Unlike JobControl, which runs on the client machine submitting the jobs, Oozie runs as a service in the cluster, and clients submit workflow definitions for immediate or later execution. In Oozie parlance, a workflow is a DAG of action nodes and control-flow nodes. An action node performs a workflow task, like moving files in HDFS; running a MapReduce, Streaming, Pig, or Hive job; performing a Sqoop import; or running an arbitrary shell script or Java program. A control-flow node governs the workflow execution between actions by allowing such constructs as conditional logic (so different execution branches may be followed depending on the result of an earlier action node) or parallel execution. When the workflow completes, Oozie can make an HTTP callback to the client to inform it of the workflow status. It is also possible to receive callbacks every time the workflow enters or exits an action node.
Defining an Oozie workflow
Workflow definitions are written in XML using the Hadoop Process Definition Language, the specification for which can be found on the Oozie website. Example 5-14 shows a simple Oozie workflow definition for running a single MapReduce job.
Example 5-14. Oozie workflow definition to run the maximum temperature MapReduce job
<workflow-app xmlns="uri:oozie:workflow:0.1" name="max-temp-workflow">
  <start to="max-temp-mr"/>
  <action name="max-temp-mr">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <prepare>
        <delete path="${nameNode}/user/${wf:user()}/output"/>
      </prepare>
      <configuration>
        <!-- properties specifying the mapper and reducer classes for the
             job are elided here -->
        <property>
          <name>mapred.input.dir</name>
          <value>/user/${wf:user()}/input/ncdc/micro</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>/user/${wf:user()}/output</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>MapReduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
This workflow has three control-flow nodes and one action node: a start control node, a map-reduce action node, a kill control node, and an end control node. The nodes and allowed transitions between them are shown in Figure 5-5.
Figure 5-5. Transition diagram of an Oozie workflow
All workflows must have one start and one end node. When the workflow job starts, it transitions to the node specified by the start node (the max-temp-mr action in this example). A workflow job succeeds when it transitions to the end node. However, if the workflow job transitions to a kill node, then it is considered to have failed and reports an error message, as specified by the message element in the workflow definition.
The bulk of this workflow definition file specifies the map-reduce action. The first two elements, job-tracker and name-node, are used to specify the jobtracker to submit the job to, and the namenode (actually a Hadoop filesystem URI) for input and output data. Both are parameterized so that the workflow definition is not tied to a particular cluster (which makes it easy to test). The parameters are specified as workflow job properties at submission time, as we shall see later.
The optional prepare element runs before the MapReduce job, and is used for directory deletion (and creation, too, if needed, although that is not shown here). By ensuring that the output directory is in a consistent state before running a job, Oozie can safely rerun the action if the job fails.
The MapReduce job to run is specified in the configuration element using nested elements for specifying the Hadoop configuration name-value pairs. You can view the MapReduce configuration section as a declarative replacement for the driver classes that we have used elsewhere in this book for running MapReduce programs (such as Example 2-6).
There are two nonstandard Hadoop properties, mapred.input.dir and mapred.output.dir, which are used to set the FileInputFormat input paths and FileOutputFormat output path, respectively.
We have taken advantage of JSP Expression Language (EL) functions in several places in the workflow definition. Oozie provides a set of functions for interacting with the workflow; ${wf:user()}, for example, returns the name of the user who started the current workflow job, and we use it to specify the correct filesystem path. The Oozie specification lists all the EL functions that Oozie supports.
Packaging and deploying an Oozie workflow application
A workflow application is made up of the workflow definition plus all the associated resources (such as MapReduce JAR files, Pig scripts, and so on) needed to run it. Applications must adhere to a simple directory structure, and are deployed to HDFS so that they can be accessed by Oozie. For this workflow application, we'll put all of the files in a base directory called max-temp-workflow, as shown diagrammatically here:

max-temp-workflow/
    lib/
        hadoop-examples.jar
    workflow.xml

The workflow definition file workflow.xml must appear in the top level of this directory. JAR files containing the application's MapReduce classes are placed in the lib directory.
Workflow applications that conform to this layout can be built with any suitable build tool, like Ant or Maven; you can find an example in the code that accompanies this book. Once an application has been built, it should be copied to HDFS using regular Hadoop tools. Here is the appropriate command for this application:

% hadoop fs -put hadoop-examples/target/max-temp-workflow max-temp-workflow
Running an Oozie workflow job
Next let's see how to run a workflow job for the application we just uploaded. For this we use the oozie command-line tool, a client program for communicating with an Oozie server. For convenience, we export the OOZIE_URL environment variable to tell the oozie command which Oozie server to use (we're using one running locally here):
% export OOZIE_URL="http://localhost:11000/oozie"
There are lots of subcommands for the oozie tool (type oozie help to get a list), but we're going to call the job subcommand with the -run option to run the workflow job:

% oozie job -config ch05/src/main/resources/max-temp-workflow.properties -run
job: 0000009-120119174508294-oozie-tom-W

The -config option specifies a local Java properties file containing definitions for the parameters in the workflow XML file (in this case, nameNode and jobTracker), as well as oozie.wf.application.path, which tells Oozie the location of the workflow application in HDFS. Here are the contents of the properties file:
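The file's contents are not reproduced above; a minimal properties file consistent with the description might look like this (the filesystem and jobtracker addresses are illustrative):

```properties
# Illustrative values; the property names are those described in the text
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
oozie.wf.application.path=${nameNode}/user/${user.name}/max-temp-workflow
```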
To get information about the status of the workflow job, we use the -info option, using the job ID that was printed by the run command earlier (type oozie job to get a list of all jobs):

% oozie job -info 0000009-120119174508294-oozie-tom-W

The output shows the status: RUNNING, KILLED, or SUCCEEDED. You can find all this information via Oozie's web UI too, available at http://localhost:11000/oozie.
When the job has succeeded, we can inspect the results in the usual way:
% hadoop fs -cat output/part-*
1949 111
1950 22
This example only scratched the surface of writing Oozie workflows. The documentation on Oozie's website has information about creating more complex workflows, as well as writing and running coordinator jobs.
How MapReduce Works
In this chapter, we look at how MapReduce in Hadoop works in detail. This knowledge provides a good foundation for writing more advanced MapReduce programs, which we will cover in the following two chapters.
Anatomy of a MapReduce Job Run
You can run a MapReduce job with a single method call: submit() on a Job object (note that you can also call waitForCompletion(), which will submit the job if it hasn't been submitted already, then wait for it to finish). This method call conceals a great deal of processing behind the scenes. This section uncovers the steps Hadoop takes to run a job.
We saw in Chapter 5 that the way Hadoop executes a MapReduce program depends on a couple of configuration settings.
In releases of Hadoop up to and including the 0.20 release series, mapred.job.tracker determines the means of execution. If this configuration property is set to local, the default, then the local job runner is used. This runner runs the whole job in a single JVM. It's designed for testing and for running MapReduce programs on small datasets. Alternatively, if mapred.job.tracker is set to a colon-separated host and port pair, then the property is interpreted as a jobtracker address, and the runner submits the job to the jobtracker at that address. The whole process is described in detail in the next section.
In Hadoop 0.23.0 a new MapReduce implementation was introduced. The new implementation (called MapReduce 2) is built on a system called YARN, described in "YARN (MapReduce 2)" on page 194. For now, just note that the framework that is used for execution is set by the mapreduce.framework.name property, which takes the values local (for the local job runner), classic (for the "classic" MapReduce framework, also called MapReduce 1, which uses a jobtracker and tasktrackers), and yarn (for the new framework).

1. In the old MapReduce API, you can call JobClient.submitJob(conf) or JobClient.runJob(conf).
It's important to realize that the old and new MapReduce APIs are not the same thing as the classic and YARN-based MapReduce implementations (MapReduce 1 and 2, respectively). The APIs are user-facing, client-side features and determine how you write MapReduce programs, while the implementations are just different ways of running MapReduce programs. All four combinations are supported: both the old and new API run on both MapReduce 1 and 2. Table 1-2 lists which of these combinations are supported in the different Hadoop releases.
Classic MapReduce (MapReduce 1)
A job run in classic MapReduce is illustrated in Figure 6-1. At the highest level, there are four independent entities:
• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
• The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
• The distributed filesystem (normally HDFS, covered in Chapter 3), which is used for sharing job files between the other entities.
Job Submission
The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on it (step 1 in Figure 6-1). Having submitted the job, waitForCompletion() polls the job's progress once a second and reports the progress to the console if it has changed since the last report. When the job is complete, if it was successful, the job counters are displayed. Otherwise, the error that caused the job to fail is logged to the console.
The job submission process implemented by JobSubmitter does the following:
• Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).
• Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
• Computes the input splits for the job. If the splits cannot be computed, because the input paths don't exist, for example, then the job is not submitted and an error is thrown to the MapReduce program.
• Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker's filesystem in a directory named after the job ID. The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3).
Figure 6-1. How Hadoop runs a MapReduce job using the classic framework
• Tells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker) (step 4).
Job Initialization
When the JobTracker receives a call to its submitJob() method, it puts it into an internal queue from where the job scheduler will pick it up and initialize it. Initialization involves creating an object to represent the job being run, which encapsulates its tasks, and bookkeeping information to keep track of the tasks' status and progress (step 5).
To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the client from the shared filesystem (step 6). It then creates one map task for each split. The number of reduce tasks to create is determined by the mapred.reduce.tasks property in the Job, which is set by the setNumReduceTasks() method, and the scheduler simply creates this number of reduce tasks to be run. Tasks are given IDs at this point.
In addition to the map and reduce tasks, two further tasks are created: a job setup task and a job cleanup task. These are run by tasktrackers and are used to run code to set up the job before any map tasks run, and to clean up after all the reduce tasks are complete. The OutputCommitter that is configured for the job determines the code to be run, and by default this is a FileOutputCommitter. For the job setup task it will create the final output directory for the job and the temporary working space for the task output, and for the job cleanup task it will delete the temporary working space for the task output. The commit protocol is described in more detail in "Output Committers" on page 215.
Task Assignment
Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker. Heartbeats tell the jobtracker that a tasktracker is alive, but they also double as a channel for messages. As part of the heartbeat, a tasktracker will indicate whether it is ready to run a new task, and if it is, the jobtracker will allocate it a task, which it communicates to the tasktracker using the heartbeat return value (step 7).
Before it can choose a task for the tasktracker, the jobtracker must choose a job to select the task from. There are various scheduling algorithms, as explained later in this chapter (see "Job Scheduling" on page 204), but the default one simply maintains a priority list of jobs. Having chosen a job, the jobtracker now chooses a task for the job.
Tasktrackers have a fixed number of slots for map tasks and for reduce tasks: for example, a tasktracker may be able to run two map tasks and two reduce tasks simultaneously. (The precise number depends on the number of cores and the amount of memory on the tasktracker; see "Memory" on page 305.) The default scheduler fills empty map task slots before reduce task slots, so if the tasktracker has at least one empty map task slot, the jobtracker will select a map task; otherwise, it will select a reduce task.
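This slot-filling preference can be sketched as a tiny helper. The SlotChooser class and its string results are hypothetical names used purely for illustration; this is not part of Hadoop's API.

```java
/** Toy model of the default scheduler's slot-filling rule: map slots first. */
public class SlotChooser {
    /**
     * Returns the type of task to assign, given the tasktracker's free slots.
     * Mirrors the rule described above: if at least one map slot is empty,
     * assign a map task; otherwise assign a reduce task (or nothing at all,
     * in which case the heartbeat returns no task).
     */
    public static String choose(int freeMapSlots, int freeReduceSlots) {
        if (freeMapSlots > 0) return "MAP";
        if (freeReduceSlots > 0) return "REDUCE";
        return "NONE";
    }
}
```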
190 | Chapter 6: How MapReduce Works
To choose a reduce task, the jobtracker simply takes the next in its list of yet-to-be-run reduce tasks, since there are no data locality considerations. For a map task, however, it takes account of the tasktracker's network location and picks a task whose input split is as close as possible to the tasktracker. In the optimal case, the task is data-local, that is, running on the same node that the split resides on. Alternatively, the task may be rack-local: on the same rack, but not the same node, as the split. Some tasks are neither data-local nor rack-local and retrieve their data from a different rack from the one they are running on. You can tell the proportion of each type of task by looking at a job's counters (see "Built-in Counters" on page 257).
Task Execution
Now that the tasktracker has been assigned a task, the next step is for it to run the task. First, it localizes the job JAR by copying it from the shared filesystem to the tasktracker's filesystem. It also copies any files needed by the application from the distributed cache to the local disk; see "Distributed Cache" on page 288 (step 8). Second, it creates a local working directory for the task, and un-jars the contents of the JAR into this directory. Third, it creates an instance of TaskRunner to run the task.
TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step 10), so that any bugs in the user-defined map and reduce functions don't affect the tasktracker (by causing it to crash or hang, for example). It is, however, possible to reuse the JVM between tasks; see "Task JVM Reuse" on page 216.
The child process communicates with its parent through the umbilical interface. This way it informs the parent of the task's progress every few seconds until the task is complete.
Each task can perform setup and cleanup actions, which are run in the same JVM as the task itself, and are determined by the OutputCommitter for the job (see "Output Committers" on page 215). The cleanup action is used to commit the task, which in the case of file-based jobs means that its output is written to the final location for that task. The commit protocol ensures that when speculative execution is enabled ("Speculative Execution" on page 213), only one of the duplicate tasks is committed and the other is aborted.
Streaming and Pipes. Both Streaming and Pipes run special map and reduce tasks for the purpose of launching the user-supplied executable and communicating with it (Figure 6-2).
In the case of Streaming, the Streaming task communicates with the process (which may be written in any language) using standard input and output streams. The Pipes task, on the other hand, listens on a socket and passes the C++ process a port number in its environment, so that on startup, the C++ process can establish a persistent socket connection back to the parent Java Pipes task.
In both cases, during execution of the task, the Java process passes input key-value pairs to the external process, which runs them through the user-defined map or reduce function and passes the output key-value pairs back to the Java process. From the tasktracker's point of view, it is as if the tasktracker child process ran the map or reduce code itself.
Progress and Status Updates
MapReduce jobs are long-running batch jobs, taking anything from minutes to hours to run. Because this is a significant length of time, it's important for the user to get feedback on how the job is progressing. A job and each of its tasks have a status, which includes such things as the state of the job or task (e.g., running, successfully completed, failed), the progress of maps and reduces, the values of the job's counters, and a status message or description (which may be set by user code). These statuses change over the course of the job, so how do they get communicated back to the client?
When a task is running, it keeps track of its progress, that is, the proportion of the task completed. For map tasks, this is the proportion of the input that has been processed. For reduce tasks, it's a little more complex, but the system can still estimate the proportion of the reduce input processed. It does this by dividing the total progress into three parts, corresponding to the three phases of the shuffle (see "Shuffle and Sort" on page 205). For example, if the task has run the reducer on half its input, then the task's progress is ⅚, since it has completed the copy and sort phases (⅓ each) and is halfway through the reduce phase (⅙).
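The arithmetic behind this estimate can be sketched as follows. This is a toy model, not Hadoop code; each phase fraction is assumed to be between 0.0 and 1.0.

```java
/** Toy model of a reduce task's overall progress estimate. */
public class ReduceProgress {
    /**
     * The copy, sort, and reduce phases each contribute up to one third
     * of the total, so overall progress is the mean of the three
     * per-phase fractions.
     */
    public static double overall(double copy, double sort, double reduce) {
        return (copy + sort + reduce) / 3.0;
    }
}
```

With the copy and sort phases complete and the reducer halfway through its input, overall(1.0, 1.0, 0.5) gives ⅚, matching the example above.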
Figure 6-2. The relationship of the Streaming and Pipes executable to the tasktracker and its child
What Constitutes Progress in MapReduce?
Progress is not always measurable, but nevertheless it tells Hadoop that a task is doing something. For example, a task writing output records is making progress, even though it cannot be expressed as a percentage of the total number that will be written, since the latter figure may not be known, even by the task producing the output.
Progress reporting is important, as it means Hadoop will not fail a task that's making progress. All of the following operations constitute progress:
• Reading an input record (in a mapper or reducer)
• Writing an output record (in a mapper or reducer)
• Setting the status description on a reporter (using Reporter's setStatus() method)
• Incrementing a counter (using Reporter's incrCounter() method)
• Calling Reporter's progress() method
Tasks also have a set of counters that count various events as the task runs (we saw an example in "A test run" on page 25), either those built into the framework, such as the number of map output records written, or ones defined by users.
If a task reports progress, it sets a flag to indicate that the status change should be sent to the tasktracker. The flag is checked in a separate thread every three seconds, and if set, it notifies the tasktracker of the current task status. Meanwhile, the tasktracker is sending heartbeats to the jobtracker every five seconds (this is a minimum, as the heartbeat interval is actually dependent on the size of the cluster: for larger clusters, the interval is longer), and the status of all the tasks being run by the tasktracker is sent in the call. Counters are sent less frequently than every five seconds, because they can be relatively high-bandwidth.
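The exact heartbeat-scaling formula varies between Hadoop versions, but the shape of the behavior can be illustrated with a hypothetical helper. Both the class and the constants here are assumptions for illustration only; this is not the jobtracker's real formula.

```java
/** Hypothetical illustration of heartbeat intervals growing with cluster size. */
public class HeartbeatInterval {
    /**
     * Five seconds minimum, plus one extra second for every additional
     * 100 tasktrackers in the cluster. Only the trend matters here:
     * larger clusters mean longer heartbeat intervals.
     */
    public static int seconds(int clusterSize) {
        int extra = Math.max(0, (clusterSize - 1) / 100);
        return 5 + extra;
    }
}
```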
The jobtracker combines these updates to produce a global view of the status of all the jobs being run and their constituent tasks. Finally, as mentioned earlier, the Job receives the latest status by polling the jobtracker every second. Clients can also use Job's getStatus() method to obtain a JobStatus instance, which contains all of the status information for the job.
The method calls are illustrated in Figure 6-3.
Job Completion
When the jobtracker receives a notification that the last task for a job is complete (this will be the special job cleanup task), it changes the status for the job to "successful." Then, when the Job polls for status, it learns that the job has completed successfully, so it prints a message to tell the user and then returns from the waitForCompletion() method.
The jobtracker also sends an HTTP job notification if it is configured to do so. This can be configured by clients wishing to receive callbacks, via the job.end.notification.url property.
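For example, a client might register a callback with a configuration fragment like the following. The URL is a placeholder, and whether your Hadoop version expands placeholders such as $jobId and $jobStatus in the URL should be checked against its documentation:

```xml
<property>
  <name>job.end.notification.url</name>
  <value>http://example.com/notify?jobid=$jobId&amp;status=$jobStatus</value>
</property>
```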
Last, the jobtracker cleans up its working state for the job and instructs tasktrackers to do the same (so intermediate output is deleted, for example).
YARN (MapReduce 2)
For very large clusters in the region of 4,000 nodes and higher, the MapReduce system described in the previous section begins to hit scalability bottlenecks, so in 2010 a group at Yahoo! began to design the next generation of MapReduce. The result was YARN, short for Yet Another Resource Negotiator (or, if you prefer recursive acronyms, YARN Application Resource Negotiator).
2. You can read more about the motivation for and development of YARN in Arun C Murthy's post, The Next Generation of Apache Hadoop MapReduce.
YARN addresses the scalability shortcomings of "classic" MapReduce by splitting the responsibilities of the jobtracker into separate entities. The jobtracker takes care of both job scheduling (matching tasks with tasktrackers) and task progress monitoring (keeping track of tasks, restarting failed or slow tasks, and doing task bookkeeping, such as maintaining counter totals).
YARN separates these two roles into two independent daemons: a resource manager to manage the use of resources across the cluster, and an application master to manage the lifecycle of applications running on the cluster. The idea is that an application master negotiates with the resource manager for cluster resources, described in terms of a number of containers each with a certain memory limit, then runs application-specific processes in those containers. The containers are overseen by node managers running on cluster nodes, which ensure that the application does not use more resources than it has been allocated.
Figure 6-3. How status updates are propagated through the MapReduce 1 system
3. At the time of writing, memory is the only resource that is managed, and node managers will kill any containers that exceed their allocated memory limits.
In contrast to the jobtracker, each instance of an application (here, a MapReduce job) has a dedicated application master, which runs for the duration of the application. This model is actually closer to the original Google MapReduce paper, which describes how a master process is started to coordinate map and reduce tasks running on a set of workers.
As described, YARN is more general than MapReduce, and in fact MapReduce is just one type of YARN application. There are a few other YARN applications, such as a distributed shell that can run a script on a set of nodes in the cluster, and others are actively being worked on (some are listed at http://wiki.apache.org/hadoop/PoweredByYarn). The beauty of YARN's design is that different YARN applications can coexist on the same cluster, so a MapReduce application can run at the same time as an MPI application, for example, which brings great benefits for manageability and cluster utilization.
Furthermore, it is even possible for users to run different versions of MapReduce on the same YARN cluster, which makes the process of upgrading MapReduce more manageable. (Note that some parts of MapReduce, like the job history server and the shuffle handler, as well as YARN itself, still need to be upgraded across the cluster.)
MapReduce on YARN involves more entities than classic MapReduce. They are:
• The client, which submits the MapReduce job.
• The YARN resource manager, which coordinates the allocation of compute resources on the cluster.
• The YARN node managers, which launch and monitor the compute containers on machines in the cluster.
• The MapReduce application master, which coordinates the tasks running the MapReduce job. The application master and the MapReduce tasks run in containers that are scheduled by the resource manager, and managed by the node managers.
• The distributed filesystem (normally HDFS, covered in Chapter 3), which is used for sharing job files between the other entities.
The process of running a job is shown in Figure 6-4, and described in the following sections.
Job Submission
Jobs are submitted in MapReduce 2 using the same user API as MapReduce 1 (step 1). MapReduce 2 has an implementation of ClientProtocol that is activated when mapreduce.framework.name is set to yarn. The submission process is very similar to the classic implementation. The new job ID is retrieved from the resource manager (rather than the jobtracker), although in the nomenclature of YARN it is an application ID (step 2). The job client checks the output specification of the job; computes input splits (although there is an option to generate them on the cluster, yarn.app.mapreduce.am.compute-splits-in-cluster, which can be beneficial for jobs with many splits); and copies job resources (including the job JAR, configuration, and split information) to HDFS (step 3). Finally, the job is submitted by calling submitApplication() on the resource manager (step 4).
4. Not discussed in this section are the job history server daemon (for retaining job history data) and the shuffle handler auxiliary service (for serving map outputs to reduce tasks), which are part of the jobtracker and the tasktracker respectively in classic MapReduce, but are independent entities in YARN.
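Selecting the YARN runtime is therefore a matter of configuration. A minimal fragment might look like the following (where this fragment lives, e.g. mapred-site.xml, depends on your installation):

```xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
```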
Job Initialization
When the resource manager receives a call to its submitApplication() method, it hands off the request to the scheduler. The scheduler allocates a container, and the resource manager then launches the application master's process there, under the node manager's management (steps 5a and 5b).
The application master for MapReduce jobs is a Java application whose main class is MRAppMaster. It initializes the job by creating a number of bookkeeping objects to keep track of the job's progress, as it will receive progress and completion reports from the tasks (step 6). Next, it retrieves the input splits computed in the client from the shared filesystem (step 7). It then creates a map task object for each split, and a number of reduce task objects determined by the mapreduce.job.reduces property.
The next thing the application master does is decide how to run the tasks that make up the MapReduce job. If the job is small, the application master may choose to run the tasks in the same JVM as itself, since it judges that the overhead of allocating new containers and running tasks in them outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. (This is different from MapReduce 1, where small jobs are never run on a single tasktracker.) Such a job is said to be uberized, or run as an uber task.
Figure 6-4. How Hadoop runs a MapReduce job using YARN
What qualifies as a small job? By default, one that has fewer than 10 mappers, only one reducer, and an input size that is less than the size of one HDFS block. (These values may be changed for a job by setting mapreduce.job.ubertask.maxmaps, mapreduce.job.ubertask.maxreduces, and mapreduce.job.ubertask.maxbytes.) It's also possible to disable uber tasks entirely (by setting mapreduce.job.ubertask.enable to false).
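The three default thresholds above can be sketched as a single predicate. This is a toy model with hypothetical names, not the MRAppMaster's actual code:

```java
/** Toy model of the uber-task ("small job") decision described above. */
public class UberDecision {
    /**
     * A job qualifies with the default thresholds when it has fewer than
     * 10 map tasks, at most one reduce task, and input smaller than a
     * single HDFS block.
     */
    public static boolean isUber(int maps, int reduces,
                                 long inputBytes, long blockSizeBytes) {
        return maps < 10 && reduces <= 1 && inputBytes < blockSizeBytes;
    }
}
```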
Before any tasks can be run, the job setup method is called (for the job's OutputCommitter), to create the job's output directory. In contrast to MapReduce 1, where it is called in a special task that is run by the tasktracker, in the YARN implementation the method is called directly by the application master.
Task Assignment
If the job does not qualify for running as an uber task, the application master requests containers for all the map and reduce tasks in the job from the resource manager (step 8). Each request, which is piggybacked on a heartbeat call, includes information about the map task's data locality, in particular the hosts and corresponding racks that its input split resides on. The scheduler uses this information to make scheduling decisions (just like a jobtracker's scheduler does): it attempts to place tasks on data-local nodes in the ideal case, but if this is not possible, the scheduler prefers rack-local placement to non-local placement.
Requests also specify memory requirements for tasks. By default, both map and reduce tasks are allocated 1024 MB of memory, but this is configurable by setting mapreduce.map.memory.mb and mapreduce.reduce.memory.mb.
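For example, a job could give its reduce tasks more memory than its map tasks with a fragment like this (the values are illustrative, not recommendations):

```xml
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>
</property>
```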
The way memory is allocated is different from MapReduce 1, where tasktrackers have a fixed number of "slots", set at cluster configuration time, and each task runs in a single slot. Slots have a maximum memory allowance, which again is fixed for a cluster, and which leads both to problems of underutilization when tasks use less memory (since other waiting tasks are not able to take advantage of the unused memory) and problems of job failure when a task can't complete because it can't get enough memory to run.
In YARN, resources are more fine-grained, so both these problems can be avoided. In particular, applications may request a memory capability that is anywhere between the minimum allocation and a maximum allocation, and which must be a multiple of the minimum allocation. Default memory allocations are scheduler-specific; for the capacity scheduler, the default minimum is 1024 MB (set by yarn.scheduler.capacity.minimum-allocation-mb) and the default maximum is 10240 MB (set by yarn.scheduler.capacity.maximum-allocation-mb). Thus, tasks can request any memory allocation between 1 and 10 GB (inclusive), in multiples of 1 GB (the scheduler will round to the nearest multiple if needed), by setting mapreduce.map.memory.mb and mapreduce.reduce.memory.mb appropriately.
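The rounding-and-clamping rule can be sketched like so. This is a simplified model; the class name is hypothetical and scheduler internals differ:

```java
/** Sketch of normalizing a container memory request against scheduler limits. */
public class MemoryRequest {
    /**
     * Rounds the request to the nearest multiple of the minimum allocation,
     * then clamps the result to the [min, max] range, mirroring the behavior
     * described above for the capacity scheduler defaults
     * (1024 MB minimum, 10240 MB maximum).
     */
    public static int normalizeMb(int requestedMb, int minMb, int maxMb) {
        int rounded = Math.round((float) requestedMb / minMb) * minMb;
        return Math.max(minMb, Math.min(maxMb, rounded));
    }
}
```

For instance, with the defaults a 1600 MB request is rounded up to 2048 MB, while a request above the maximum is capped at 10240 MB.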
Task Execution
Once a task has been assigned a container by the resource manager's scheduler, the application master starts the container by contacting the node manager (steps 9a and 9b). The task is executed by a Java application whose main class is YarnChild. Before it can run the task, it localizes the resources that the task needs, including the job configuration and JAR file, and any files from the distributed cache (step 10). Finally, it runs the map or reduce task (step 11).
YarnChild runs in a dedicated JVM, for the same reason that tasktrackers spawn new JVMs for tasks in MapReduce 1: to isolate user code from long-running system daemons. Unlike MapReduce 1, however, YARN does not support JVM reuse, so each task runs in a new JVM.
Streaming and Pipes programs work in the same way as in MapReduce 1. The YarnChild launches the Streaming or Pipes process and communicates with it using standard input/output or a socket (respectively), as shown in Figure 6-2 (except that the child and subprocesses run on node managers, not tasktrackers).
Progress and Status Updates
When running under YARN, the task reports its progress and status (including counters) back to its application master every three seconds (over the umbilical interface), which has an aggregate view of the job. The process is illustrated in Figure 6-5. Contrast this to MapReduce 1, where progress updates flow from the child through the tasktracker to the jobtracker for aggregation.
The client polls the application master every second (set via mapreduce.client.progressmonitor.pollinterval) to receive progress updates, which are usually displayed to the user.
Job Completion
As well as polling the application master for progress, every five seconds the client checks whether the job has completed, when using the waitForCompletion() method on Job. The polling interval can be set via the mapreduce.client.completion.pollinterval configuration property.
Notification of job completion via an HTTP callback is also supported, as in MapReduce 1. In MapReduce 2, the application master initiates the callback.
Figure 6-5. How status updates are propagated through the MapReduce 2 system
On job completion, the application master and the task containers clean up their working state, and the OutputCommitter's job cleanup method is called. Job information is archived by the job history server to enable later interrogation by users if desired.
Failures
In the real world, user code is buggy, processes crash, and machines fail. One of the major benefits of using Hadoop is its ability to handle such failures and allow your job to complete.
Failures in Classic MapReduce
In the MapReduce 1 runtime there are three failure modes to consider: failure of the running task, failure of the tasktracker, and failure of the jobtracker. Let's look at each in turn.
Task Failure
Consider first the case of the child task failing. The most common way that this happens is when user code in the map or reduce task throws a runtime exception. If this happens, the child JVM reports the error back to its parent tasktracker before it exits. The error ultimately makes it into the user logs. The tasktracker marks the task attempt as failed, freeing up a slot to run another task.
For Streaming tasks, if the Streaming process exits with a nonzero exit code, it is marked as failed. This behavior is governed by the stream.non.zero.exit.is.failure property (the default is true).
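A job that wants to tolerate nonzero exit codes from its Streaming processes (not usually advisable) could override the property:

```xml
<property>
  <name>stream.non.zero.exit.is.failure</name>
  <value>false</value>
</property>
```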
Another failure mode is the sudden exit of the child JVM, perhaps because of a JVM bug that causes the JVM to exit for a particular set of circumstances exposed by the MapReduce user code. In this case, the tasktracker notices that the process has exited and marks the attempt as failed.
Hanging tasks are dealt with differently. The tasktracker notices that it hasn't received a progress update for a while and proceeds to mark the task as failed. The child JVM process will be automatically killed after this period.
The timeout period after which tasks are considered failed is normally 10 minutes and can be configured on a per-job basis (or a cluster basis) by setting the mapred.task.timeout property to a value in milliseconds.
Setting the timeout to a value of zero disables the timeout, so long-running tasks are never marked as failed. In this case, a hanging task will never free up its slot, and over time there may be cluster slowdown as a result. This approach should therefore be avoided; making sure that a task is reporting progress periodically will suffice (see "What Constitutes Progress in MapReduce?" on page 193).
5. If a Streaming or Pipes process hangs, the tasktracker will kill it (along with the JVM that launched it) only in one of the following circumstances: either mapred.task.tracker.task-controller is set to org.apache.hadoop.mapred.LinuxTaskController, or the default task controller is being used (org.apache.hadoop.mapred.DefaultTaskController) and the setsid command is available on the system (so that the child JVM and any processes it launches are in the same process group). In any other case, orphaned Streaming or Pipes processes will accumulate on the system, which will impact utilization over time.
When the jobtracker is notified of a task attempt that has failed (by the tasktracker's heartbeat call), it will reschedule execution of the task. The jobtracker will try to avoid rescheduling the task on a tasktracker where it has previously failed. Furthermore, if a task fails four times (or more), it will not be retried further. This value is configurable: the maximum number of attempts to run a task is controlled by the mapred.map.max.attempts property for map tasks and mapred.reduce.max.attempts for reduce tasks. By default, if any task fails four times (or whatever the maximum number of attempts is configured to), the whole job fails.
For some applications, it is undesirable to abort the job if a few tasks fail, as it may be possible to use the results of the job despite some failures. In this case, the maximum percentage of tasks that are allowed to fail without triggering job failure can be set for the job. Map tasks and reduce tasks are controlled independently, using the mapred.max.map.failures.percent and mapred.max.reduce.failures.percent properties.
A task attempt may also be killed, which is different from it failing. A task attempt may be killed because it is a speculative duplicate (for more, see "Speculative Execution" on page 213), or because the tasktracker it was running on failed and the jobtracker marked all the task attempts running on it as killed. Killed task attempts do not count against the number of attempts to run the task (as set by mapred.map.max.attempts and mapred.reduce.max.attempts), since it wasn't the task's fault that an attempt was killed.
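The failed-versus-killed accounting can be sketched as follows. This is a toy model with hypothetical names, not the jobtracker's actual bookkeeping:

```java
import java.util.List;

/** Toy model: killed attempts don't count toward the attempt limit. */
public class AttemptAccounting {
    /**
     * Returns true if the task has exhausted its attempts. Only attempts
     * with outcome "FAILED" count toward maxAttempts (4 by default);
     * "KILLED" attempts, such as speculative duplicates, are ignored.
     */
    public static boolean exhausted(List<String> outcomes, int maxAttempts) {
        long failures = outcomes.stream().filter("FAILED"::equals).count();
        return failures >= maxAttempts;
    }
}
```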
Users may also kill or fail task attempts using the web UI or the command line (type hadoop job to see the options). Jobs may also be killed by the same mechanisms.
Tasktracker Failure
Failure of a tasktracker is another failure mode. If a tasktracker fails by crashing or running very slowly, it will stop sending heartbeats to the jobtracker (or send them very infrequently). The jobtracker will notice a tasktracker that has stopped sending heartbeats (if it hasn't received one for 10 minutes, configured via the mapred.tasktracker.expiry.interval property, in milliseconds) and remove it from its pool of tasktrackers to schedule tasks on. The jobtracker arranges for map tasks that were run and completed successfully on that tasktracker to be rerun if they belong to incomplete jobs, since their intermediate output residing on the failed tasktracker's local filesystem may not be accessible to the reduce task. Any tasks in progress are also rescheduled.
Failures | 201
A tasktracker can also be blacklisted by the jobtracker, even if the tasktracker has not failed. If more than four tasks from the same job fail on a particular tasktracker (set by mapred.max.tracker.failures), the jobtracker records this as a fault. A tasktracker is blacklisted if the number of faults is over some minimum threshold (four, set by mapred.max.tracker.blacklists) and is significantly higher than the average number of faults for tasktrackers in the cluster.
Blacklisted tasktrackers are not assigned tasks, but they continue to communicate with the jobtracker. Faults expire over time (at the rate of one per day), so tasktrackers get the chance to run jobs again simply by leaving them running. Alternatively, if there is an underlying fault that can be fixed (by replacing hardware, for example), the tasktracker will be removed from the jobtracker's blacklist after it restarts and rejoins the cluster.
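The blacklisting heuristic can be sketched as a predicate. The class name is hypothetical, and since the text does not quantify "significantly higher than the average", the 50%-above-average cutoff below is purely an assumption for illustration:

```java
/** Sketch of the jobtracker's tasktracker-blacklisting heuristic. */
public class Blacklist {
    /**
     * A tasktracker is blacklisted when its fault count exceeds a minimum
     * threshold (4 by default) AND is significantly higher than the cluster
     * average. "Significantly" is modeled here, as an assumption, as at
     * least 50% above the average.
     */
    public static boolean isBlacklisted(int faults, double clusterAverage,
                                        int minThreshold) {
        return faults > minThreshold && faults > 1.5 * clusterAverage;
    }
}
```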
Jobtracker Failure
Failure of the jobtracker is the most serious failure mode. Hadoop has no mechanism for dealing with failure of the jobtracker (it is a single point of failure), so in this case the job fails. However, this failure mode has a low chance of occurring, since the chance of a particular machine failing is low. The good news is that the situation is improved in YARN, since one of its design goals is to eliminate single points of failure in MapReduce.
After restarting a jobtracker, any jobs that were running at the time it was stopped will need to be resubmitted. There is a configuration option that attempts to recover any running jobs (mapred.jobtracker.restart.recover, turned off by default); however, it is known not to work reliably, so it should not be used.
Failures in YARN
For MapReduce programs running on YARN, we need to consider the failure of any of the following entities: the task, the application master, the node manager, and the resource manager.
Task Failure
Failure of the running task is similar to the classic case. Runtime exceptions and sudden exits of the JVM are propagated back to the application master, and the task attempt is marked as failed. Likewise, hanging tasks are noticed by the application master by the absence of a ping over the umbilical channel (the timeout is set by mapreduce.task.timeout), and again the task attempt is marked as failed.
The configuration properties for determining when a task is considered to be failed are the same as in the classic case: a task is marked as failed after four attempts (set by mapreduce.map.maxattempts for map tasks and mapreduce.reduce.maxattempts for reducer tasks). A job will be failed if more than mapreduce.map.failures.maxpercent percent of the map tasks in the job fail, or more than mapreduce.reduce.failures.maxpercent percent of the reduce tasks fail.
Application Master Failure
Just as MapReduce tasks are given several attempts to succeed (in the face of hardware or network failures), applications in YARN are tried multiple times in the event of failure. By default, applications are marked as failed if they fail once, but this can be increased by setting the property yarn.resourcemanager.am.max-retries.
An application master sends periodic heartbeats to the resource manager, and in the event of application master failure, the resource manager will detect the failure and start a new instance of the master running in a new container (managed by a node manager). In the case of the MapReduce application master, it can recover the state of the tasks that had already been run by the (failed) application so they don't have to be rerun. By default, recovery is not enabled, so failed application masters will rerun all their tasks, but you can turn recovery on by setting yarn.app.mapreduce.am.job.recovery.enable to true.
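Both knobs can be set via configuration; for example (the retry count here is illustrative):

```xml
<property>
  <name>yarn.resourcemanager.am.max-retries</name>
  <value>2</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.job.recovery.enable</name>
  <value>true</value>
</property>
```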
The client polls the application master for progress reports, so if its application master fails, the client needs to locate the new instance. During job initialization, the client asks the resource manager for the application master's address, and then caches it, so it doesn't overload the resource manager with a request every time it needs to poll the application master. If the application master fails, however, the client will experience a timeout when it issues a status update, at which point the client will go back to the resource manager to ask for the new application master's address.
Node Manager Failure
If a node manager fails, it will stop sending heartbeats to the resource manager, and the node manager will be removed from the resource manager's pool of available nodes. The property yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms, which defaults to 600000 (10 minutes), determines the time the resource manager waits before considering a node manager that has sent no heartbeat in that time to have failed.
Any task or application master running on the failed node manager will be recovered using the mechanisms described in the previous two sections.
Node managers may be blacklisted if the number of failures for the application is high. Blacklisting is done by the application master, and for MapReduce the application master will try to reschedule tasks on different nodes if more than three tasks fail on a node manager. The threshold may be set with mapreduce.job.maxtaskfai
Resource Manager Failure
Failure of the resource manager is serious, since without it neither jobs nor task con-
tainers can be launched. The resource manager was designed from the outset to be able
to recover from crashes, by using a checkpointing mechanism to save its state to per-
sistent storage, although at the time of writing the latest release did not have a complete
implementation.
After a crash, a new resource manager instance is brought up (by an administrator) and
it recovers from the saved state. The state consists of the node managers in the system
as well as the running applications. (Note that tasks are not part of the resource man-
ager's state, since they are managed by the application. Thus the amount of state to be
stored is much more manageable than that of the jobtracker.)
The storage used by the resource manager is configurable via the yarn.resourceman
ager.store.class property. The default is org.apache.hadoop.yarn.server.resource
manager.recovery.MemStore, which keeps the store in memory, and is therefore not
highly available. However, there is a ZooKeeper-based store in the works that will
support reliable recovery from resource manager failures in the future.
Job Scheduling
Early versions of Hadoop had a very simple approach to scheduling users' jobs: they
ran in order of submission, using a FIFO scheduler. Typically, each job would use the
whole cluster, so jobs had to wait their turn. Although a shared cluster offers great
potential for offering large resources to many users, the problem of sharing resources
fairly between users requires a better scheduler. Production jobs need to complete in a
timely manner, while allowing users who are making smaller ad hoc queries to get
results back in a reasonable time.
Later on, the ability to set a job's priority was added, via the mapred.job.priority
property or the setJobPriority() method on JobClient (both of which take one of the
values VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW). When the job scheduler is choosing the
next job to run, it selects one with the highest priority. However, with the FIFO
scheduler, priorities do not support preemption, so a high-priority job can still be
blocked by a long-running, low-priority job that started before the high-priority job was
scheduled.
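The selection rule just described (highest priority wins, ties broken by submission order, and no preemption of a job already running) can be sketched in a few lines; the job representation here is invented for illustration and is not Hadoop's data structure:

```python
# Sketch of FIFO scheduling with priorities: among waiting jobs, pick the
# highest priority; ties go to the earliest submission. A running job is
# never killed (no preemption), so this only chooses the *next* job.
PRIORITIES = ["VERY_LOW", "LOW", "NORMAL", "HIGH", "VERY_HIGH"]

def next_job(waiting_jobs):
    """waiting_jobs: list of (submission_order, priority) tuples."""
    return max(waiting_jobs,
               key=lambda job: (PRIORITIES.index(job[1]), -job[0]))

jobs = [(0, "NORMAL"), (1, "VERY_HIGH"), (2, "VERY_HIGH")]
print(next_job(jobs))  # the earliest-submitted VERY_HIGH job
```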
MapReduce in Hadoop comes with a choice of schedulers. The default in MapReduce
1 is the original FIFO queue-based scheduler, and there are also multiuser schedulers
called the Fair Scheduler and the Capacity Scheduler.
MapReduce 2 comes with the Capacity Scheduler (the default), and the FIFO scheduler.
204 | Chapter 6: How MapReduce Works
The Fair Scheduler
The Fair Scheduler aims to give every user a fair share of the cluster capacity over time.
If a single job is running, it gets all of the cluster. As more jobs are submitted, free task
slots are given to the jobs in such a way as to give each user a fair share of the cluster.
A short job belonging to one user will complete in a reasonable time even while another
user's long job is running, and the long job will still make progress.
Jobs are placed in pools, and by default, each user gets their own pool. A user who
submits more jobs than a second user will not get any more cluster resources than the
second, on average. It is also possible to define custom pools with guaranteed minimum
capacities defined in terms of the number of map and reduce slots, and to set weightings
for each pool.
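A pool with a guaranteed minimum and a weighting is declared in the Fair Scheduler's allocation file; the pool name and the numbers below are made up for illustration, and the element names are those the MR1 Fair Scheduler is understood to accept:

```xml
<?xml version="1.0"?>
<allocations>
  <pool name="production">      <!-- hypothetical pool name -->
    <minMaps>10</minMaps>       <!-- guaranteed minimum map slots -->
    <minReduces>5</minReduces>  <!-- guaranteed minimum reduce slots -->
    <weight>2.0</weight>        <!-- twice the share of a weight-1.0 pool -->
  </pool>
</allocations>
```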
The Fair Scheduler supports preemption, so if a pool has not received its fair share for
a certain period of time, then the scheduler will kill tasks in pools running over capacity
in order to give the slots to the pool running under capacity.
The Fair Scheduler is a "contrib" module. To enable it, place its JAR file on Hadoop's
classpath, by copying it from Hadoop's contrib/fairscheduler directory to the lib direc-
tory. Then set the mapred.jobtracker.taskScheduler property to the scheduler's class
name, org.apache.hadoop.mapred.FairScheduler.
The Fair Scheduler will work without further configuration, but to take full advantage
of its features and learn how to configure it (including its web interface), refer to the
README file in the src/contrib/fairscheduler directory of the distribution.
The Capacity Scheduler
The Capacity Scheduler takes a slightly different approach to multiuser scheduling. A
cluster is made up of a number of queues (like the Fair Scheduler's pools), which may
be hierarchical (so a queue may be the child of another queue), and each queue has an
allocated capacity. This is like the Fair Scheduler, except that within each queue, jobs
are scheduled using FIFO scheduling (with priorities). In effect, the Capacity Scheduler
allows users or organizations (defined using queues) to simulate a separate MapReduce
cluster with FIFO scheduling for each user or organization. The Fair Scheduler, by
contrast (which actually also supports FIFO job scheduling within pools as an option,
making it like the Capacity Scheduler), enforces fair sharing within each pool, so running
jobs share the pool's resources.
Shuffle and Sort
MapReduce makes the guarantee that the input to every reducer is sorted by key. The
process by which the system performs the sort, and transfers the map outputs to the
reducers as inputs, is known as the shuffle. In this section, we look at how the shuffle
works, as a basic understanding would be helpful, should you need to optimize a Map-
Reduce program. The shuffle is an area of the codebase where refinements and
improvements are continually being made, so the following description necessarily
conceals many details (and may change over time; this description is for version 0.20). In many
ways, the shuffle is the heart of MapReduce and is where the "magic" happens.
The Map Side
When the map function starts producing output, it is not simply written to disk. The
process is more involved, and takes advantage of buffering writes in memory and doing
some presorting for efficiency reasons. Figure 6-6 shows what happens.
Each map task has a circular memory buffer that it writes the output to. The buffer is
100 MB by default, a size which can be tuned by changing the io.sort.mb property.
When the contents of the buffer reaches a certain threshold size (io.sort.spill.per
cent, default 0.80, or 80%), a background thread will start to spill the contents to disk.
Map outputs will continue to be written to the buffer while the spill takes place, but if
the buffer fills up during this time, the map will block until the spill is complete.
Spills are written in round-robin fashion to the directories specified by the
mapred.local.dir property, in a job-specific subdirectory.
Figure 6-6. Shuffle and sort in MapReduce
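As a back-of-the-envelope illustration of the interaction between these two properties (this is a simplified model, not Hadoop code), the following estimates how many spill files a map task would produce for a given amount of output, assuming each spill drains one threshold's worth of data:

```python
import math

# Simplified model of the map-side buffer: a spill starts each time the
# buffer holds io.sort.mb * io.sort.spill.percent of data (80 MB with
# the defaults), and a final spill flushes whatever remains.
def spill_count(map_output_mb, io_sort_mb=100, spill_percent=0.80):
    threshold = io_sort_mb * spill_percent
    return max(1, math.ceil(map_output_mb / threshold))

print(spill_count(200))  # 200 MB of map output -> 3 spill files
```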
Before it writes to disk, the thread first divides the data into partitions corresponding
to the reducers that they will ultimately be sent to. Within each partition, the back-
ground thread performs an in-memory sort by key, and if there is a combiner function,
it is run on the output of the sort. Running the combiner function makes for a more
compact map output, so there is less data to write to local disk and to transfer to the
reducer.

6. The term shuffle is actually imprecise, since in some contexts it refers to only the part of the process where
map outputs are fetched by reduce tasks. In this section, we take it to mean the whole process from the
point where a map produces output to where a reduce consumes input.
Each time the memory buffer reaches the spill threshold, a new spill file is created, so
after the map task has written its last output record there could be several spill files.
Before the task is finished, the spill files are merged into a single partitioned and sorted
output file. The configuration property io.sort.factor controls the maximum number
of streams to merge at once; the default is 10.
If there are at least three spill files (set by the min.num.spills.for.combine property)
then the combiner is run again before the output file is written. Recall that combiners
may be run repeatedly over the input without affecting the final result. If there are only
one or two spills, then the potential reduction in map output size is not worth the
overhead in invoking the combiner, so it is not run again for this map output.
It is often a good idea to compress the map output as it is written to disk, since doing
so makes it faster to write to disk, saves disk space, and reduces the amount of data to
transfer to the reducer. By default, the output is not compressed, but it is easy to enable
by setting mapred.compress.map.output to true. The compression library to use is speci-
fied by mapred.map.output.compression.codec; see "Compression" on page 85 for more
on compression formats.
The output file's partitions are made available to the reducers over HTTP. The maxi-
mum number of worker threads used to serve the file partitions is controlled by the
tasktracker.http.threads property; this setting is per tasktracker, not per map task
slot. The default of 40 may need increasing for large clusters running large jobs. In
MapReduce 2, this property is not applicable since the maximum number of threads
used is set automatically based on the number of processors on the machine. (Map-
Reduce 2 uses Netty, which by default allows up to twice as many threads as there are
processors.)
The Reduce Side
Let's turn now to the reduce part of the process. The map output file is sitting on the
local disk of the machine that ran the map task (note that although map outputs always
get written to local disk, reduce outputs may not be), but now it is needed by the
machine that is about to run the reduce task for the partition. Furthermore, the reduce
task needs the map output for its particular partition from several map tasks across the
cluster. The map tasks may finish at different times, so the reduce task starts copying
their outputs as soon as each completes. This is known as the copy phase of the reduce
task. The reduce task has a small number of copier threads so that it can fetch map
outputs in parallel. The default is five threads, but this number can be changed by
setting the mapred.reduce.parallel.copies property.
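The copy phase can be pictured as a small thread pool pulling map outputs as they become available; everything in this sketch (the host names, the stand-in fetch function) is invented for illustration and no Hadoop code is involved:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(host):
    # Stand-in for an HTTP fetch of a map output partition from `host`.
    return f"map output from {host}"

def copy_phase(hosts, parallel_copies=5):
    # A pool of copier threads, sized like mapred.reduce.parallel.copies.
    with ThreadPoolExecutor(max_workers=parallel_copies) as pool:
        return list(pool.map(fetch, hosts))

outputs = copy_phase(["node1", "node2", "node3"])
print(len(outputs))  # 3
```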
How do reducers know which machines to fetch map output from?
As map tasks complete successfully, they notify their parent tasktracker
of the status update, which in turn notifies the jobtracker. (In MapRe-
duce 2, the tasks notify their application master directly.) These notifi-
cations are transmitted over the heartbeat communication mechanism
described earlier. Therefore, for a given job, the jobtracker (or applica-
tion master) knows the mapping between map outputs and hosts. A
thread in the reducer periodically asks the master for map output hosts
until it has retrieved them all.
Hosts do not delete map outputs from disk as soon as the first reducer
has retrieved them, as the reducer may subsequently fail. Instead, they
wait until they are told to delete them by the jobtracker (or application
master), which is after the job has completed.
The map outputs are copied to the reduce task JVM's memory if they are small enough
(the buffer's size is controlled by mapred.job.shuffle.input.buffer.percent, which
specifies the proportion of the heap to use for this purpose); otherwise, they are copied
to disk. When the in-memory buffer reaches a threshold size (controlled by
mapred.job.shuffle.merge.percent), or reaches a threshold number of map outputs
(mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is speci-
fied it will be run during the merge to reduce the amount of data written to disk.
As the copies accumulate on disk, a background thread merges them into larger, sorted
files. This saves some time merging later on. Note that any map outputs that were
compressed (by the map task) have to be decompressed in memory in order to perform
a merge on them.
When all the map outputs have been copied, the reduce task moves into the sort
phase (which should properly be called the merge phase, as the sorting was carried out
on the map side), which merges the map outputs, maintaining their sort ordering. This
is done in rounds. For example, if there were 50 map outputs, and the merge factor was
10 (the default, controlled by the io.sort.factor property, just like in the map's merge),
then there would be 5 rounds. Each round would merge 10 files into one, so at the end
there would be five intermediate files.
Rather than have a final round that merges these five files into a single sorted file, the
merge saves a trip to disk by directly feeding the reduce function in what is the last
phase: the reduce phase. This final merge can come from a mixture of in-memory and
on-disk segments.
The number of files merged in each round is actually more subtle than
this example suggests. The goal is to merge the minimum number of
files to get to the merge factor for the final round. So if there were 40
files, the merge would not merge 10 files in each of the four rounds to
get 4 files. Instead, the first round would merge only 4 files, and the
subsequent three rounds would merge the full 10 files. The 4 merged
files, and the 6 (as yet unmerged) files make a total of 10 files for the
final round. The process is illustrated in Figure 6-7.
Note that this does not change the number of rounds; it's just an opti-
mization to minimize the amount of data that is written to disk, since
the final round always merges directly into the reduce.
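The sizing rule in the note above can be captured in a few lines; this is a sketch of the arithmetic only, not the actual merge implementation, which differs in details:

```python
def merge_rounds(num_files, factor=10):
    """Return how many files are merged in each intermediate round so that
    the final round merges exactly `factor` (or fewer) segments."""
    rounds = []
    if num_files > factor:
        # The first round merges just enough files that every later round
        # can merge a full `factor` and still leave exactly `factor` files.
        first = (num_files - factor) % (factor - 1)
        rounds.append(first + 1 if first else factor)
        num_files -= rounds[0] - 1  # one merge replaces its inputs with 1 file
        while num_files > factor:
            rounds.append(factor)
            num_files -= factor - 1
    return rounds

print(merge_rounds(40))  # [4, 10, 10, 10]: merge 4 first, 10 files remain at the end
```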
Figure 6-7. Efficiently merging 40 file segments with a merge factor of 10
During the reduce phase, the reduce function is invoked for each key in the sorted
output. The output of this phase is written directly to the output filesystem, typically
HDFS. In the case of HDFS, since the tasktracker node (or node manager) is also run-
ning a datanode, the first block replica will be written to the local disk.
Configuration Tuning
We are now in a better position to understand how to tune the shuffle to improve
MapReduce performance. The relevant settings, which can be used on a per-job basis
(except where noted), are summarized in Tables 6-1 and 6-2, along with the defaults,
which are good for general-purpose jobs.
The general principle is to give the shuffle as much memory as possible. However, there
is a trade-off, in that you need to make sure that your map and reduce functions get
enough memory to operate. This is why it is best to write your map and reduce functions
to use as little memory as possible; certainly they should not use an unbounded
amount of memory (by avoiding accumulating values in a map, for example).
The amount of memory given to the JVMs in which the map and reduce tasks run is
set by the mapred.child.java.opts property. You should try to make this as large as
possible for the amount of memory on your task nodes; the discussion in "Mem-
ory" on page 305 goes through the constraints to consider.
On the map side, the best performance can be obtained by avoiding multiple spills to
disk; one is optimal. If you can estimate the size of your map outputs, then you can set
the io.sort.* properties appropriately to minimize the number of spills. In particular,
you should increase io.sort.mb if you can. There is a MapReduce counter ("Spilled
records"; see "Counters" on page 257) that counts the total number of records that
were spilled to disk over the course of a job, which can be useful for tuning. Note that
the counter includes both map- and reduce-side spills.
On the reduce side, the best performance is obtained when the intermediate data can
reside entirely in memory. By default, this does not happen, since for the general case
all the memory is reserved for the reduce function. But if your reduce function has light
memory requirements, then setting mapred.inmem.merge.threshold to 0 and
mapred.job.reduce.input.buffer.percent to 1.0 (or a lower value; see Table 6-2) may
bring a performance boost.
More generally, Hadoop uses a buffer size of 4 KB by default, which is low, so you
should increase this across the cluster (by setting io.file.buffer.size; see also "Other
Hadoop Properties" on page 315).
In April 2008, Hadoop won the general-purpose terabyte sort benchmark (described
in "TeraByte Sort on Apache Hadoop" on page 601), and one of the optimizations
used was this one of keeping the intermediate data in memory on the reduce side.
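For instance, raising these buffers cluster-wide might look like the following fragments (the values shown are illustrative, not recommendations):

```xml
<!-- mapred-site.xml: a larger sort buffer to reduce map-side spills -->
<property>
  <name>io.sort.mb</name>
  <value>200</value>
</property>

<!-- core-site.xml: a larger I/O buffer than the 4 KB default -->
<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
</property>
```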
Table 6-1. Map-side tuning properties

Property name                         Type        Default value
io.sort.mb                            int         100
    The size, in megabytes, of the memory buffer to use while sorting map output.
io.sort.record.percent                float       0.05
    The proportion of io.sort.mb reserved for storing record boundaries of the map
    outputs. The remaining space is used for the map output records themselves. This
    property was removed in release 0.21.0, as the shuffle code was improved to do a
    better job of using all the available memory for map output and accounting
    information.
io.sort.spill.percent                 float       0.80
    The threshold usage proportion for both the map output memory buffer and the
    record boundaries index to start the process of spilling to disk.
io.sort.factor                        int         10
    The maximum number of streams to merge at once when sorting files. This property
    is also used in the reduce. It's fairly common to increase this to 100.
min.num.spills.for.combine            int         3
    The minimum number of spill files needed for the combiner to run (if a combiner
    is specified).
mapred.compress.map.output            boolean     false
    Compress map outputs.
mapred.map.output.compression.codec   Class name  org.apache.hadoop.io.compress.DefaultCodec
    The compression codec to use for map outputs.
tasktracker.http.threads              int         40
    The number of worker threads per tasktracker for serving the map outputs to
    reducers. This is a cluster-wide setting and cannot be set by individual jobs.
    Not applicable in MapReduce 2.
Table 6-2. Reduce-side tuning properties

Property name                           Type    Default value
mapred.reduce.parallel.copies           int     5
    The number of threads used to copy map outputs to the reducer.
mapred.reduce.copy.backoff              int     300
    The maximum amount of time, in seconds, to spend retrieving one map output for
    a reducer before declaring it as failed. The reducer may repeatedly reattempt a
    transfer within this time if it fails (using exponential backoff).
io.sort.factor                          int     10
    The maximum number of streams to merge at once when sorting files. This property
    is also used in the map.
mapred.job.shuffle.input.buffer.percent float   0.70
    The proportion of total heap size to be allocated to the map outputs buffer during
    the copy phase of the shuffle.
mapred.job.shuffle.merge.percent        float   0.66
    The threshold usage proportion for the map outputs buffer (defined by
    mapred.job.shuffle.input.buffer.percent) for starting the process of merging the
    outputs and spilling to disk.
mapred.inmem.merge.threshold            int     1000
    The threshold number of map outputs for starting the process of merging the
    outputs and spilling to disk. A value of 0 or less means there is no threshold,
    and the spill behavior is governed solely by mapred.job.shuffle.merge.percent.
mapred.job.reduce.input.buffer.percent  float   0.0
    The proportion of total heap size to be used for retaining map outputs in memory
    during the reduce. For the reduce phase to begin, the size of map outputs in
    memory must be no more than this size. By default, all map outputs are merged to
    disk before the reduce begins, to give the reducers as much memory as possible.
    However, if your reducers require less memory, this value may be increased to
    minimize the number of trips to disk.
Task Execution
We saw how the MapReduce system executes tasks in the context of the overall job at
the beginning of the chapter in "Anatomy of a MapReduce Job Run" on page 187. In
this section, we'll look at some more controls that MapReduce users have over task
execution.
The Task Execution Environment
Hadoop provides information to a map or reduce task about the environment in which
it is running. For example, a map task can discover the name of the file it is processing
(see "File information in the mapper" on page 239), and a map or reduce task can find
out the attempt number of the task. The properties in Table 6-3 can be accessed from
the job's configuration, obtained in the old MapReduce API by providing an imple-
mentation of the configure() method for Mapper or Reducer, where the configuration
is passed in as an argument. In the new API these properties can be accessed from the
context object passed to all methods of the Mapper or Reducer.
Table 6-3. Task environment properties

Property name          Type
mapred.job.id          String
    The job ID. (See "Job, Task, and Task Attempt IDs" on page 163 for a description
    of the format.) Example: job_200811201130_0004
mapred.tip.id          String
    The task ID. Example: task_200811201130_0004_m_000003
mapred.task.id         String
    The task attempt ID. (Not the task ID.)
    Example: attempt_200811201130_0004_m_000003_0
mapred.task.partition  int
    The index of the task within the job. Example: 3
mapred.task.is.map     boolean
    Whether this task is a map task. Example: true
Streaming environment variables
Hadoop sets job configuration parameters as environment variables for Streaming pro-
grams. However, it replaces nonalphanumeric characters with underscores to make
sure they are valid names. The following Python expression illustrates how you can
retrieve the value of the mapred.job.id property from within a Python Streaming script:

os.environ["mapred_job_id"]
You can also set environment variables for the Streaming processes launched by Map-
Reduce by supplying the -cmdenv option to the Streaming launcher program (once for
each variable you wish to set). For example, the following sets the MAGIC_PARAMETER
environment variable:
-cmdenv MAGIC_PARAMETER=abracadabra
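A Streaming script could then pick both kinds of variable up from its environment; this fragment is a sketch (the variables are only set when the script actually runs under Streaming, hence the .get() fallbacks):

```python
import os

# Job configuration properties appear as environment variables, with
# non-alphanumeric characters replaced by underscores; -cmdenv variables
# appear under their own names.
job_id = os.environ.get("mapred_job_id", "unknown")
magic = os.environ.get("MAGIC_PARAMETER", "unset")
print(job_id, magic)
```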
Speculative Execution
The MapReduce model is to break jobs into tasks and run the tasks in parallel to make
the overall job execution time smaller than it would otherwise be if the tasks ran se-
quentially. This makes job execution time sensitive to slow-running tasks, as it takes
only one slow task to make the whole job take significantly longer than it would have
done otherwise. When a job consists of hundreds or thousands of tasks, the possibility
of a few straggling tasks is very real.
Tasks may be slow for various reasons, including hardware degradation or software
misconfiguration, but the causes may be hard to detect since the tasks still complete
successfully, albeit after a longer time than expected. Hadoop doesn't try to diagnose
and fix slow-running tasks; instead, it tries to detect when a task is running slower than
expected and launches another, equivalent, task as a backup. This is termed speculative
execution of tasks.
It's important to understand that speculative execution does not work by launching
two duplicate tasks at about the same time so they can race each other. This would be
wasteful of cluster resources. Rather, a speculative task is launched only after all the
tasks for a job have been launched, and then only for tasks that have been running for
some time (at least a minute) and have failed to make as much progress, on average, as
the other tasks from the job. When a task completes successfully, any duplicate tasks
that are running are killed since they are no longer needed. So if the original task com-
pletes before the speculative task, then the speculative task is killed; on the other hand,
if the speculative task finishes first, then the original is killed.
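A much-simplified version of that detection rule (consider a backup only for tasks well behind the average progress, once all tasks have been running for a while) might look like this; the progress scores and the lag threshold are invented for illustration:

```python
def speculation_candidates(progress, lag=0.2):
    """progress: task id -> progress score in [0.0, 1.0], for a job whose
    tasks have all been launched. A backup is considered only for tasks
    more than `lag` behind the mean progress of the job's tasks."""
    mean = sum(progress.values()) / len(progress)
    return [task for task, p in progress.items() if mean - p > lag]

tasks = {"m_000000": 0.9, "m_000001": 0.85, "m_000002": 0.3}
print(speculation_candidates(tasks))  # only the straggler: ['m_000002']
```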
Speculative execution is an optimization, not a feature to make jobs run more reliably.
If there are bugs that sometimes cause a task to hang or slow down, then relying on
speculative execution to avoid these problems is unwise, and won't work reliably, since
the same bugs are likely to affect the speculative task. You should fix the bug so that
the task doesn't hang or slow down.
Speculative execution is turned on by default. It can be enabled or disabled independ-
ently for map tasks and reduce tasks, on a cluster-wide basis, or on a per-job basis. The
relevant properties are shown in Table 6-4.
Table 6-4. Speculative execution properties

Property name                                   Type     Default value
mapred.map.tasks.speculative.execution          boolean  true
    Whether extra instances of map tasks may be launched if a task is making slow
    progress.
mapred.reduce.tasks.speculative.execution       boolean  true
    Whether extra instances of reduce tasks may be launched if a task is making slow
    progress.
yarn.app.mapreduce.am.job.speculator.class      Class    org.apache.hadoop.mapreduce.v2.app.speculate.DefaultSpeculator
    (MapReduce 2 only) The Speculator class implementing the speculative execution
    policy.
yarn.app.mapreduce.am.job.task.estimator.class  Class    org.apache.hadoop.mapreduce.v2.app.speculate.LegacyTaskRuntimeEstimator
    (MapReduce 2 only) An implementation of TaskRuntimeEstimator that provides
    estimates for task runtimes, and used by Speculator instances.
Why would you ever want to turn off speculative execution? The goal of speculative
execution is to reduce job execution time, but this comes at the cost of cluster efficiency.
On a busy cluster, speculative execution can reduce overall throughput, since redun-
dant tasks are being executed in an attempt to bring down the execution time for a
single job. For this reason, some cluster administrators prefer to turn it off on the cluster
and have users explicitly turn it on for individual jobs. This was especially relevant for
older versions of Hadoop, when speculative execution could be overly aggressive in
scheduling speculative tasks.
There is a good case for turning off speculative execution for reduce tasks, since any
duplicate reduce tasks have to fetch the same map outputs as the original task, and this
can significantly increase network traffic on the cluster.
Another reason that speculative execution is turned off is for tasks that are not idem-
potent. However, in many cases it is possible to write tasks to be idempotent and use
an OutputCommitter to promote the output to its final location when the task succeeds.
This technique is explained in more detail in the next section.
Output Committers
Hadoop MapReduce uses a commit protocol to ensure that jobs and tasks either suc-
ceed, or fail cleanly. The behavior is implemented by the OutputCommitter in use for the
job, and this is set in the old MapReduce API by calling the setOutputCommitter() on
JobConf, or by setting mapred.output.committer.class in the configuration. In the new
MapReduce API, the OutputCommitter is determined by the OutputFormat, via its getOut
putCommitter() method. The default is FileOutputCommitter, which is appropriate for
file-based MapReduce. You can customize an existing OutputCommitter or even write a
new implementation if you need to do special setup or cleanup for jobs or tasks.
The OutputCommitter API is as follows (in both old and new MapReduce APIs):
public abstract class OutputCommitter {

  public abstract void setupJob(JobContext jobContext) throws IOException;

  public void commitJob(JobContext jobContext) throws IOException { }

  public void abortJob(JobContext jobContext, JobStatus.State state)
      throws IOException { }

  public abstract void setupTask(TaskAttemptContext taskContext)
      throws IOException;

  public abstract boolean needsTaskCommit(TaskAttemptContext taskContext)
      throws IOException;

  public abstract void commitTask(TaskAttemptContext taskContext)
      throws IOException;

  public abstract void abortTask(TaskAttemptContext taskContext)
      throws IOException;
}
The setupJob() method is called before the job is run, and is typically used to perform
initialization. For FileOutputCommitter the method creates the final output directory,
${mapred.output.dir}, and a temporary working space for task output.
If the job succeeds then the commitJob() method is called, which in the default file-
based implementation deletes the temporary working space, and creates a hidden
empty marker file in the output directory called _SUCCESS to indicate to filesystem
clients that the job completed successfully. If the job did not succeed, then abort
Job() is called with a state object indicating whether the job failed or was killed (by a
user, for example). In the default implementation this will delete the job's temporary
working space.
The operations are similar at the task level. The setupTask() method is called before
the task is run, and the default implementation doesn't do anything, since temporary
directories named for task outputs are created when the task outputs are written.
The commit phase for tasks is optional, and may be disabled by returning false from
needsTaskCommit(). This saves the framework from having to run the distributed com-
mit protocol for the task, and neither commitTask() nor abortTask() is called. FileOut
putCommitter will skip the commit phase when no output has been written by a task.
If a task succeeds then commitTask() is called, which in the default implementation
moves the temporary task output directory (which has the task attempt ID in its name
to avoid conflicts between task attempts) to the final output path, ${mapred.out
put.dir}. Otherwise, the framework calls abortTask(), which deletes the temporary
task output directory.
The framework ensures that in the event of multiple task attempts for a particular task,
only one will be committed, and the others will be aborted. This situation may arise
because the first attempt failed for some reason, in which case it would be aborted,
and a later, successful attempt would be committed. Another case is if two task attempts
were running concurrently as speculative duplicates; then the one that finished first
would be committed, and the other would be aborted.
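The promote-on-commit idea is easy to see in miniature; this sketch mimics the spirit of the file-based committer with ordinary directories (the paths, names, and flow are invented, and no Hadoop code is involved):

```python
import os
import shutil
import tempfile

# Each task attempt writes to its own attempt directory; committing
# promotes the output into the final directory, aborting just deletes
# the attempt directory.
def run_attempt(output_dir, attempt_id, data, succeed):
    attempt_dir = os.path.join(output_dir, "_attempts", attempt_id)
    os.makedirs(attempt_dir)
    with open(os.path.join(attempt_dir, "part-00000"), "w") as f:
        f.write(data)
    if succeed:  # "commitTask": move the output into place
        shutil.move(os.path.join(attempt_dir, "part-00000"),
                    os.path.join(output_dir, "part-00000"))
    shutil.rmtree(attempt_dir)  # "abortTask"/cleanup: drop the attempt dir

out = tempfile.mkdtemp()
run_attempt(out, "attempt_0", "bad", succeed=False)  # first attempt fails
run_attempt(out, "attempt_1", "good", succeed=True)  # retry commits
with open(os.path.join(out, "part-00000")) as f:
    print(f.read())  # only the committed attempt's output survives
```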
Task side-effect files
The usual way of writing output from map and reduce tasks is by using the OutputCol
lector to collect key-value pairs. Some applications need more flexibility than a single
key-value pair model, so these applications write output files directly from the map or
reduce task to a distributed filesystem, like HDFS. (There are other ways to produce
multiple outputs, too, as described in "Multiple Outputs" on page 251.)
Care needs to be taken to ensure that multiple instances of the same task don't try to
write to the same file. As we saw in the previous section, the OutputCommitter protocol
solves this problem. If applications write side files in their tasks' working directories,
then the side files for tasks that successfully complete will be promoted to the output
directory automatically, while failed tasks will have their side files deleted.
A task may find its working directory by retrieving the value of the mapred.work.out
put.dir property from its configuration file. Alternatively, a MapReduce program using
the Java API may call the getWorkOutputPath() static method on FileOutputFormat to
get the Path object representing the working directory. The framework creates the
working directory before executing the task, so you don't need to create it.
To take a simple example, imagine a program for converting image files from one format
to another. One way to do this is to have a map-only job, where each map is given a
set of images to convert (perhaps using NLineInputFormat; see "NLineInputFor-
mat" on page 245). If a map task writes the converted images into its working directory,
then they will be promoted to the output directory when the task successfully finishes.
Task JVM Reuse
Hadoop runs tasks in their own Java Virtual Machine to isolate them from other run-
ning tasks. The overhead of starting a new JVM for each task can take around a second,
which for jobs that run for a minute or so is insignificant. However, jobs that have a
large number of very short-lived tasks (these are usually map tasks), or that have lengthy
initialization, can see performance gains when the JVM is reused for subsequent tasks.
Note that, with task JVM reuse enabled, tasks are not run concurrently in a single JVM;
rather, the JVM runs tasks sequentially. Tasktrackers can, however, run more than one
task at a time, but this is always done in separate JVMs. The properties for controlling
the tasktrackers' number of map task slots and reduce task slots are discussed in
"Memory" on page 305.
The property for controlling task JVM reuse is mapred.job.reuse.jvm.num.tasks: it
specifies the maximum number of tasks to run for a given job for each JVM launched;
the default is 1 (see Table 6-5). No distinction is made between map or reduce tasks,
however tasks from different jobs are always run in separate JVMs. The method set
NumTasksToExecutePerJvm() on JobConf can also be used to configure this property.
Table 6-5. Task JVM reuse properties

mapred.job.reuse.jvm.num.tasks (int, default 1): The maximum number of tasks to run for a given job for each JVM on a tasktracker. A value of -1 indicates no limit: the same JVM may be used for all tasks for a job.
Tasks that are CPU-bound may also benefit from task JVM reuse by taking advantage of runtime optimizations applied by the HotSpot JVM. After running for a while, the HotSpot JVM builds up enough information to detect performance-critical sections in the code and dynamically translates the Java bytecodes of these hot spots into native machine code. This works well for long-running processes, but JVMs that run for seconds or a few minutes may not gain the full benefit of HotSpot. In these cases, it is worth enabling task JVM reuse.
Another place where a shared JVM is useful is for sharing state between the tasks of a job. By storing reference data in a static field, tasks get rapid access to the shared data.
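The static-field technique relies on the fact that a reused JVM runs a job's tasks sequentially in the same process, so a static field survives from one task to the next. A plain-Java sketch (the loader and the station data are hypothetical, not part of any Hadoop API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Cache expensive-to-load reference data in a static field so that only
// the first task in a reused JVM pays the loading cost.
public class ReferenceDataCache {
    private static Map<String, String> lookup;  // shared across tasks in this JVM
    private static int loads = 0;               // how many times we actually loaded

    public static synchronized Map<String, String> get() {
        if (lookup == null) {  // first task in this JVM loads the data
            lookup = new ConcurrentHashMap<>();
            lookup.put("011990-99999", "SIHCCAJAVRI");  // e.g. station metadata
            loads++;
        }
        return lookup;         // subsequent tasks get it for free
    }

    public static int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        get(); get(); get();              // three "tasks" running in one JVM
        System.out.println(loadCount());  // 1: the data was loaded only once
    }
}
```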
Skipping Bad Records
Large datasets are messy. They often have corrupt records. They often have records that are in a different format. They often have missing fields. In an ideal world, your code would cope gracefully with all of these conditions. In practice, it is often expedient to ignore the offending records. Depending on the analysis being performed, if only a small percentage of records are affected, then skipping them may not significantly affect the result. However, if a task trips up when it encounters a bad record (by throwing a runtime exception), then the task fails. Failing tasks are retried (since the failure may be due to hardware failure or some other reason outside the task's control), but if a task fails four times, then the whole job is marked as failed (see "Task Failure" on page 200). If it is the data that is causing the task to throw an exception, rerunning the task won't help, since it will fail in exactly the same way each time.
7. JVM reuse is not supported in MapReduce 2.
If you are using TextInputFormat ("TextInputFormat" on page 244), then you can set a maximum expected line length to safeguard against corrupted files. Corruption in a file can manifest itself as a very long line, which can cause out-of-memory errors and then task failure. By setting mapred.linerecordreader.maxlength to a value in bytes that fits in memory (and is comfortably greater than the length of lines in your input data), the record reader will skip the (long) corrupt lines without the task failing.
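The safeguard can be pictured with a simplified reader that drops over-long lines instead of failing. Note this is only an illustration of the policy: Hadoop's LineRecordReader enforces the limit at the byte level while reading, whereas this toy version (readLine() still buffers the whole line) merely shows the skip decision:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Skip lines longer than a configured maximum, as mapred.linerecordreader
// .maxlength does, rather than letting one corrupt line kill the task.
public class MaxLengthLineReader {
    public static List<String> readLines(BufferedReader in, int maxLength)
            throws IOException {
        List<String> lines = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {
            if (line.length() > maxLength) {
                continue;  // corrupt/overlong line: skip instead of failing
            }
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        String data = "good line\n" + "x".repeat(10_000) + "\nanother good line\n";
        List<String> lines =
                readLines(new BufferedReader(new StringReader(data)), 1000);
        System.out.println(lines.size());  // 2: the corrupt line was skipped
    }
}
```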
The best way to handle corrupt records is in your mapper or reducer code. You can detect the bad record and ignore it, or you can abort the job by throwing an exception. You can also count the total number of bad records in the job using counters to see how widespread the problem is.
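The detect-ignore-count pattern can be sketched outside MapReduce. In a real mapper the count would go to a Hadoop counter (via something like context.getCounter(...).increment(1)); here a plain field stands in for it, and the integer-parsing "records" are hypothetical:

```java
import java.util.List;

// Parse each record, ignore the ones that throw, and count how many
// were bad so the damage can be assessed after the job.
public class BadRecordTolerantParser {
    private int badRecords = 0;  // stand-in for a MapReduce counter

    public int parseAll(List<String> records) {
        int sum = 0;
        for (String record : records) {
            try {
                sum += Integer.parseInt(record.trim());
            } catch (NumberFormatException e) {
                badRecords++;  // detect and ignore, rather than fail the task
            }
        }
        return sum;
    }

    public int badRecordCount() {
        return badRecords;
    }

    public static void main(String[] args) {
        BadRecordTolerantParser p = new BadRecordTolerantParser();
        int sum = p.parseAll(List.of("1", "2", "oops", "3"));
        System.out.println(sum + " " + p.badRecordCount());  // 6 1
    }
}
```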
In rare cases, though, you can't handle the problem because there is a bug in a third-party library that you can't work around in your mapper or reducer. In these cases, you can use Hadoop's optional skipping mode for automatically skipping bad records. When skipping mode is enabled, tasks report the records being processed back to the tasktracker. When the task fails, the tasktracker retries the task, skipping the records that caused the failure. Because of the extra network traffic and bookkeeping to maintain the failed record ranges, skipping mode is turned on for a task only after it has failed twice.
Thus, for a task consistently failing on a bad record, the tasktracker runs the following task attempts with these outcomes:
1. Task fails.
2. Task fails.
3. Skipping mode is enabled. Task fails, but the failed record is stored by the tasktracker.
4. Skipping mode is still enabled. Task succeeds by skipping the bad record that failed in the previous attempt.
Skipping mode is off by default; you enable it independently for map and reduce tasks using the SkipBadRecords class. It's important to note that skipping mode can detect only one bad record per task attempt, so this mechanism is appropriate only for detecting occasional bad records (a few per task, say). You may need to increase the maximum number of task attempts (via mapred.map.max.attempts and mapred.reduce.max.attempts) to give skipping mode enough attempts to detect and skip all the bad records in an input split.
Bad records that have been detected by Hadoop are saved as sequence files in the job's output directory under the _logs/skip subdirectory. These can be inspected for diagnostic purposes after the job has completed (using hadoop fs -text, for example).
8. Skipping mode is not supported in the new MapReduce API. See https://issues.apache.org/jira/browse/
MapReduce Types and Formats
MapReduce has a simple model of data processing: inputs and outputs for the map and reduce functions are key-value pairs. This chapter looks at the MapReduce model in detail and, in particular, how data in various formats, from simple text to structured binary objects, can be used with this model.
MapReduce Types
The map and reduce functions in Hadoop MapReduce have the following general form:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
In general, the map input key and value types (K1 and V1) are different from the map output types (K2 and V2). However, the reduce input must have the same types as the map output, although the reduce output types may be different again (K3 and V3). The Java API mirrors this general form:
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  public class Context extends MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    // ...
  }
  protected void map(KEYIN key, VALUEIN value,
      Context context) throws IOException, InterruptedException {
    // ...
  }
}

public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  public class Context extends ReducerContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    // ...
  }
  protected void reduce(KEYIN key, Iterable<VALUEIN> values,
      Context context) throws IOException, InterruptedException {
    // ...
  }
}
The context objects are used for emitting key-value pairs, so they are parameterized by the output types; the signature of the write() method is:
public void write(KEYOUT key, VALUEOUT value)
    throws IOException, InterruptedException
Since Mapper and Reducer are separate classes, the type parameters have different scopes, and the actual type argument of KEYIN (say) in the Mapper may be different from the type of the type parameter of the same name (KEYIN) in the Reducer. For instance, in the maximum temperature example from earlier chapters, KEYIN is replaced by LongWritable for the Mapper, and by Text for the Reducer.
Similarly, even though the map output types and the reduce input types must match, this is not enforced by the Java compiler.
The type parameters are named differently from the abstract types (KEYIN versus K1, and so on), but the form is the same.
If a combine function is used, then it is the same form as the reduce function (and is an implementation of Reducer), except its output types are the intermediate key and value types (K2 and V2), so they can feed the reduce function:
map: (K1, V1) → list(K2, V2)
combine: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
Often the combine and reduce functions are the same, in which case K3 is the same as K2, and V3 is the same as V2.
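A concrete case of "combine and reduce are the same function" is summing per-word counts. Because the combine output types (K2, V2) equal the reduce output types (K3, V3), the same sum can run over map-side partial lists and again over the merged lists without changing the result. This plain-Java sketch shows the idea, not the Hadoop API:

```java
import java.util.List;

// One function that works both as the combine function and as the reduce
// function for a word count: it maps (word, list of counts) to a total.
public class WordCountCombine {
    public static int sum(List<Integer> counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        // Map-side combine of two mappers' partial values for one key:
        int partial1 = sum(List.of(1, 1, 1));  // 3
        int partial2 = sum(List.of(1, 1));     // 2
        // Reducing the combined values gives the same answer as reducing
        // over all five original values:
        System.out.println(sum(List.of(partial1, partial2)));  // 5
    }
}
```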
The partition function operates on the intermediate key and value types (K2 and V2) and returns the partition index. In practice, the partition is determined solely by the key (the value is ignored):
partition: (K2, V2) → integer
Or in Java:
public abstract class Partitioner<KEY, VALUE> {
  public abstract int getPartition(KEY key, VALUE value, int numPartitions);
}
MapReduce signatures in the old API
In the old API, the signatures are very similar and actually name the type parameters K1, V1, and so on, although the constraints on the types are exactly the same in both old and new APIs:
public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {
  void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
      throws IOException;
}

public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable {
  void reduce(K2 key, Iterator<V2> values,
      OutputCollector<K3, V3> output, Reporter reporter) throws IOException;
}

public interface Partitioner<K2, V2> extends JobConfigurable {
  int getPartition(K2 key, V2 value, int numPartitions);
}
So much for the theory; how does this help you configure MapReduce jobs? Table 7-1 summarizes the configuration options for the new API (and Table 7-2 does the same for the old API). It is divided into the properties that determine the types and those that have to be compatible with the configured types.
Input types are set by the input format. So, for instance, a TextInputFormat generates keys of type LongWritable and values of type Text. The other types are set explicitly by calling the methods on the Job (or JobConf in the old API). If not set explicitly, the intermediate types default to the (final) output types, which default to LongWritable and Text. So if K2 and K3 are the same, you don't need to call setMapOutputKeyClass(), since it falls back to the type set by calling setOutputKeyClass(). Similarly, if V2 and V3 are the same, you only need to use setOutputValueClass().
It may seem strange that these methods for setting the intermediate and final output types exist at all. After all, why can't the types be determined from a combination of the mapper and the reducer? The answer is that it's to do with a limitation in Java generics: type erasure means that the type information isn't always present at runtime, so Hadoop has to be given it explicitly. This also means that it's possible to configure a MapReduce job with incompatible types, because the configuration isn't checked at compile time. The settings that have to be compatible with the MapReduce types are listed in the lower part of Table 7-1. Type conflicts are detected at runtime during job execution, and for this reason it is wise to run a test job using a small amount of data to flush out and fix any type incompatibilities.
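The erasure limitation is easy to reproduce in plain Java: a type mismatch smuggled past the compiler through a non-generic view only surfaces at runtime as a ClassCastException, just as MapReduce type conflicts surface only when the job runs. A small demonstration (not Hadoop code):

```java
import java.util.ArrayList;
import java.util.List;

// Why job type configuration can't be checked at compile time: with
// erasure, a raw (untyped) view lets the wrong element type in, and the
// error appears only when the typed code later reads it back.
public class ErasureDemo {
    @SuppressWarnings({"unchecked", "rawtypes"})
    public static boolean mismatchDetectedAtRuntime() {
        List<Integer> ints = new ArrayList<>();
        List raw = ints;         // erasure: the raw view has no element type
        raw.add("not an int");   // compiles fine; nothing to check here
        try {
            Integer i = ints.get(0);  // the implicit cast fails at runtime
            return false;
        } catch (ClassCastException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(mismatchDetectedAtRuntime());  // true
    }
}
```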
The Default MapReduce Job
What happens when you run MapReduce without setting a mapper or a reducer? Let's try it by running this minimal MapReduce program:
public class MinimalMapReduce extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.printf("Usage: %s [generic options] <input> <output>\n",
          getClass().getSimpleName());
      return -1;
    }
    Job job = new Job(getConf());
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MinimalMapReduce(), args);
    System.exit(exitCode);
  }
}
The only configuration that we set is an input path and an output path. We run it over a subset of our weather data with the following:
% hadoop MinimalMapReduce "input/ncdc/all/190{1,2}.gz" output
We do get some output: one file named part-r-00000 in the output directory. Here's what the first few lines look like (truncated to fit the page):
Each line is an integer followed by a tab character, followed by the original weather data record. Admittedly, it's not a very useful program, but understanding how it produces its output does provide some insight into the defaults that Hadoop uses when running MapReduce jobs. Example 7-1 shows a program that has exactly the same effect as MinimalMapReduce, but explicitly sets the job settings to their defaults.
Example 7-1. A minimal MapReduce driver, with the defaults explicitly set
public class MinimalMapReduceWithDefaults extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }

    job.setInputFormatClass(TextInputFormat.class);

    job.setMapperClass(Mapper.class);

    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);

    job.setPartitionerClass(HashPartitioner.class);

    job.setNumReduceTasks(1);
    job.setReducerClass(Reducer.class);

    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    job.setOutputFormatClass(TextOutputFormat.class);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MinimalMapReduceWithDefaults(), args);
    System.exit(exitCode);
  }
}
We've simplified the first few lines of the run() method by extracting the logic for printing usage and setting the input and output paths into a helper method. Almost all MapReduce drivers take these two arguments (input and output), so reducing the boilerplate code here is a good thing. Here are the relevant methods in the JobBuilder class for reference:
public static Job parseInputAndOutput(Tool tool, Configuration conf,
    String[] args) throws IOException {

  if (args.length != 2) {
    printUsage(tool, "<input> <output>");
    return null;
  }
  Job job = new Job(conf);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  return job;
}

public static void printUsage(Tool tool, String extraArgsUsage) {
  System.err.printf("Usage: %s [genericOptions] %s\n\n",
      tool.getClass().getSimpleName(), extraArgsUsage);
}
Going back to MinimalMapReduceWithDefaults in Example 7-1, although there are many other default job settings, the settings shown are those most central to running a job. Let's go through them in turn.
The default input format is TextInputFormat, which produces keys of type LongWritable (the offset of the beginning of the line in the file) and values of type Text (the line of text). This explains where the integers in the final output come from: they are the line offsets.
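Those offsets can be computed directly: each key is the byte position at which the line starts. A plain-Java sketch, working on an in-memory "file" rather than HDFS:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Compute the byte offset of the start of each line, which is what
// TextInputFormat emits as the LongWritable key for that line.
public class LineOffsets {
    public static List<Long> offsets(String file) {
        List<Long> result = new ArrayList<>();
        long offset = 0;
        for (String line : file.split("\n")) {
            result.add(offset);
            // advance past the line's bytes plus the '\n' terminator
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(offsets("a\nbb\nccc\n"));  // [0, 2, 5]
    }
}
```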
The default mapper is just the Mapper class, which writes the input key and value unchanged to the output:
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  protected void map(KEYIN key, VALUEIN value,
      Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}
Mapper is a generic type, which allows it to work with any key or value types. In this case, the map input and output key is of type LongWritable and the map input and output value is of type Text.
The default partitioner is HashPartitioner, which hashes a record's key to determine which partition the record belongs in. Each partition is processed by a reduce task, so the number of partitions is equal to the number of reduce tasks for the job:
public class HashPartitioner<K, V> extends Partitioner<K, V> {
  public int getPartition(K key, V value,
      int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
The key's hash code is turned into a nonnegative integer by bitwise ANDing it with the largest integer value. It is then reduced modulo the number of partitions to find the index of the partition that the record belongs in.
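The same two steps can be tried in plain Java. One detail worth noting: the bitwise AND with Integer.MAX_VALUE clears the sign bit, which (unlike Math.abs) is also safe for a hash of Integer.MIN_VALUE:

```java
// The HashPartitioner computation, outside Hadoop: clear the sign bit,
// then take the result modulo the number of reduce tasks.
public class PartitionDemo {
    public static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Equal keys always land in the same partition:
        System.out.println(
                getPartition("1949", 4) == getPartition("1949", 4));  // true
        // With a single reducer, every key lands in partition 0:
        System.out.println(getPartition("1950", 1));  // 0
    }
}
```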
By default, there is a single reducer, and therefore a single partition, so the action of the partitioner is irrelevant in this case since everything goes into one partition. However, it is important to understand the behavior of HashPartitioner when you have more than one reduce task. Assuming the key's hash function is a good one, the records will be evenly allocated across reduce tasks, with all records sharing the same key being processed by the same reduce task.
You may have noticed that we didn't set the number of map tasks. The reason is that the number is equal to the number of splits that the input is turned into, which is driven by the size of the input and the file's block size (if the file is in HDFS). The options for controlling split size are discussed in "FileInputFormat input splits" on page 236.
Choosing the Number of Reducers
The single-reducer default is something of a gotcha for new users of Hadoop. Almost all real-world jobs should set this to a larger number; otherwise, the job will be very slow since all the intermediate data flows through a single reduce task. (Note that when running under the local job runner, only zero or one reducers are supported.)
The optimal number of reducers is related to the total number of available reducer slots in your cluster. The total number of slots is found by multiplying the number of nodes in the cluster by the number of slots per node (which is determined by the value of the mapred.tasktracker.reduce.tasks.maximum property, described in "Environment Settings" on page 305).
One common setting is to have slightly fewer reducers than total slots, which gives one wave of reduce tasks (and tolerates a few failures without extending job execution time). If your reduce tasks are very big, then it makes sense to have a larger number of reducers (resulting in two waves, for example) so that the tasks are more fine-grained and failure doesn't affect job execution time significantly.
The default reducer is Reducer, again a generic type, which simply writes all its input to its output:
public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  protected void reduce(KEYIN key, Iterable<VALUEIN> values,
      Context context) throws IOException, InterruptedException {
    for (VALUEIN value : values) {
      context.write((KEYOUT) key, (VALUEOUT) value);
    }
  }
}
For this job, the output key is LongWritable, and the output value is Text. In fact, all the keys for this MapReduce program are LongWritable, and all the values are Text, since these are the input keys and values, and the map and reduce functions are both identity functions, which by definition preserve type. Most MapReduce programs, however, don't use the same key or value types throughout, so you need to configure the job to declare the types you are using, as described in the previous section.
Records are sorted by the MapReduce system before being presented to the reducer. In this case, the keys are sorted numerically, which has the effect of interleaving the lines from the input files into one combined output file.
The default output format is TextOutputFormat, which writes out records, one per line, by converting keys and values to strings and separating them with a tab character. This is why the output is tab-separated: it is a feature of TextOutputFormat.
The default Streaming job
In Streaming, the default job is similar, but not identical, to the Java equivalent. The minimal form is:
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input input/ncdc/sample.txt \
  -output output \
  -mapper /bin/cat
Notice that you have to supply a mapper: the default identity mapper will not work. The reason has to do with the default input format, TextInputFormat, which generates LongWritable keys and Text values. However, Streaming output keys and values (including the map keys and values) are always both of type Text. The identity mapper cannot change LongWritable keys to Text keys, so it fails.
When we specify a non-Java mapper and the input format is TextInputFormat, Streaming does something special. It doesn't pass the key to the mapper process; it just passes the value. (For other input formats, the same effect can be achieved by setting stream.map.input.ignoreKey to true.) This is actually very useful, since the key is just the line offset in the file and the value is the line, which is all most applications are interested in. The overall effect of this job is to perform a sort of the input.
With more of the defaults spelled out, the command looks like this:
% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input input/ncdc/sample.txt \
  -output output \
  -inputformat org.apache.hadoop.mapred.TextInputFormat \
  -mapper /bin/cat \
  -partitioner org.apache.hadoop.mapred.lib.HashPartitioner \
  -numReduceTasks 1 \
  -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
  -outputformat org.apache.hadoop.mapred.TextOutputFormat
The mapper and reducer arguments take a command or a Java class. A combiner may optionally be specified using the -combiner argument.
Keys and values in Streaming
A Streaming application can control the separator that is used when a key-value pair is turned into a series of bytes and sent to the map or reduce process over standard input. The default is a tab character, but it is useful to be able to change it in the case that the keys or values themselves contain tab characters.
1. Except when used in binary mode, from version 0.21.0 onward, via the -io rawbytes or -io typedbytes options. Text mode (-io text) is the default.
Similarly, when the map or reduce writes out key-value pairs, they may be separated by a configurable separator. Furthermore, the key from the output can be composed of more than the first field: it can be made up of the first n fields (defined by stream.num.map.output.key.fields or stream.num.reduce.output.key.fields), with the value being the remaining fields. For example, if the output from a Streaming process was a,b,c (and the separator is a comma), and n is two, then the key would be parsed as a,b and the value as c.
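The field-splitting rule can be sketched in plain Java (this mirrors the behavior described above; it is an illustration, not Streaming's actual implementation):

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Split a Streaming output line into key and value: the key is the first
// n separator-delimited fields, the value is whatever remains.
public class StreamingKeySplit {
    public static String[] splitKeyValue(String line, String separator, int n) {
        String[] fields = line.split(Pattern.quote(separator), -1);
        String key = String.join(separator,
                Arrays.copyOfRange(fields, 0, Math.min(n, fields.length)));
        String value = fields.length > n
                ? String.join(separator, Arrays.copyOfRange(fields, n, fields.length))
                : "";
        return new String[] {key, value};
    }

    public static void main(String[] args) {
        // With separator "," and n = 2, "a,b,c" yields key "a,b", value "c":
        String[] kv = splitKeyValue("a,b,c", ",", 2);
        System.out.println(kv[0] + " | " + kv[1]);  // a,b | c
    }
}
```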
Separators may be configured independently for maps and reduces. The properties are listed in Table 7-3 and shown in a diagram of the data flow path in Figure 7-1.
These settings do not have any bearing on the input and output formats. For example, if stream.reduce.output.field.separator were set to be a colon, say, and the reduce stream process wrote the line a:b to standard out, then the Streaming reducer would know to extract the key as a and the value as b. With the standard TextOutputFormat, this record would be written to the output file with a tab separating a and b. You can change the separator that TextOutputFormat uses by setting mapred.textoutputformat.separator.
A list of Streaming configuration parameters can be found on the Hadoop website.
Table 7-3. Streaming separator properties

stream.map.input.field.separator (String, default \t): The separator to use when passing the input key and value strings to the stream map process as a stream of bytes.
stream.map.output.field.separator (String, default \t): The separator to use when splitting the output from the stream map process into key and value strings for the map output.
stream.num.map.output.key.fields (int, default 1): The number of fields separated by stream.map.output.field.separator to treat as the map output key.
stream.reduce.input.field.separator (String, default \t): The separator to use when passing the input key and value strings to the stream reduce process as a stream of bytes.
stream.reduce.output.field.separator (String, default \t): The separator to use when splitting the output from the stream reduce process into key and value strings for the final reduce output.
stream.num.reduce.output.key.fields (int, default 1): The number of fields separated by stream.reduce.output.field.separator to treat as the reduce output key.
Figure 7-1. Where separators are used in a Streaming MapReduce job
Input Formats
Hadoop can process many different types of data formats, from flat text files to databases. In this section, we explore the different formats available.
Input Splits and Records
As we saw in Chapter 2, an input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record (a key-value pair) in turn. Splits and records are logical: there is nothing that requires them to be tied to files, for example, although in their most common incarnations, they are. In a database context, a split might correspond to a range of rows from a table and a record to a row in that range (this is precisely what DBInputFormat does, an input format for reading data from a relational database).
Input splits are represented by the Java class InputSplit (which, like all of the classes mentioned in this section, is in the org.apache.hadoop.mapreduce package):
public abstract class InputSplit {
  public abstract long getLength() throws IOException, InterruptedException;
  public abstract String[] getLocations() throws IOException,
      InterruptedException;
}
An InputSplit has a length in bytes and a set of storage locations, which are just hostname strings. Notice that a split doesn't contain the input data; it is just a reference to the data. The storage locations are used by the MapReduce system to place map tasks as close to the split's data as possible, and the size is used to order the splits so that the largest get processed first, in an attempt to minimize the job runtime (this is an instance of a greedy approximation algorithm).
2. But see the classes in org.apache.hadoop.mapred for the old MapReduce API counterparts.
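The largest-first ordering mentioned above is easy to sketch: sorting splits by descending length is a greedy heuristic for minimizing overall runtime, since a big split started last would dominate the tail of the job. A plain-Java illustration (the Split class here is hypothetical, not Hadoop's InputSplit):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Order splits largest-first, the greedy scheduling heuristic the text
// describes for minimizing job runtime.
public class SplitOrdering {
    static final class Split {
        final String name;
        final long length;
        Split(String name, long length) { this.name = name; this.length = length; }
    }

    public static List<Split> largestFirst(List<Split> splits) {
        List<Split> ordered = new ArrayList<>(splits);
        ordered.sort(Comparator.comparingLong((Split s) -> s.length).reversed());
        return ordered;
    }

    public static void main(String[] args) {
        List<Split> ordered = largestFirst(Arrays.asList(
                new Split("s1", 64L << 20),    // 64 MB
                new Split("s2", 128L << 20),   // 128 MB
                new Split("s3", 1L << 20)));   // 1 MB
        System.out.println(ordered.get(0).name);  // s2
    }
}
```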
As a MapReduce application writer, you don't need to deal with InputSplits directly, as they are created by an InputFormat. An InputFormat is responsible for creating the input splits and dividing them into records. Before we see some concrete examples of InputFormat, let's briefly examine how it is used in MapReduce. Here's the interface:
public abstract class InputFormat<K, V> {
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  public abstract RecordReader<K, V>
      createRecordReader(InputSplit split,
          TaskAttemptContext context) throws IOException,
              InterruptedException;
}
The client running the job calculates the splits for the job by calling getSplits(), then sends them to the jobtracker, which uses their storage locations to schedule map tasks to process them on the tasktrackers. On a tasktracker, the map task passes the split to the createRecordReader() method on InputFormat to obtain a RecordReader for that split. A RecordReader is little more than an iterator over records, and the map task uses one to generate record key-value pairs, which it passes to the map function. We can see this by looking at the Mapper's run() method:
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}
After running setup(), nextKeyValue() is called repeatedly on the Context (which delegates to the identically named method on the RecordReader) to populate the key and value objects for the mapper. The key and value are retrieved from the RecordReader by way of the Context and passed to the map() method for it to do its work. When the reader gets to the end of the stream, the nextKeyValue() method returns false, and the map task runs its cleanup() method and then completes.
It's not shown in the code snippet, but for reasons of efficiency RecordReader implementations will return the same key and value objects on each call to getCurrentKey() and getCurrentValue(). Only the contents of these objects are changed by the reader's nextKeyValue() method. This can be a surprise to users, who might expect keys and values to be immutable and not to be reused. This causes problems when a reference to a key or value object is retained outside the map() method, as its value can change without warning. If you need to do this, make a copy of the object you want to hold on to. For example, for a Text object, you can use its copy constructor: new Text(value).
The situation is similar with reducers. In this case, the value objects in the reducer's iterator are reused, so you need to copy any that you need to retain between calls to the iterator (see Example 8-14).
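The reuse pitfall from the note can be reproduced with any mutable object. Here a StringBuilder stands in for a reused Text: retaining a reference to the shared object keeps only the last value, while making a copy (the analogue of new Text(value)) preserves each record:

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates why retained references to a framework-reused object are
// wrong, and why copying (like new Text(value)) is right.
public class ObjectReusePitfall {
    // Returns {what the retained reference sees, what the copy preserved}
    // for the first record, after iterating over all values.
    public static String[] firstRetainedAndCopied(String[] values) {
        StringBuilder reused = new StringBuilder();        // the shared object
        List<StringBuilder> retained = new ArrayList<>();  // WRONG: shared refs
        List<String> copied = new ArrayList<>();           // RIGHT: copies
        for (String v : values) {
            reused.setLength(0);
            reused.append(v);                // the "reader" overwrites in place
            retained.add(reused);
            copied.add(reused.toString());
        }
        return new String[] {retained.get(0).toString(), copied.get(0)};
    }

    public static void main(String[] args) {
        String[] first = firstRetainedAndCopied(new String[] {"1901", "1902"});
        System.out.println(first[0]);  // 1902: the first value was silently lost
        System.out.println(first[1]);  // 1901: the copy kept it
    }
}
```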
Finally, note that the Mapper's run() method is public and may be customized by users. MultithreadedMapper is an implementation that runs mappers concurrently in a configurable number of threads (set by mapreduce.mapper.multithreadedmapper.threads). For most data processing tasks, it confers no advantage over the default implementation. However, for mappers that spend a long time processing each record (because they contact external servers, for example), it allows multiple mappers to run in one JVM with little contention. See "Fetcher: A multithreaded MapRunner in action" on page 575 for an example of an application that uses the multithreaded version (using the old API).
FileInputFormat is the base class for all implementations of InputFormat that use files as their data source (see Figure 7-2). It provides two things: a place to define which files are included as the input to a job, and an implementation for generating splits for the input files. The job of dividing splits into records is performed by subclasses.
FileInputFormat input paths
The input to a job is specified as a collection of paths, which offers great flexibility in constraining the input to a job. FileInputFormat offers four static convenience methods for setting a Job's input paths:
public static void addInputPath(Job job, Path path)
public static void addInputPaths(Job job, String commaSeparatedPaths)
public static void setInputPaths(Job job, Path... inputPaths)
public static void setInputPaths(Job job, String commaSeparatedPaths)

The addInputPath() and addInputPaths() methods add a path or paths to the list of inputs. You can call these methods repeatedly to build the list of paths. The setInputPaths() methods set the entire list of paths in one go (replacing any paths set on the Job in previous calls).
A path may represent a file, a directory, or, by using a glob, a collection of files and directories. A path representing a directory includes all the files in the directory as input to the job. See "File patterns" on page 67 for more on using globs.
The contents of a directory specified as an input path are not processed recursively. In fact, the directory should only contain files: if the directory contains a subdirectory, it will be interpreted as a file, which will cause an error. The way to handle this case is to use a file glob or a filter to select only the files in the directory based on a name pattern. Alternatively, set mapred.input.dir.recursive to true to force the input directory to be read recursively.
The add and set methods allow files to be specified by inclusion only. To exclude certain files from the input, you can set a filter using the setInputPathFilter() method on FileInputFormat:
public static void setInputPathFilter(Job job, Class<? extends PathFilter> filter)
Filters are discussed in more detail in "PathFilter" on page 68.
Figure 7-2. InputFormat class hierarchy
Even if you don't set a filter, FileInputFormat uses a default filter that excludes hidden files (those whose names begin with a dot or an underscore). If you set a filter by calling setInputPathFilter(), it acts in addition to the default filter. In other words, only nonhidden files that are accepted by your filter get through.
Paths and filters can be set through configuration properties, too (Table 7-4), which can be handy for Streaming and Pipes. Setting paths is done with the -input option for both Streaming and Pipes interfaces, so setting paths directly is not usually needed.
Table 7-4. Input path and filter properties

mapred.input.dir (comma-separated paths, default none): The input files for a job. Paths that contain commas should have those commas escaped by a backslash character. For example, the glob {a,b} would be escaped as {a\,b}.
mapred.input.pathFilter.class (PathFilter classname, default none): The filter to apply to the input files for a job.
FileInputFormat input splits
Given a set of files, how does FileInputFormat turn them into splits? FileInputFormat splits only large files. Here "large" means larger than an HDFS block. The split size is normally the size of an HDFS block, which is appropriate for most applications; however, it is possible to control this value by setting various Hadoop properties, as shown in Table 7-5.
Table 7-5. Properties for controlling split size

mapred.min.split.size (int, default 1): The smallest valid size in bytes for a file split.
mapred.max.split.size[a] (long, default Long.MAX_VALUE): The largest valid size in bytes for a file split.
dfs.block.size (long, default 64 MB, that is, 67108864): The size of a block in HDFS in bytes.

[a] This property is not present in the old MapReduce API (with the exception of CombineFileInputFormat). Instead, it is calculated indirectly as the size of the total input for the job divided by the guide number of map tasks specified by mapred.map.tasks (or the setNumMapTasks() method on JobConf). Because mapred.map.tasks defaults to 1, this makes the maximum split size the size of the input.
The minimum split size is usually 1 byte, although some formats have a lower bound on the split size. (For example, sequence files insert sync entries every so often in the stream, so the minimum split size has to be large enough to ensure that every split has a sync point, to allow the reader to resynchronize with a record boundary.)
236 | Chapter 7: MapReduce Types and Formats
Applications may impose a minimum split size: by setting this to a value larger than the block size, they can force splits to be larger than a block. There is no good reason for doing this when using HDFS, because doing so will increase the number of blocks that are not local to a map task.

The maximum split size defaults to the maximum value that can be represented by a Java long type. It has an effect only when it is less than the block size, forcing splits to be smaller than a block.
The split size is calculated by the formula (see the computeSplitSize() method in FileInputFormat):

max(minimumSize, min(maximumSize, blockSize))

By default:

minimumSize < blockSize < maximumSize

so the split size is blockSize. Various settings for these parameters and how they affect the final split size are illustrated in Table 7-6.
Table 7-6. Examples of how to control the split size

  1. Minimum split size: 1 (default); maximum split size: Long.MAX_VALUE
     (default); block size: 64 MB (default); split size: 64 MB. By default, the
     split size is the same as the default block size.

  2. Minimum split size: 1 (default); maximum split size: Long.MAX_VALUE
     (default); block size: 128 MB; split size: 128 MB. The most natural way to
     increase the split size is to have larger blocks in HDFS, by setting
     dfs.block.size, or on a per-file basis at file construction time.

  3. Minimum split size: 128 MB; maximum split size: Long.MAX_VALUE (default);
     block size: 64 MB (default); split size: 128 MB. Making the minimum split
     size greater than the block size increases the split size, but at the cost
     of locality.

  4. Minimum split size: 1 (default); maximum split size: 32 MB; block size:
     64 MB (default); split size: 32 MB. Making the maximum split size less
     than the block size decreases the split size.
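The formula is easy to check with a standalone sketch. This mirrors the logic of computeSplitSize() but is not the Hadoop source itself:

```java
public class SplitSize {

    // Mirrors the logic of FileInputFormat's computeSplitSize().
    static long computeSplitSize(long minimumSize, long maximumSize, long blockSize) {
        return Math.max(minimumSize, Math.min(maximumSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        // Defaults: the split size is the block size.
        System.out.println(computeSplitSize(1, Long.MAX_VALUE, 64 * mb) == 64 * mb);         // true
        // A minimum larger than the block size forces bigger splits.
        System.out.println(computeSplitSize(128 * mb, Long.MAX_VALUE, 64 * mb) == 128 * mb); // true
        // A maximum smaller than the block size forces smaller splits.
        System.out.println(computeSplitSize(1, 32 * mb, 64 * mb) == 32 * mb);                // true
    }
}
```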
Small files and CombineFileInputFormat
Hadoop works better with a small number of large files than a large number of small files. One reason for this is that FileInputFormat generates splits in such a way that each split is all or part of a single file. If the files are very small ("small" means significantly smaller than an HDFS block) and there are a lot of them, each map task will process very little input, and there will be a lot of map tasks (one per file), each of which imposes extra bookkeeping overhead. Compare a 1 GB file broken into sixteen 64 MB blocks and 10,000 or so 100 KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file and 16 map tasks.
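A rough sketch of the arithmetic behind this comparison: one split per block for the large file, one whole-file split per small file. The helper below is illustrative, not Hadoop code:

```java
public class SplitCount {

    // One split per block for a large file, and a whole-file split for
    // anything smaller than a block (illustrative simplification).
    static long numSplits(long fileSize, long blockSize) {
        return Math.max(1, (fileSize + blockSize - 1) / blockSize); // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long largeFileMaps = numSplits(1024 * mb, 64 * mb);            // one 1 GB file
        long smallFileMaps = 10_000 * numSplits(100 * 1024, 64 * mb);  // 10,000 x 100 KB files
        System.out.println(largeFileMaps);  // 16 map tasks
        System.out.println(smallFileMaps);  // 10000 map tasks, one per file
    }
}
```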
The situation is alleviated somewhat by CombineFileInputFormat, which was designed to work well with small files. Where FileInputFormat creates a split per file, CombineFileInputFormat packs many files into each split so that each mapper has more to process. Crucially, CombineFileInputFormat takes node and rack locality into account when deciding which blocks to place in the same split, so it does not compromise the speed at which it can process the input in a typical MapReduce job.

Of course, if possible, it is still a good idea to avoid the many small files case, since MapReduce works best when it can operate at the transfer rate of the disks in the cluster, and processing many small files increases the number of seeks that are needed to run a job. Also, storing large numbers of small files in HDFS is wasteful of the namenode's memory. One technique for avoiding the many small files case is to merge small files into larger files by using a SequenceFile: the keys can act as filenames (or a constant such as NullWritable, if not needed) and the values as file contents. See Example 7-4. But if you already have a large number of small files in HDFS, then CombineFileInputFormat is worth trying.
CombineFileInputFormat isn't just good for small files; it can bring benefits when processing large files, too. Essentially, CombineFileInputFormat decouples the amount of data that a mapper consumes from the block size of the files in HDFS.

If your mappers can process each block in a matter of seconds, you could use CombineFileInputFormat with the maximum split size set to a small multiple of the block size (by setting the mapred.max.split.size property in bytes) so that each mapper processes more than one block. In return, the overall processing time falls, since proportionally fewer mappers run, which reduces the overhead in task bookkeeping and startup time associated with a large number of short-lived mappers.
Since CombineFileInputFormat is an abstract class without any concrete classes (unlike FileInputFormat), you need to do a bit more work to use it. (Hopefully, common implementations will be added to the library over time.) For example, to have the CombineFileInputFormat equivalent of TextInputFormat, you would create a concrete subclass of CombineFileInputFormat and implement the getRecordReader() method.
Preventing splitting
Some applications don't want files to be split, so that a single mapper can process each input file in its entirety. For example, a simple way to check if all the records in a file are sorted is to go through the records in order, checking whether each record is not less than the preceding one. Implemented as a map task, this algorithm will work only if one map processes the whole file.

There are a couple of ways to ensure that an existing file is not split. The first (quick-and-dirty) way is to increase the minimum split size to be larger than the largest file in your system. Setting it to its maximum value, Long.MAX_VALUE, has this effect. The second is to subclass the concrete subclass of FileInputFormat that you want to use, to override the isSplitable() method to return false. For example, here's a nonsplittable TextInputFormat:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NonSplittableTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}
File information in the mapper
A mapper processing a file input split can find information about the split by calling the getInputSplit() method on the Mapper's Context object. When the input format derives from FileInputFormat, the InputSplit returned by this method can be cast to a FileSplit to access the file information listed in Table 7-7.

In the old MapReduce API, Streaming, and Pipes, the same file split information is made available through properties that can be read from the mapper's configuration. (In the old MapReduce API this is achieved by implementing configure() in your Mapper implementation to get access to the JobConf object.)

In addition to the properties in Table 7-7, all mappers and reducers have access to the properties listed in "The Task Execution Environment" on page 212.
Table 7-7. File split properties

  getPath()
      Property: map.input.file. Type: Path/String.
      The path of the input file being processed.

  getStart()
      Property: map.input.start. Type: long.
      The byte offset of the start of the split from the beginning of the file.

  getLength()
      Property: map.input.length. Type: long.
      The length of the split in bytes.

3. This is how the mapper in SortValidator.RecordStatsChecker is implemented.
4. In the method name isSplitable(), "splitable" has a single "t." It is usually spelled "splittable," which is the spelling I have used in this book.
In the next section, we'll see how to use a FileSplit when we need to access the split's filename.
Processing a whole file as a record
A related requirement that sometimes crops up is for mappers to have access to the full contents of a file. Not splitting the file gets you part of the way there, but you also need a RecordReader that delivers the file contents as the value of the record. The listing for WholeFileInputFormat in Example 7-2 shows a way of doing this.
Example 7-2. An InputFormat for reading a whole file as a record

public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) throws IOException,
      InterruptedException {
    WholeFileRecordReader reader = new WholeFileRecordReader();
    reader.initialize(split, context);
    return reader;
  }
}
WholeFileInputFormat defines a format where the keys are not used, represented by NullWritable, and the values are the file contents, represented by BytesWritable instances. It defines two methods. First, the format is careful to specify that input files should never be split, by overriding isSplitable() to return false. Second, we implement createRecordReader() to return a custom implementation of RecordReader, which appears in Example 7-3.
Example 7-3. The RecordReader used by WholeFileInputFormat for reading a whole file as a record

class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

  private FileSplit fileSplit;
  private Configuration conf;
  private BytesWritable value = new BytesWritable();
  private boolean processed = false;

  @Override
  public void initialize(InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    this.fileSplit = (FileSplit) split;
    this.conf = context.getConfiguration();
  }

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!processed) {
      byte[] contents = new byte[(int) fileSplit.getLength()];
      Path file = fileSplit.getPath();
      FileSystem fs = file.getFileSystem(conf);
      FSDataInputStream in = null;
      try {
        in = fs.open(file);
        IOUtils.readFully(in, contents, 0, contents.length);
        value.set(contents, 0, contents.length);
      } finally {
        IOUtils.closeStream(in);
      }
      processed = true;
      return true;
    }
    return false;
  }

  @Override
  public NullWritable getCurrentKey() throws IOException, InterruptedException {
    return NullWritable.get();
  }

  @Override
  public BytesWritable getCurrentValue() throws IOException, InterruptedException {
    return value;
  }

  @Override
  public float getProgress() throws IOException {
    return processed ? 1.0f : 0.0f;
  }

  @Override
  public void close() throws IOException {
    // do nothing
  }
}
WholeFileRecordReader is responsible for taking a FileSplit and converting it into a single record, with a null key and a value containing the bytes of the file. Because there is only a single record, WholeFileRecordReader has either processed it or not, so it maintains a boolean called processed. If the file has not been processed when the nextKeyValue() method is called, then we open the file, create a byte array whose length is the length of the file, and use the Hadoop IOUtils class to slurp the file into the byte array. Then we set the array on the BytesWritable instance and return true to signal that a record has been read.

The other methods are straightforward bookkeeping methods for accessing the current key and value types, getting the progress of the reader, and a close() method, which is invoked by the MapReduce framework when it has finished with the reader.
To demonstrate how WholeFileInputFormat can be used, consider a MapReduce job for packaging small files into sequence files, where the key is the original filename and the value is the content of the file. The listing is in Example 7-4.
Example 7-4. A MapReduce program for packaging a collection of small files as a single SequenceFile

public class SmallFilesToSequenceFileConverter extends Configured
    implements Tool {

  static class SequenceFileMapper
      extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {

    private Text filenameKey;

    @Override
    protected void setup(Context context) throws IOException,
        InterruptedException {
      InputSplit split = context.getInputSplit();
      Path path = ((FileSplit) split).getPath();
      filenameKey = new Text(path.toString());
    }

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context)
        throws IOException, InterruptedException {
      context.write(filenameKey, value);
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }

    job.setInputFormatClass(WholeFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(BytesWritable.class);

    job.setMapperClass(SequenceFileMapper.class);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new SmallFilesToSequenceFileConverter(), args);
    System.exit(exitCode);
  }
}
Since the input format is a WholeFileInputFormat, the mapper only has to find the filename for the input file split. It does this by casting the InputSplit from the context to a FileSplit, which has a method to retrieve the file path. The path is stored in a Text object for the key. The reducer is the identity (not explicitly set), and the output format is a SequenceFileOutputFormat.
Here's a run on a few small files. We've chosen to use two reducers, so we get two output sequence files:
% hadoop jar hadoop-examples.jar SmallFilesToSequenceFileConverter \
-conf conf/hadoop-localhost.xml -D mapred.reduce.tasks=2 input/smallfiles output
Two part files are created, each of which is a sequence file; we can inspect them with the -text option to the filesystem shell:
% hadoop fs -conf conf/hadoop-localhost.xml -text output/part-r-00000
hdfs://localhost/user/tom/input/smallfiles/a 61 61 61 61 61 61 61 61 61 61
hdfs://localhost/user/tom/input/smallfiles/c 63 63 63 63 63 63 63 63 63 63
% hadoop fs -conf conf/hadoop-localhost.xml -text output/part-r-00001
hdfs://localhost/user/tom/input/smallfiles/b 62 62 62 62 62 62 62 62 62 62
hdfs://localhost/user/tom/input/smallfiles/d 64 64 64 64 64 64 64 64 64 64
hdfs://localhost/user/tom/input/smallfiles/f 66 66 66 66 66 66 66 66 66 66
The input files were named a, b, c, d, e, and f, and each contained 10 characters of the corresponding letter (so, for example, a contained 10 "a" characters), except e, which was empty. We can see this in the textual rendering of the sequence files, which prints the filename followed by the hex representation of the file.
There's at least one way we could improve this program. As mentioned earlier, having one mapper per file is inefficient, so subclassing CombineFileInputFormat instead of FileInputFormat would be a better approach. Also, for a related technique of packing files into a Hadoop Archive rather than a sequence file, see "Hadoop Archives" on page 78.
Text Input
Hadoop excels at processing unstructured text. In this section, we discuss the different InputFormats that Hadoop provides to process text.
TextInputFormat

TextInputFormat is the default InputFormat. Each record is a line of input. The key, a LongWritable, is the byte offset within the file of the beginning of the line. The value is the contents of the line, excluding any line terminators (newline, carriage return), and is packaged as a Text object. So a file containing the following text:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
is divided into one split of four records. The records are interpreted as the following key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
Clearly, the keys are not line numbers. This would be impossible to implement in general, in that a file is broken into splits at byte, not line, boundaries. Splits are processed independently. Line numbers are really a sequential notion: you have to keep a count of lines as you consume them, so knowing the line number within a split would be possible, but not within the file.

However, the offset within the file of each line is known by each split independently of the other splits, since each split knows the size of the preceding splits and just adds this on to the offsets within the split to produce a global file offset. The offset is usually sufficient for applications that need a unique identifier for each line. Combined with the file's name, it is unique within the filesystem. Of course, if all the lines are a fixed width, then calculating the line number is simply a matter of dividing the offset by the line width.
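The offset calculation is easy to reproduce outside Hadoop. The sketch below computes the byte offset of each line of the sample text, matching the keys shown above (plain Java, assuming ASCII text with newline terminators):

```java
import java.util.ArrayList;
import java.util.List;

public class LineOffsets {

    // Byte offset of the start of each line, for ASCII text with '\n' endings.
    static List<Long> lineOffsets(String text) {
        List<Long> offsets = new ArrayList<>();
        long pos = 0;
        for (String line : text.split("\n")) {
            offsets.add(pos);
            pos += line.length() + 1; // +1 for the newline terminator
        }
        return offsets;
    }

    public static void main(String[] args) {
        String poem = "On the top of the Crumpetty Tree\n"
                + "The Quangle Wangle sat,\n"
                + "But his face you could not see,\n"
                + "On account of his Beaver Hat.";
        System.out.println(lineOffsets(poem)); // [0, 33, 57, 89]
    }
}
```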
The Relationship Between Input Splits and HDFS Blocks

The logical records that FileInputFormats define do not usually fit neatly into HDFS blocks. For example, a TextInputFormat's logical records are lines, which will cross HDFS boundaries more often than not. This has no bearing on the functioning of your program (lines are not missed or broken, for example), but it's worth knowing about, as it does mean that data-local maps (that is, maps that are running on the same host as their input data) will perform some remote reads. The slight overhead this causes is not normally significant.

Figure 7-3 shows an example. A single file is broken into lines, and the line boundaries do not correspond with the HDFS block boundaries. Splits honor logical record boundaries, in this case lines, so we see that the first split contains line 5, even though it spans the first and second block. The second split starts at line 6.
Figure 7-3. Logical records and HDFS blocks for TextInputFormat
KeyValueTextInputFormat

TextInputFormat's keys, being simply the offset within the file, are not normally very useful. It is common for each line in a file to be a key-value pair, separated by a delimiter such as a tab character. For example, this is the output produced by TextOutputFormat, Hadoop's default OutputFormat. To interpret such files correctly, KeyValueTextInputFormat is appropriate.

You can specify the separator via the mapreduce.input.keyvaluelinerecordreader.key.value.separator property (or key.value.separator.in.input.line in the old API). It is a tab character by default. Consider the following input file, where → represents a (horizontal) tab character:
line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.
As in the TextInputFormat case, the input is in a single split comprising four records, although this time the keys are the Text sequences before the tab in each line:
(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
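The key-value splitting rule can be sketched in plain Java (illustrative only; the assumption here, matching Hadoop's record reader, is that a line with no separator becomes a key with an empty value):

```java
public class KeyValueSplit {

    // Split a line at the first occurrence of the separator; a line with no
    // separator becomes a key with an empty value.
    static String[] splitKeyValue(String line, char separator) {
        int i = line.indexOf(separator);
        return i < 0 ? new String[] { line, "" }
                     : new String[] { line.substring(0, i), line.substring(i + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitKeyValue("line1\tOn the top of the Crumpetty Tree", '\t');
        System.out.println(kv[0]); // line1
        System.out.println(kv[1]); // On the top of the Crumpetty Tree
    }
}
```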
NLineInputFormat

With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input. The number depends on the size of the split and the length of the lines. If you want your mappers to receive a fixed number of lines of input, NLineInputFormat is the InputFormat to use. As with TextInputFormat, the keys are the byte offsets within the file and the values are the lines themselves.

N refers to the number of lines of input that each mapper receives. With N set to 1 (the default), each mapper receives exactly one line of input. The mapreduce.input.lineinputformat.linespermap property (mapred.line.input.format.linespermap in the old API) controls the value of N. By way of example, consider these four lines again:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
If, for example, N is 2, then each split contains two lines. One mapper will receive the first two key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
And another mapper will receive the second two key-value pairs:
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
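How NLineInputFormat carves up the lines can be sketched without Hadoop, keeping the byte offsets that would be each record's key. The class below is illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

public class NLineGrouping {

    // Group a file's lines into splits of n lines each, recording the byte
    // offset of each line ('\n' line endings assumed).
    static List<List<Long>> nLineSplits(String[] lines, int n) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            current.add(offset);
            offset += line.length() + 1;
            if (current.size() == n) {
                splits.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            splits.add(current); // a final, short split
        }
        return splits;
    }

    public static void main(String[] args) {
        String[] lines = {
            "On the top of the Crumpetty Tree",
            "The Quangle Wangle sat,",
            "But his face you could not see,",
            "On account of his Beaver Hat."
        };
        System.out.println(nLineSplits(lines, 2)); // [[0, 33], [57, 89]]
    }
}
```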
The keys and values are the same as those that TextInputFormat produces. What is different is the way the splits are constructed.

Usually, having a map task for a small number of lines of input is inefficient (due to the overhead in task setup), but there are applications that take a small amount of input data, run an extensive (that is, CPU-intensive) computation for it, and then emit their output. Simulations are a good example. By creating an input file that specifies input parameters, one per line, you can perform a parameter sweep: run a set of simulations in parallel to find how a model varies as the parameter changes.
If you have long-running simulations, you may fall afoul of task timeouts. When a task doesn't report progress for more than 10 minutes, the tasktracker assumes it has failed and aborts the process (see "Task Failure" on page 200).

The best way to guard against this is to report progress periodically, by writing a status message or incrementing a counter, for example. See "What Constitutes Progress in MapReduce?" on page 193.
Another example is using Hadoop to bootstrap data loading from multiple data sources, such as databases. You create a "seed" input file that lists the data sources, one per line. Then each mapper is allocated a single data source, and it loads the data from that source into HDFS. The job doesn't need the reduce phase, so the number of reducers should be set to zero (by calling setNumReduceTasks() on Job). Furthermore, MapReduce jobs can then be run to process the data loaded into HDFS. See Appendix C for an example.
XML

Most XML parsers operate on whole XML documents, so if a large XML document is made up of multiple input splits, it is a challenge to parse these individually. Of course, you can process the entire XML document in one mapper (if it is not too large) using the technique in "Processing a whole file as a record" on page 240.

Large XML documents that are composed of a series of "records" (XML document fragments) can be broken into these records using simple string or regular-expression matching to find the start and end tags of records. This alleviates the problem when the document is split by the framework, since the next start tag of a record is easy to find by simply scanning from the start of the split, just as TextInputFormat finds newline boundaries.
Hadoop comes with a class for this purpose called StreamXmlRecordReader (which is in the org.apache.hadoop.streaming package, although it can be used outside of Streaming). You can use it by setting your input format to StreamInputFormat and setting the stream.recordreader.class property to org.apache.hadoop.streaming.StreamXmlRecordReader. The reader is configured by setting job configuration properties to tell it the patterns for the start and end tags (see the class documentation for details).

To take an example, Wikipedia provides dumps of its content in XML form, which are appropriate for processing in parallel with MapReduce using this approach. The data is contained in one large XML wrapper document, which contains a series of elements, such as page elements that contain a page's content and associated metadata. Using StreamXmlRecordReader, the page elements can be interpreted as records for processing by a mapper.
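The string-matching idea is straightforward to sketch in plain Java. The regular-expression approach below is illustrative and much simpler than StreamXmlRecordReader itself; it assumes records do not nest and that the whole document fits in memory:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class XmlRecords {

    // Extract every span from a start tag to the next matching end tag.
    static List<String> records(String xml, String startTag, String endTag) {
        Pattern p = Pattern.compile(
                Pattern.quote(startTag) + ".*?" + Pattern.quote(endTag),
                Pattern.DOTALL); // non-greedy match, spanning newlines
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(xml);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String dump = "<mediawiki><page>first</page>\n<page>second</page></mediawiki>";
        System.out.println(records(dump, "<page>", "</page>"));
        // [<page>first</page>, <page>second</page>]
    }
}
```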
Binary Input

Hadoop MapReduce is not restricted to processing textual data; it has support for binary formats, too.

SequenceFileInputFormat

Hadoop's sequence file format stores sequences of binary key-value pairs. Sequence files are well suited as a format for MapReduce data because they are splittable (they have sync points so that readers can synchronize with record boundaries from an arbitrary point in the file, such as the start of a split), they support compression as a part of the format, and they can store arbitrary types using a variety of serialization frameworks. (These topics are covered in "SequenceFile" on page 132.)
To use data from sequence files as the input to MapReduce, you use SequenceFileInputFormat. The keys and values are determined by the sequence file, and you need to make sure that your map input types correspond. For example, if your sequence file has IntWritable keys and Text values, like the one created in Chapter 4, then the map signature would be Mapper<IntWritable, Text, K, V>, where K and V are the types of the map's output keys and values.
5. See Mahout's XmlInputFormat (available from http://mahout.apache.org/) for an improved XML input format.
Although its name doesn't give it away, SequenceFileInputFormat can read MapFiles as well as sequence files. If it finds a directory where it was expecting a sequence file, SequenceFileInputFormat assumes that it is reading a MapFile and uses its data file. This is why there is no MapFileInputFormat class.
SequenceFileAsTextInputFormat

SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat that converts the sequence file's keys and values to Text objects. The conversion is performed by calling toString() on the keys and values. This format makes sequence files suitable input for Streaming.
SequenceFileAsBinaryInputFormat

SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat that retrieves the sequence file's keys and values as opaque binary objects. They are encapsulated as BytesWritable objects, and the application is free to interpret the underlying byte array as it pleases. Combined with a process that creates sequence files with SequenceFile.Writer's appendRaw() method, this provides a way to use any binary data types with MapReduce (packaged as a sequence file), although plugging into Hadoop's serialization mechanism is normally a cleaner alternative (see "Serialization Frameworks" on page 110).
Multiple Inputs
Although the input to a MapReduce job may consist of multiple input files (constructed by a combination of file globs, filters, and plain paths), all of the input is interpreted by a single InputFormat and a single Mapper. What often happens, however, is that over time the data format evolves, so you have to write your mapper to cope with all of your legacy formats. Or you have data sources that provide the same type of data but in different formats. This arises in the case of performing joins of different datasets; see "Reduce-Side Joins" on page 284. For instance, one might be tab-separated plain text, and the other a binary sequence file. Even if they are in the same format, they may have different representations and, therefore, need to be parsed differently.

These cases are handled elegantly by using the MultipleInputs class, which allows you to specify the InputFormat and Mapper to use on a per-path basis. For example, if we had weather data from the UK Met Office that we wanted to combine with the NCDC data for our maximum temperature analysis, we might set up the input as follows:
6. Met Office data is generally available only to the research and academic community. However, there is a small amount of monthly weather station data available at http://www.metoffice.gov.uk/climate/uk/.
MultipleInputs.addInputPath(job, ncdcInputPath,
TextInputFormat.class, MaxTemperatureMapper.class);
MultipleInputs.addInputPath(job, metOfficeInputPath,
TextInputFormat.class, MetOfficeMaxTemperatureMapper.class);
This code replaces the usual calls to FileInputFormat.addInputPath() and job.setMapperClass(). Both the Met Office and NCDC data are text-based, so we use TextInputFormat for each. But the line format of the two data sources is different, so we use two different mappers. The MaxTemperatureMapper reads NCDC input data and extracts the year and temperature fields, while the MetOfficeMaxTemperatureMapper does the same for Met Office input data. The important thing is that the map outputs have the same types, since the reducers (which are all of the same type) see the aggregated map outputs and are not aware of the different mappers used to produce them.
The MultipleInputs class has an overloaded version of addInputPath() that doesn't take a mapper:
public static void addInputPath(Job job, Path path,
Class<? extends InputFormat> inputFormatClass)
This is useful when you have only one mapper (set using the Job's setMapperClass() method) but multiple input formats.
Database Input (and Output)
DBInputFormat is an input format for reading data from a relational database, using JDBC. Because it doesn't have any sharding capabilities, you need to be careful not to overwhelm the database you are reading from by running too many mappers. For this reason, it is best used for loading relatively small datasets, perhaps for joining with larger datasets from HDFS using MultipleInputs. The corresponding output format is DBOutputFormat, which is useful for dumping job outputs (of modest size) into a database.
using Sgoop, which is uesciiLeu in Chaptei 15.
HBase`s TableInputFormat is uesigneu to allow a MapReuuce piogiam to opeiate on
uata stoieu in an HBase taLle. TableOutputFormat is loi wiiting MapReuuce outputs
into an HBase taLle.
Output Formats
Hadoop has output data formats that correspond to the input formats covered in the previous section. The OutputFormat class hierarchy appears in Figure 7-4.

7. Instructions for how to use these formats are provided in "Database Access with Hadoop," http://www.cloudera.com/blog/2009/03/06/database-access-with-hadoop/, by Aaron Kimball.
Output Formats | 249
Figure 7-4. OutputFormat class hierarchy
Text Output
The default output format, TextOutputFormat, writes records as lines of text. Its keys and values may be of any type, since TextOutputFormat turns them to strings by calling toString() on them. Each key-value pair is separated by a tab character, although that may be changed using the mapreduce.output.textoutputformat.separator property (mapred.textoutputformat.separator in the old API). The counterpart to TextOutputFormat for reading in this case is KeyValueTextInputFormat, since it breaks lines into key-value pairs based on a configurable separator (see "KeyValueTextInputFormat" on page 245).

You can suppress the key or the value (or both, making this output format equivalent to NullOutputFormat, which emits nothing) from the output by using a NullWritable type. This also causes no separator to be written, which makes the output suitable for reading in using TextInputFormat.
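The suppression rule can be sketched as a plain string-rendering function (illustrative only; in the real TextOutputFormat the check is against NullWritable, not Java null):

```java
public class TextOutputSketch {

    // Render one output line: a missing key or value is suppressed, and the
    // separator appears only when both sides are present.
    static String render(String key, String value, String separator) {
        if (key == null && value == null) {
            return "";
        } else if (key == null) {
            return value;
        } else if (value == null) {
            return key;
        }
        return key + separator + value;
    }

    public static void main(String[] args) {
        System.out.println(render("1949", "111", "\t")); // 1949<tab>111
        System.out.println(render(null, "111", "\t"));   // 111 (no separator written)
    }
}
```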
Binary Output

SequenceFileOutputFormat

As the name indicates, SequenceFileOutputFormat writes sequence files for its output. This is a good choice of output if it forms the input to a further MapReduce job, since it is compact and is readily compressed. Compression is controlled via the static methods on SequenceFileOutputFormat, as described in "Using Compression in MapReduce" on page 92. For an example of how to use SequenceFileOutputFormat, see "Sorting" on page 266.

SequenceFileAsBinaryOutputFormat

SequenceFileAsBinaryOutputFormat is the counterpart to SequenceFileAsBinaryInputFormat, and it writes keys and values in raw binary format into a SequenceFile container.

MapFileOutputFormat

MapFileOutputFormat writes MapFiles as output. The keys in a MapFile must be added in order, so you need to ensure that your reducers emit keys in sorted order.

The reduce input keys are guaranteed to be sorted, but the output keys are under the control of the reduce function, and there is nothing in the general MapReduce contract that states that the reduce output keys have to be ordered in any way. The extra constraint of sorted reduce output keys is needed only for MapFileOutputFormat.
Multiple Outputs
FileOutputFormat and its subclasses generate a set of files in the output directory. There is one file per reducer, and files are named by the partition number: part-r-00000, part-r-00001, and so on. There is sometimes a need to have more control over the naming of the files or to produce multiple files per reducer. MapReduce comes with the MultipleOutputs class to help you do this.
An example: Partitioning data
Consiuei the pioLlem ol paititioning the weathei uataset Ly weathei station. Ve woulu
like to iun a joL whose output is a lile pei station, with each lile containing all the
iecoius loi that station.
8. In the old MapReduce API there are two classes for producing multiple outputs: MultipleOutputFormat
and MultipleOutputs. In a nutshell, MultipleOutputs is more fully featured, but MultipleOutputFormat has
more control over the output directory structure and file naming. MultipleOutputs in the new API
combines the best features of the two multiple output classes in the old API.
One way of doing this is to have a reducer for each weather station. To arrange this,
we need to do two things. First, write a partitioner that puts records from the same
weather station into the same partition. Second, set the number of reducers on the job
to be the number of weather stations. The partitioner would look like this:
public class StationPartitioner extends Partitioner<LongWritable, Text> {
  
  private NcdcRecordParser parser = new NcdcRecordParser();
  
  @Override
  public int getPartition(LongWritable key, Text value, int numPartitions) {
    parser.parse(value);
    return getPartition(parser.getStationId());
  }

  private int getPartition(String stationId) {
    ...
  }

}
The getPartition(String) method, whose implementation is not shown, turns the
station ID into a partition index. To do this, it needs a list of all the station IDs and
then just returns the index of the station ID in the list.
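The index-in-a-list lookup can be sketched in plain Java (a hypothetical helper, not part of Hadoop; the station IDs shown are illustrative):

```java
import java.util.Arrays;
import java.util.List;

public class StationIdPartitionLookup {
    // All known station IDs, e.g. loaded from station metadata
    private final List<String> stationIds;

    public StationIdPartitionLookup(List<String> stationIds) {
        this.stationIds = stationIds;
    }

    // The index of the station ID in the list serves as the partition number;
    // a station missing from the list has nowhere to go (-1)
    public int getPartition(String stationId) {
        return stationIds.indexOf(stationId);
    }

    public static void main(String[] args) {
        StationIdPartitionLookup lookup = new StationIdPartitionLookup(
            Arrays.asList("010010-99999", "010050-99999", "029070-99999"));
        System.out.println(lookup.getPartition("029070-99999")); // 2
    }
}
```

The -1 case makes the first drawback discussed below concrete: a station that appears in the data but not in the metadata list has no valid partition.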
There are two drawbacks to this approach. The first is that since the number of parti-
tions needs to be known before the job is run, so does the number of weather stations.
Although the NCDC provides metadata about its stations, there is no guarantee that
the IDs encountered in the data match those in the metadata. A station that appears in
the metadata but not in the data wastes a reducer slot. Worse, a station that appears
in the data but not in the metadata doesn't get a reducer slot; it has to be thrown away.
One way of mitigating this problem would be to write a job to extract the unique station
IDs, but it's a shame that we need an extra job to do this.
The second drawback is more subtle. It is generally a bad idea to allow the number of
partitions to be rigidly fixed by the application, since it can lead to small or uneven-
sized partitions. Having many reducers doing a small amount of work isn't an efficient
way of organizing a job: it's much better to get reducers to do more work and have
fewer of them, as the overhead in running a task is then reduced. Uneven-sized parti-
tions can be difficult to avoid, too. Different weather stations will have gathered a
widely varying amount of data: compare a station that opened one year ago to one that
has been gathering data for one century. If a few reduce tasks take significantly longer
than the others, they will dominate the job execution time and cause it to be longer
than it needs to be.
There are two special cases when it does make sense to allow the ap-
plication to set the number of partitions (or equivalently, the number
of reducers):
Zero reducers
This is a vacuous case: there are no partitions, as the application
needs to run only map tasks.
One reducer
It can be convenient to run small jobs to combine the output of
previous jobs into a single file. This should only be attempted when
the amount of data is small enough to be processed comfortably
by one reducer.
It is much better to let the cluster drive the number of partitions for a job, the idea
being that the more cluster reduce slots are available, the faster the job can complete.
This is why the default HashPartitioner works so well, as it works with any number of
partitions and ensures each partition has a good mix of keys, leading to more even-sized
partitions.
If we go back to using HashPartitioner, each partition will contain multiple stations,
so to create a file per station, we need to arrange for each reducer to write multiple files,
which is where MultipleOutputs comes in.
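For reference, the formula used by Hadoop's default HashPartitioner is (key.hashCode() & Integer.MAX_VALUE) % numPartitions; a dependency-free sketch:

```java
public class HashPartitionSketch {
    // Same formula as Hadoop's default HashPartitioner: mask off the sign bit
    // so the result is non-negative, then take the remainder modulo the
    // number of partitions
    static int getPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Works for any number of partitions; equal keys always land in the
        // same partition, and a good hash spreads distinct keys evenly
        System.out.println(getPartition("029070-99999", 30));
    }
}
```

Because the partition depends only on the key's hash and the partition count, the application never has to know the set of keys in advance, which is exactly what the fixed-partitioner approach above lacked.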
MultipleOutputs allows you to write data to files whose names are derived from the
output keys and values, or in fact from an arbitrary string. This allows each reducer (or
mapper in a map-only job) to create more than a single file. File names are of the form
name-m-nnnnn for map outputs and name-r-nnnnn for reduce outputs, where name is an
arbitrary name that is set by the program, and nnnnn is an integer designating the part
number, starting from zero. The part number ensures that outputs written from dif-
ferent partitions (mappers or reducers) do not collide in the case of the same name.
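The naming convention can be pinned down with a one-line formatting rule (a hypothetical helper, shown only to make the zero-padding explicit):

```java
public class MultipleOutputsNaming {
    // name is chosen by the program; taskType is 'm' for map output and 'r'
    // for reduce output; part is the zero-based partition number, zero-padded
    // to five digits
    static String fileName(String name, char taskType, int part) {
        return String.format("%s-%c-%05d", name, taskType, part);
    }

    public static void main(String[] args) {
        System.out.println(fileName("029070-99999", 'r', 0)); // 029070-99999-r-00000
    }
}
```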
The program in Example 7-5 shows how to use MultipleOutputs to partition the dataset
by station.
Example 7-5. Partitions whole dataset into files named by the station ID using MultipleOutputs
public class PartitionByStationUsingMultipleOutputs extends Configured
  implements Tool {
  
  static class StationMapper
    extends Mapper<LongWritable, Text, Text, Text> {
  
    private NcdcRecordParser parser = new NcdcRecordParser();
    
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      parser.parse(value);
      context.write(new Text(parser.getStationId()), value);
    }
  }
  
  static class MultipleOutputsReducer
    extends Reducer<Text, Text, NullWritable, Text> {
    
    private MultipleOutputs<NullWritable, Text> multipleOutputs;

    @Override
    protected void setup(Context context)
        throws IOException, InterruptedException {
      multipleOutputs = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text value : values) {
        multipleOutputs.write(NullWritable.get(), value, key.toString());
      }
    }
    
    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      multipleOutputs.close();
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }
    
    job.setMapperClass(StationMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setReducerClass(MultipleOutputsReducer.class);
    job.setOutputKeyClass(NullWritable.class);
    
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new PartitionByStationUsingMultipleOutputs(),
        args);
    System.exit(exitCode);
  }
}
In the reducer, where we generate the output, we construct an instance of MultipleOut
puts in the setup() method and assign it to an instance variable. We then use the
MultipleOutputs instance in the reduce() method to write to the output, in place of the
context. The write() method takes the key and value, as well as a name. We use the
station identifier for the name, so the overall effect is to produce output files with the
naming scheme station_identifier-r-nnnnn.
In one run, the first few output files were named as follows (other columns from the
directory listing have been dropped):
The base path specified in the write() method of MultipleOutputs is interpreted relative
to the output directory, and since it may contain file path separator characters (/), it's
possible to create subdirectories of arbitrary depth. For example, the following modi-
fication partitions the data by station and year so that each year's data is contained in
a directory named by the station ID (such as 029070-99999/1901/part-r-00000):
@Override
protected void reduce(Text key, Iterable<Text> values, Context context)
    throws IOException, InterruptedException {
  for (Text value : values) {
    parser.parse(value);
    String basePath = String.format("%s/%s/part",
        parser.getStationId(), parser.getYear());
    multipleOutputs.write(NullWritable.get(), value, basePath);
  }
}
MultipleOutputs delegates to the mapper's OutputFormat, which in this example is a
TextOutputFormat, but more complex setups are possible. For example, you can create
named outputs, each with its own OutputFormat and key and value types (which may
differ from the output types of the mapper or reducer). Furthermore, the mapper or
reducer (or both) may write to multiple output files for each record processed. Please
consult the Java documentation for more information.
Lazy Output
FileOutputFormat subclasses will create output (part-r-nnnnn) files, even if they are
empty. Some applications prefer that empty files not be created, which is where Lazy
OutputFormat helps. It is a wrapper output format that ensures that the output file is
created only when the first record is emitted for a given partition. To use it, call its
setOutputFormatClass() method with the JobConf and the underlying output format.
Streaming and Pipes support a -lazyOutput option to enable LazyOutputFormat.
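The idea behind LazyOutputFormat, deferring creation of the underlying output until the first record arrives, can be sketched without Hadoop (LazyWriter and WriterFactory here are hypothetical stand-ins, not Hadoop classes):

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

public class LazyWriter {
    interface WriterFactory {
        Writer create() throws IOException;
    }

    private final WriterFactory factory;
    private Writer delegate;        // created only on first write

    LazyWriter(WriterFactory factory) {
        this.factory = factory;
    }

    boolean isCreated() {
        return delegate != null;
    }

    void write(String record) throws IOException {
        if (delegate == null) {
            delegate = factory.create();   // first record triggers creation
        }
        delegate.write(record);
    }

    public static void main(String[] args) throws IOException {
        LazyWriter w = new LazyWriter(StringWriter::new);
        System.out.println(w.isCreated()); // false: nothing written yet
        w.write("record");
        System.out.println(w.isCreated()); // true: created on first write
    }
}
```

If write() is never called, no underlying writer (and, by analogy, no empty part file) is ever created.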
Database Output
The output formats for writing to relational databases and to HBase are mentioned in
"Database Input (and Output)" on page 249.
MapReduce Features
This chapter looks at some of the more advanced features of MapReduce, including
counters and sorting and joining datasets.
Counters
There are often things you would like to know about the data you are analyzing but
that are peripheral to the analysis you are performing. For example, if you were counting
invalid records and discovered that the proportion of invalid records in the whole da-
taset was very high, you might be prompted to check why so many records were being
marked as invalid; perhaps there is a bug in the part of the program that detects invalid
records? Or if the data were of poor quality and genuinely did have very many invalid
records, after discovering this, you might decide to increase the size of the dataset so
that the number of good records was large enough for meaningful analysis.
Counters are a useful channel for gathering statistics about the job: for quality control
or for application-level statistics. They are also useful for problem diagnosis. If you are
tempted to put a log message into your map or reduce task, it is often better to
see whether you can use a counter instead to record that a particular condition occurred.
In addition to counter values being much easier to retrieve than log output for large
distributed jobs, you get a record of the number of times that condition occurred, which
is more work to obtain from a set of logfiles.
Built-in Counters
Hadoop maintains some built-in counters for every job, which report various metrics
for your job. For example, there are counters for the number of bytes and records
processed, which allows you to confirm that the expected amount of input was con-
sumed and the expected amount of output was produced.
Counters are divided into groups, and there are several groups for the built-in counters,
listed in Table 8-1.
Table 8-1. Built-in counter groups

Group: MapReduce task counters
  Name/Enum: org.apache.hadoop.mapred.Task$Counter (0.20)
             org.apache.hadoop.mapreduce.TaskCounter (post 0.20)
  Reference: Table 8-2

Group: Filesystem counters
  Name/Enum: FileSystemCounters (0.20)
             org.apache.hadoop.mapreduce.FileSystemCounter (post 0.20)
  Reference: Table 8-3

Group: FileInputFormat counters
  Name/Enum: org.apache.hadoop.mapred.FileInputFormat$Counter (0.20)
             org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter (post 0.20)
  Reference: Table 8-4

Group: FileOutputFormat counters
  Name/Enum: org.apache.hadoop.mapred.FileOutputFormat$Counter (0.20)
             org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter (post 0.20)
  Reference: Table 8-5

Group: Job counters
  Name/Enum: org.apache.hadoop.mapred.JobInProgress$Counter (0.20)
             org.apache.hadoop.mapreduce.JobCounter (post 0.20)
  Reference: Table 8-6
Each group either contains task counters (which are updated as a task progresses) or
job counters (which are updated as a job progresses). We look at both types in the
following sections.
Task counters
Task counters gather information about tasks over the course of their execution, and
the results are aggregated over all the tasks in a job. For example, the
MAP_INPUT_RECORDS counter counts the input records read by each map task and aggre-
gates over all map tasks in a job, so that the final figure is the total number of input
records for the whole job.
Task counters are maintained by each task attempt, and periodically sent to the task-
tracker and then to the jobtracker, so they can be globally aggregated. (This is described
in "Progress and Status Updates" on page 192. Note that the information flow is dif-
ferent in YARN; see "YARN (MapReduce 2)" on page 194.) Task counters are sent in
full every time, rather than sending the counts since the last transmission, since this
guards against errors due to lost messages. Furthermore, during a job run, counters
may go down if a task fails.
Counter values are definitive only once a job has successfully completed. However,
some counters provide useful diagnostic information as a task is progressing, and it can
be useful to monitor them with the web UI. For example, PHYSICAL_MEMORY_BYTES,
VIRTUAL_MEMORY_BYTES, and COMMITTED_HEAP_BYTES provide an indication of how mem-
ory usage varies over the course of a particular task attempt.
The built-in task counters include those in the MapReduce task counters group (Ta-
ble 8-2) and those in the file-related counters groups (Table 8-3, Table 8-4, Table 8-5).
Table 8-2. Built-in MapReduce task counters

Map input records: The number of input records consumed by all the maps in the job. Incremented every
time a record is read from a RecordReader and passed to the map's map() method by the framework.

Map skipped records: The number of input records skipped by all the maps in the job. See "Skipping Bad
Records" on page 217.

Map input bytes: The number of bytes of uncompressed input consumed by all the maps in the job.
Incremented every time a record is read from a RecordReader and passed to the map's map() method
by the framework.

Split raw bytes: The number of bytes of input split objects read by maps. These objects represent
the split metadata (that is, the offset and length within a file) rather than the split data itself, so the
total size should be small.

Map output records: The number of map output records produced by all the maps in the job.
Incremented every time the collect() method is called on a map's OutputCollector.

Map output bytes: The number of bytes of uncompressed output produced by all the maps in the job.
Incremented every time the collect() method is called on a map's OutputCollector.

Map output materialized bytes: The number of bytes of map output actually written to disk. If map
output compression is enabled, this is reflected in the counter value.

Combine input records: The number of input records consumed by all the combiners (if any) in the job.
Incremented every time a value is read from the combiner's iterator over values. Note that this count
is the number of values consumed by the combiner, not the number of distinct key groups (which would
not be a useful metric, since there is not necessarily one group per key for a combiner; see "Combiner
Functions" on page 34, and also "Shuffle and Sort" on page 205).

Combine output records: The number of output records produced by all the combiners (if any) in the job.
Incremented every time the collect() method is called on a combiner's OutputCollector.

Reduce input groups: The number of distinct key groups consumed by all the reducers in the job. Incre-
mented every time the reducer's reduce() method is called by the framework.

Reduce input records: The number of input records consumed by all the reducers in the job. Incremented
every time a value is read from the reducer's iterator over values. If reducers consume all of their
inputs, this count should be the same as the count for Map output records.

Reduce output records: The number of reduce output records produced by all the reducers in the job.
Incremented every time the collect() method is called on a reducer's OutputCollector.

Reduce skipped groups: The number of distinct key groups skipped by all the reducers in the job. See
"Skipping Bad Records" on page 217.

Reduce skipped records: The number of input records skipped by all the reducers in the job.

Reduce shuffle bytes: The number of bytes of map output copied by the shuffle to reducers.

Spilled records: The number of records spilled to disk in all map and reduce tasks in the job.

CPU milliseconds: The cumulative CPU time for a task in milliseconds, as reported by /proc/cpuinfo.

Physical memory bytes: The physical memory being used by a task in bytes, as reported by /proc/meminfo.

Virtual memory bytes: The virtual memory being used by a task in bytes, as reported by /proc/meminfo.

Committed heap bytes: The total amount of memory available in the JVM in bytes, as reported by
Runtime.getRuntime().totalMemory().

GC time milliseconds: The elapsed time for garbage collection in tasks in milliseconds, as reported by
GarbageCollectorMXBean.getCollectionTime(). From 0.21.

Shuffled maps: The number of map output files transferred to reducers by the shuffle (see "Shuffle
and Sort" on page 205). From 0.21.

Failed shuffle: The number of map output copy failures during the shuffle. From 0.21.

Merged map outputs: The number of map outputs that have been merged on the reduce side of the shuffle.
From 0.21.
Table 8-3. Built-in filesystem task counters

Filesystem bytes read: The number of bytes read by each filesystem by map and reduce tasks. There is
a counter for each filesystem: Filesystem may be Local, HDFS, S3, KFS, etc.

Filesystem bytes written: The number of bytes written by each filesystem by map and reduce tasks.

Table 8-4. Built-in FileInputFormat task counters

Bytes read: The number of bytes read by map tasks via the FileInputFormat.

Table 8-5. Built-in FileOutputFormat task counters

Bytes written: The number of bytes written by map tasks (for map-only jobs) or reduce tasks via the
FileOutputFormat.
Job counters
Job counters (Table 8-6) are maintained by the jobtracker (or application master in
YARN), so they don't need to be sent across the network, unlike all other counters,
including user-defined ones. They measure job-level statistics, not values that change
while a task is running. For example, TOTAL_LAUNCHED_MAPS counts the number of map
tasks that were launched over the course of a job (including ones that failed).
Table 8-6. Built-in job counters

Launched map tasks: The number of map tasks that were launched. Includes tasks that were started
speculatively.

Launched reduce tasks: The number of reduce tasks that were launched. Includes tasks that were
started speculatively.

Launched uber tasks: The number of uber tasks (see "YARN (MapReduce 2)" on page 194) that were
launched. From 0.23.

Maps in uber tasks: The number of maps in uber tasks. From 0.23.

Reduces in uber tasks: The number of reduces in uber tasks. From 0.23.

Failed map tasks: The number of map tasks that failed. See "Task Failure" on page 200 for potential
causes.

Failed reduce tasks: The number of reduce tasks that failed.

Failed uber tasks: The number of uber tasks that failed. From 0.23.

Data-local map tasks: The number of map tasks that ran on the same node as their input data.

Rack-local map tasks: The number of map tasks that ran on a node in the same rack as their input
data, but that are not data-local.

Other local map tasks: The number of map tasks that ran on a node in a different rack from their
input data. Inter-rack bandwidth is scarce, and Hadoop tries to place map tasks close to their input
data, so this count should be low. See Figure 2-2.

Total time in map tasks: The total time taken running map tasks in milliseconds. Includes tasks that
were started speculatively.

Total time in reduce tasks: The total time taken running reduce tasks in milliseconds. Includes
tasks that were started speculatively.

Total time in map tasks waiting after reserving slots: The total time spent waiting after reserving
slots for map tasks in milliseconds. Slot reservation is a Capacity Scheduler feature for high-memory
jobs; see "Task memory limits" on page 316. Not used by YARN-based MapReduce.

Total time in reduce tasks waiting after reserving slots: The total time spent waiting after reserving
slots for reduce tasks in milliseconds. Slot reservation is a Capacity Scheduler feature for high-
memory jobs; see "Task memory limits" on page 316. Not used by YARN-based MapReduce.
User-Defined Java Counters
MapReduce allows user code to define a set of counters, which are then incremented
as desired in the mapper or reducer. Counters are defined by a Java enum, which serves
to group related counters. A job may define an arbitrary number of enums, each with
an arbitrary number of fields. The name of the enum is the group name, and the enum's
fields are the counter names. Counters are global: the MapReduce framework aggre-
gates them across all maps and reduces to produce a grand total at the end of the job.
We created some counters in Chapter 5 for counting malformed records in the weather
dataset. The program in Example 8-1 extends that example to count the number of
missing records and the distribution of temperature quality codes.
Example 8-1. Application to run the maximum temperature job, including counting missing and
malformed fields and quality codes
public class MaxTemperatureWithCounters extends Configured implements Tool {
  
  enum Temperature {
    MISSING,
    MALFORMED
  }
  
  static class MaxTemperatureMapperWithCounters extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
    
    private NcdcRecordParser parser = new NcdcRecordParser();
  
    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      
      parser.parse(value);
      if (parser.isValidTemperature()) {
        int airTemperature = parser.getAirTemperature();
        output.collect(new Text(parser.getYear()),
            new IntWritable(airTemperature));
      } else if (parser.isMalformedTemperature()) {
        System.err.println("Ignoring possibly corrupt input: " + value);
        reporter.incrCounter(Temperature.MALFORMED, 1);
      } else if (parser.isMissingTemperature()) {
        reporter.incrCounter(Temperature.MISSING, 1);
      }
      
      // dynamic counter
      reporter.incrCounter("TemperatureQuality", parser.getQuality(), 1);
    }
  }
  
  @Override
  public int run(String[] args) throws IOException {
    JobConf conf = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (conf == null) {
      return -1;
    }
    
    // job configuration as in the original maximum temperature application (Chapter 5)
    JobClient.runJob(conf);
    return 0;
  }
  
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MaxTemperatureWithCounters(), args);
    System.exit(exitCode);
  }
}
The best way to see what this program does is to run it over the complete dataset:
% hadoop jar hadoop-examples.jar MaxTemperatureWithCounters input/ncdc/all output-counters
When the job has successfully completed, it prints out the counters at the end (this is
done by JobClient's runJob() method). Here are the ones we are interested in:
09/04/20 06:33:36 INFO mapred.JobClient: TemperatureQuality
09/04/20 06:33:36 INFO mapred.JobClient: 2=1246032
09/04/20 06:33:36 INFO mapred.JobClient: 1=973422173
09/04/20 06:33:36 INFO mapred.JobClient: 0=1
09/04/20 06:33:36 INFO mapred.JobClient: 6=40066
09/04/20 06:33:36 INFO mapred.JobClient: 5=158291879
09/04/20 06:33:36 INFO mapred.JobClient: 4=10764500
09/04/20 06:33:36 INFO mapred.JobClient: 9=66136858
09/04/20 06:33:36 INFO mapred.JobClient: Air Temperature Records
09/04/20 06:33:36 INFO mapred.JobClient: Malformed=3
09/04/20 06:33:36 INFO mapred.JobClient: Missing=66136856
Dynamic counters
The code makes use of a dynamic counter, one that isn't defined by a Java enum. Since
a Java enum's fields are defined at compile time, you can't create new counters on the
fly using enums. Here we want to count the distribution of temperature quality codes,
and though the format specification defines the values that it can take, it is more con-
venient to use a dynamic counter to emit the values that it actually takes. The method
we use on the Reporter object takes a group and counter name using String names:
public void incrCounter(String group, String counter, long amount)
The two ways of creating and accessing counters, using enums and using Strings,
are actually equivalent since Hadoop turns enums into Strings to send counters over
RPC. Enums are slightly easier to work with, provide type safety, and are suitable for
most jobs. For the odd occasion when you need to create counters dynamically, you
can use the String interface.
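This equivalence can be illustrated in plain Java: for an enum-based counter, the group name is the enum's fully qualified class name and the counter name is the constant's name (a sketch of the mapping, not Hadoop's actual serialization code):

```java
public class EnumCounterNames {
    enum Temperature { MISSING, MALFORMED }

    // How an enum-based counter maps onto the String interface: the enum's
    // fully qualified class name is the group name
    static String groupOf(Enum<?> counter) {
        return counter.getDeclaringClass().getName();
    }

    // ...and the enum constant's name is the counter name
    static String nameOf(Enum<?> counter) {
        return counter.name();
    }

    public static void main(String[] args) {
        System.out.println(groupOf(Temperature.MISSING)
            + ":" + nameOf(Temperature.MISSING));
    }
}
```

This is also why, as the next section explains, enum-based counter names show up on the web UI as unreadable fully qualified classnames by default.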
Readable counter names
By default, a counter's name is the enum's fully qualified Java classname. These names
are not very readable when they appear on the web UI or in the console, so Hadoop
provides a way to change the display names using resource bundles. We've done this
here, so we see "Air Temperature Records" instead of "Temperature$MISSING." For
dynamic counters, the group and counter names are used for the display names, so this
is not normally an issue.
The recipe to provide readable names is as follows. Create a properties file named after
the enum, using an underscore as a separator for nested classes. The properties file
should be in the same directory as the top-level class containing the enum. The file is
named MaxTemperatureWithCounters_Temperature.properties for the counters in Ex-
ample 8-1.
The properties file should contain a single property named CounterGroupName, whose
value is the display name for the whole group. Then each field in the enum should have
a corresponding property defined for it, whose name is the name of the field suffixed
with .name, and whose value is the display name for the counter. Here are the contents
of MaxTemperatureWithCounters_Temperature.properties:
CounterGroupName=Air Temperature Records
MISSING.name=Missing
MALFORMED.name=Malformed
Hadoop uses the standard Java localization mechanisms to load the correct properties
for the locale you are running in, so, for example, you can create a Chinese version of
the properties in a file named MaxTemperatureWithCounters_Tempera-
ture_zh_CN.properties, and they will be used when running in the zh_CN locale. Refer
to the documentation for java.util.PropertyResourceBundle for more information.
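A minimal sketch of how such a bundle resolves display names, using the standard PropertyResourceBundle API with the bundle contents inlined rather than loaded from a file on the classpath:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class CounterDisplayNames {
    // Load display names from a properties bundle; in a real job the bundle
    // would live next to the class containing the enum, but the contents are
    // inlined here to keep the sketch self-contained
    static ResourceBundle bundle() throws IOException {
        String props = "CounterGroupName=Air Temperature Records\n"
                     + "MISSING.name=Missing\n"
                     + "MALFORMED.name=Malformed\n";
        return new PropertyResourceBundle(new StringReader(props));
    }

    public static void main(String[] args) throws IOException {
        ResourceBundle b = bundle();
        System.out.println(b.getString("CounterGroupName")); // Air Temperature Records
        System.out.println(b.getString("MISSING.name"));     // Missing
    }
}
```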
Retrieving counters
In addition to being available via the web UI and the command line (using hadoop job
-counter), you can retrieve counter values using the Java API. You can do this while
the job is running, although it is more usual to get counters at the end of a job run,
when they are stable. Example 8-2 shows a program that calculates the proportion of
records that have missing temperature fields.
Example 8-2. Application to calculate the proportion of records with missing temperature fields
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class MissingTemperatureFields extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 1) {
      JobBuilder.printUsage(this, "<job ID>");
      return -1;
    }
    JobClient jobClient = new JobClient(new JobConf(getConf()));
    String jobID = args[0];
    RunningJob job = jobClient.getJob(JobID.forName(jobID));
    if (job == null) {
      System.err.printf("No job with ID %s found.\n", jobID);
      return -1;
    }
    if (!job.isComplete()) {
      System.err.printf("Job %s is not complete.\n", jobID);
      return -1;
    }

    Counters counters = job.getCounters();
    long missing = counters.getCounter(
        MaxTemperatureWithCounters.Temperature.MISSING);
    long total = counters.findCounter("org.apache.hadoop.mapred.Task$Counter",
        "MAP_INPUT_RECORDS").getCounter();

    System.out.printf("Records with missing temperature fields: %.2f%%\n",
        100.0 * missing / total);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MissingTemperatureFields(), args);
    System.exit(exitCode);
  }
}
First we retrieve a RunningJob object from a JobClient, by calling the getJob() method
with the job ID. We check whether there is actually a job with the given ID. There may
not be, either because the ID was incorrectly specified or because the jobtracker no
longer has a reference to the job (only the last 100 jobs are kept in memory, controlled
by mapred.jobtracker.completeuserjobs.maximum, and all are cleared out if the job-
tracker is restarted).
After confirming that the job has completed, we call the RunningJob's getCounters()
method, which returns a Counters object, encapsulating all the counters for a job. The
Counters class provides various methods for finding the names and values of counters.
We use the getCounter() method, which takes an enum to find the number of records
that had a missing temperature field.
There are also findCounter() methods, all of which return a Counter object. We use
this form to retrieve the built-in counter for map input records. To do this, we refer to
the counter by its group name (the fully qualified Java classname for the enum) and
counter name (both strings).
Finally, we print the proportion of records that had a missing temperature field. Here's
what we get for the whole weather dataset:
% hadoop jar hadoop-examples.jar MissingTemperatureFields job_200904200610_0003
Records with missing temperature fields: 5.47%
User-Defined Streaming Counters
A Streaming MapReduce program can increment counters by sending a specially for-
matted line to the standard error stream, which is co-opted as a control channel in this
case. The line must have the following format:
reporter:counter:group,counter,amount
This snippet in Python shows how to increment the "Missing" counter in the "Tem-
perature" group by one:
sys.stderr.write("reporter:counter:Temperature,Missing,1\n")
In a similar way, a status message may be sent with a line formatted like this:
reporter:status:message
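These control lines can be sketched as simple string formatting (a hypothetical helper; a real Streaming task just prints the resulting line to standard error):

```java
public class StreamingCounterLine {
    // Builds the control line a Streaming task writes to standard error to
    // increment a counter: reporter:counter:<group>,<counter>,<amount>
    static String counterLine(String group, String counter, long amount) {
        return String.format("reporter:counter:%s,%s,%d", group, counter, amount);
    }

    // The analogous status line: reporter:status:<message>
    static String statusLine(String message) {
        return "reporter:status:" + message;
    }

    public static void main(String[] args) {
        // A real Streaming task would print this to stderr, not stdout
        System.err.println(counterLine("Temperature", "Missing", 1));
        System.err.println(statusLine("processing record"));
    }
}
```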
Sorting
The ability to sort data is at the heart of MapReduce. Even if your application isn't
concerned with sorting per se, it may be able to use the sorting stage that MapReduce
provides to organize its data. In this section, we will examine different ways of sorting
datasets and how you can control the sort order in MapReduce. Sorting Avro data is
covered separately in "Sorting using Avro MapReduce" on page 130.
Preparation
We are going to sort the weather dataset by temperature. Storing temperatures as
Text objects doesn't work for sorting purposes, since signed integers don't sort
lexicographically. Instead, we are going to store the data using sequence files whose
1. The built-in counters' enums are not currently a part of the public API, so this is the only way to retrieve
them. From release 0.21.0, counters are available via the JobCounter and TaskCounter enums in the
org.apache.hadoop.mapreduce package.
IntWritable keys represent the temperature (and sort correctly), and whose Text values
are the lines of data.
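Why Text keys fail here can be seen by sorting the same temperatures both ways (plain Java, no Hadoop types, standing in for Text's lexicographic order and IntWritable's numeric order):

```java
import java.util.Arrays;

public class TemperatureSortOrder {
    // Text-style (lexicographic) ordering of the raw temperature strings
    static String[] textOrder(String[] temps) {
        String[] sorted = temps.clone();
        Arrays.sort(sorted);
        return sorted;
    }

    // IntWritable-style (numeric) ordering of the same temperatures
    static int[] numericOrder(int[] temps) {
        int[] sorted = temps.clone();
        Arrays.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        // Lexicographically, "-11" < "-5" and "10" < "2": wrong for numbers
        System.out.println(Arrays.toString(
            textOrder(new String[] { "-11", "2", "10", "-5" })));
        // Numerically: -11 < -5 < 2 < 10, which is what we want
        System.out.println(Arrays.toString(
            numericOrder(new int[] { -11, 2, 10, -5 })));
    }
}
```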
The MapReduce job in Example 8-3 is a map-only job that also filters the input to
remove records that don't have a valid temperature reading. Each map creates a single
block-compressed sequence file as output. It is invoked with the following command:
% hadoop jar hadoop-examples.jar SortDataPreprocessor input/ncdc/all \
Example 8-3. A MapReduce program for transforming the weather data into SequenceFile format
public class SortDataPreprocessor extends Configured implements Tool {
  
  static class CleanerMapper
    extends Mapper<LongWritable, Text, IntWritable, Text> {
  
    private NcdcRecordParser parser = new NcdcRecordParser();
    
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      
      parser.parse(value);
      if (parser.isValidTemperature()) {
        context.write(new IntWritable(parser.getAirTemperature()), value);
      }
    }
  }
  
  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }

    job.setMapperClass(CleanerMapper.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job,
        CompressionType.BLOCK);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
2. One commonly used workaround for this problem, particularly in text-based Streaming applications,
is to add an offset to eliminate all negative numbers and left pad with zeros, so all numbers are the same
number of characters. However, see "Streaming" on page 280 for another approach.
    int exitCode = ToolRunner.run(new SortDataPreprocessor(), args);
    System.exit(exitCode);
  }
}
Partial Sort
In "The Default MapReduce Job" on page 226, we saw that, by default, MapReduce
will sort input records by their keys. Example 8-4 is a variation for sorting sequence
files with IntWritable keys.
Example 8-4. A MapReduce program for sorting a SequenceFile with IntWritable keys using the
default HashPartitioner
public class SortByTemperatureUsingHashPartitioner extends Configured
    implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new SortByTemperatureUsingHashPartitioner(),
        args);
    System.exit(exitCode);
  }
}
Controlling Sort Order
The sort order for keys is controlled by a RawComparator, which is found as follows:
1. If the property mapred.output.key.comparator.class is set, either explicitly or by
calling setSortComparatorClass() on Job, then an instance of that class is used. (In
the old API the equivalent method is setOutputKeyComparatorClass() on JobConf.)
2. Otherwise, keys must be a subclass of WritableComparable, and the registered
comparator for the key class is used.
3. If there is no registered comparator, then a RawComparator is used that deserializes
the byte streams being compared into objects and delegates to the WritableComparable's
compareTo() method.
These rules reinforce why it's important to register optimized versions of
RawComparators for your own custom Writable classes (which is covered in "Implementing a
RawComparator for speed" on page 108), and also that it's straightforward to override
the sort order by setting your own comparator (we do this in "Secondary
Sort" on page 276).
Suppose we run this program using 30 reducers:
% hadoop jar hadoop-examples.jar SortByTemperatureUsingHashPartitioner \
-D mapred.reduce.tasks=30 input/ncdc/all-seq output-hashsort
This command produces 30 output files, each of which is sorted. However, there is no
easy way to combine the files (by concatenation, for example, in the case of plain-text
files) to produce a globally sorted file. For many applications, this doesn't matter. For
example, having a partially sorted set of files is fine if you want to do lookups.
An application: Partitioned MapFile lookups
To perform lookups by key, for instance, having multiple files works well. If we change
the output format to be a MapFileOutputFormat, as shown in Example 8-5, then the
output is 30 map files, which we can perform lookups against.
Example 8-5. A MapReduce program for sorting a SequenceFile and producing MapFiles as output
public class SortByTemperatureToMapFile extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputFormatClass(MapFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    return job.waitForCompletion(true) ? 0 : 1;
  }

3. See "Sorting and merging SequenceFiles" on page 138 for how to do the same thing using the sort
program example that comes with Hadoop.
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new SortByTemperatureToMapFile(), args);
    System.exit(exitCode);
  }
}
MapFileOutputFormat provides a pair of convenience static methods for performing
lookups against MapReduce output; their use is shown in Example 8-6.
Example 8-6. Retrieve the first entry with a given key from a collection of MapFiles
public class LookupRecordByTemperature extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      JobBuilder.printUsage(this, "<path> <key>");
      return -1;
    }
    Path path = new Path(args[0]);
    IntWritable key = new IntWritable(Integer.parseInt(args[1]));

    Reader[] readers = MapFileOutputFormat.getReaders(path, getConf());
    Partitioner<IntWritable, Text> partitioner =
        new HashPartitioner<IntWritable, Text>();
    Text val = new Text();

    Writable entry =
        MapFileOutputFormat.getEntry(readers, partitioner, key, val);
    if (entry == null) {
      System.err.println("Key not found: " + key);
      return -1;
    }
    NcdcRecordParser parser = new NcdcRecordParser();
    parser.parse(val.toString());
    System.out.printf("%s\t%s\n", parser.getStationId(), parser.getYear());
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new LookupRecordByTemperature(), args);
    System.exit(exitCode);
  }
}
The getReaders() method opens a MapFile.Reader for each of the output files created
by the MapReduce job. The getEntry() method then uses the partitioner to choose the
reader for the key and finds the value for that key by calling Reader's get() method. If
getEntry() returns null, it means no matching key was found. Otherwise, it returns
the value, which we translate into a station ID and year.
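The mechanics of these partitioned lookups can be sketched outside Hadoop. In this
illustrative Python sketch (the function names are made up for the example, not
Hadoop APIs), records are hash-partitioned into sorted in-memory "map files", and a
lookup reuses the same partitioner to pick the file for a key:

```python
import bisect

def hash_partition(key, num_partitions):
    # Mimics HashPartitioner: non-negative hash modulo the partition count
    return (hash(key) & 0x7FFFFFFF) % num_partitions

def build_partitions(records, num_partitions):
    # Each partition holds its (key, value) pairs sorted by key,
    # as a MapFile does
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash_partition(key, num_partitions)].append((key, value))
    return [sorted(p) for p in parts]

def get_entry(parts, key):
    # Choose the file with the same partitioner, then binary-search it
    part = parts[hash_partition(key, len(parts))]
    keys = [k for k, _ in part]
    i = bisect.bisect_left(keys, key)
    return part[i][1] if i < len(keys) and keys[i] == key else None

# Temperatures (in tenths of a degree) mapped to their data lines
records = [(-100, "line-a"), (250, "line-b"), (-100, "line-c"), (30, "line-d")]
parts = build_partitions(records, 3)
print(get_entry(parts, 250))   # line-b
print(get_entry(parts, 999))   # None
```

As with getEntry(), a miss returns None, and a key with several entries yields the
first one in its sorted partition.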
To see this in action, let's find the first entry for a temperature of –10°C (remember
that temperatures are stored as integers representing tenths of a degree, which is why
we ask for a temperature of –100):
% hadoop jar hadoop-examples.jar LookupRecordByTemperature output-hashmapsort -100
357460-99999 1956
We can also use the readers directly, in order to get all the records for a given key. The
array of readers that is returned is ordered by partition, so that the reader for a given
key may be found using the same partitioner that was used in the MapReduce job:
Example 8-7. Retrieve all entries with a given key from a collection of MapFiles
public class LookupRecordsByTemperature extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      JobBuilder.printUsage(this, "<path> <key>");
      return -1;
    }
    Path path = new Path(args[0]);
    IntWritable key = new IntWritable(Integer.parseInt(args[1]));

    Reader[] readers = MapFileOutputFormat.getReaders(path, getConf());
    Partitioner<IntWritable, Text> partitioner =
        new HashPartitioner<IntWritable, Text>();
    Text val = new Text();

    Reader reader = readers[partitioner.getPartition(key, val, readers.length)];
    Writable entry = reader.get(key, val);
    if (entry == null) {
      System.err.println("Key not found: " + key);
      return -1;
    }
    NcdcRecordParser parser = new NcdcRecordParser();
    IntWritable nextKey = new IntWritable();
    do {
      parser.parse(val.toString());
      System.out.printf("%s\t%s\n", parser.getStationId(), parser.getYear());
    } while (reader.next(nextKey, val) && key.equals(nextKey));
    return 0;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new LookupRecordsByTemperature(), args);
    System.exit(exitCode);
  }
}
And here is a sample run to retrieve all readings of –10°C and count them:
% hadoop jar hadoop-examples.jar LookupRecordsByTemperature output-hashmapsort -100 \
2> /dev/null | wc -l
Total Sort
How can you produce a globally sorted file using Hadoop? The naive answer is to use
a single partition. But this is incredibly inefficient for large files, since one machine has
to process all of the output, so you are throwing away the benefits of the parallel
architecture that MapReduce provides.
Instead, it is possible to produce a set of sorted files that, if concatenated, would form
a globally sorted file. The secret to doing this is to use a partitioner that respects the
total order of the output. For example, if we had four partitions, we could put keys for
temperatures less than –10°C in the first partition, those between –10°C and 0°C in the
second, those between 0°C and 10°C in the third, and those over 10°C in the fourth.
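The partitioning rule just described can be written down in a few lines. This sketch
(plain Python, not Hadoop code) maps a temperature, in tenths of a degree as in the
dataset, to one of the four range partitions:

```python
import bisect

def total_order_partition(key, boundaries):
    # boundaries split the key space into len(boundaries) + 1 ranges;
    # bisect finds the range the key falls into, so every key in
    # partition i is less than every key in partition i + 1
    return bisect.bisect_right(boundaries, key)

boundaries = [-100, 0, 100]   # -10C, 0C, and 10C, in tenths of a degree
print(total_order_partition(-150, boundaries))  # 0: below -10C
print(total_order_partition(50, boundaries))    # 2: in [0C, 10C)
print(total_order_partition(100, boundaries))   # 3: 10C and above
```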
Although this approach works, you have to choose your partition sizes carefully to
ensure that they are fairly even so that job times aren't dominated by a single reducer.
For the partitioning scheme just described, the relative sizes of the partitions are as
follows:
Temperature range      < –10°C   [–10°C, 0°C)   [0°C, 10°C)   >= 10°C
Proportion of records  11%       13%            17%           59%
These partitions are not very even. To construct more even partitions, we need to have
a better understanding of the temperature distribution for the whole dataset. It's fairly
easy to write a MapReduce job to count the number of records that fall into a collection
of temperature buckets. For example, Figure 8-1 shows the distribution for buckets of
size 1°C, where each point on the plot corresponds to one bucket.
4. A better answer is to use Pig ("Sorting Data" on page 405) or Hive ("Sorting and
Aggregating" on page 441), both of which can sort with a single command.
Figure 8-1. Temperature distribution for the weather dataset
While we could use this information to construct a very even set of partitions, the fact
that we needed to run a job that used the entire dataset to construct them is not ideal.
It's possible to get a fairly even set of partitions by sampling the key space. The idea
behind sampling is that you look at a small subset of the keys to approximate the key
distribution, which is then used to construct partitions. Luckily, we don't have to write
the code to do this ourselves, as Hadoop comes with a selection of samplers.
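The idea is easy to demonstrate in miniature. This Python sketch (illustrative only)
samples a key list and reads off evenly spaced quantiles of the sorted sample as
partition boundaries, which is the same idea InputSampler's writePartitionFile()
applies to a job's input:

```python
import random

def sample_boundaries(keys, num_partitions, sample_size, seed=42):
    # Sort a random sample of the keys and take evenly spaced quantiles
    # of the sample as the partition boundaries
    rnd = random.Random(seed)
    sample = sorted(rnd.sample(keys, min(sample_size, len(keys))))
    step = len(sample) / num_partitions
    return [sample[int(step * i)] for i in range(1, num_partitions)]

keys = list(range(10000))
boundaries = sample_boundaries(keys, 4, 500)
print(boundaries)  # three boundaries, roughly near 2500, 5000, and 7500
```

Splitting the full key list at these boundaries yields four partitions of roughly
2,500 keys each, without ever sorting the whole dataset.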
The InputSampler class defines a nested Sampler interface whose implementations
return a sample of keys given an InputFormat and Job:
public interface Sampler<K, V> {
  K[] getSample(InputFormat<K, V> inf, Job job)
      throws IOException, InterruptedException;
}
This interface is not usually called directly by clients. Instead, the writePartitionFile()
static method on InputSampler is used, which creates a sequence file to store the
keys that define the partitions:
public static <K, V> void writePartitionFile(Job job, Sampler<K, V> sampler)
throws IOException, ClassNotFoundException, InterruptedException
The sequence file is used by TotalOrderPartitioner to create partitions for the sort job.
Example 8-8 puts it all together.
Example 8-8. A MapReduce program for sorting a SequenceFile with IntWritable keys using the
TotalOrderPartitioner to globally sort the data
public class SortByTemperatureUsingTotalOrderPartitioner extends Configured
    implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);

    job.setPartitionerClass(TotalOrderPartitioner.class);

    InputSampler.Sampler<IntWritable, Text> sampler =
        new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10);

    Path input = FileInputFormat.getInputPaths(job)[0];
    input = input.makeQualified(input.getFileSystem(getConf()));

    Path partitionFile = new Path(input, "_partitions");
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
    InputSampler.writePartitionFile(job, sampler);

    // Add to DistributedCache
    URI partitionUri = new URI(partitionFile.toString() + "#_partitions");
    DistributedCache.addCacheFile(partitionUri, job.getConfiguration());
    DistributedCache.createSymlink(job.getConfiguration());

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(
        new SortByTemperatureUsingTotalOrderPartitioner(), args);
    System.exit(exitCode);
  }
}
We use a RandomSampler, which chooses keys with a uniform probability (here, 0.1).
There are also parameters for the maximum number of samples to take and the
maximum number of splits to sample (here, 10,000 and 10, respectively; these settings are
the defaults when InputSampler is run as an application), and the sampler stops when
the first of these limits is met. Samplers run on the client, making it important to limit
the number of splits that are downloaded, so the sampler runs quickly. In practice, the
time taken to run the sampler is a small fraction of the overall job time.
The partition file that InputSampler writes is called _partitions, which we have set to
be in the input directory (it will not be picked up as an input file since it starts with an
underscore). To share the partition file with the tasks running on the cluster, we add
it to the distributed cache (see "Distributed Cache" on page 288).
On one run, the sampler chose –5.6°C, 13.9°C, and 22.0°C as partition boundaries (for
four partitions), which translates into more even partition sizes than the earlier choice
of partitions:
Temperature range      < –5.6°C   [–5.6°C, 13.9°C)   [13.9°C, 22.0°C)   >= 22.0°C
Proportion of records  29%        24%                23%                24%
Your input data determines the best sampler for you to use. For example, SplitSampler,
which samples only the first n records in a split, is not so good for sorted data
because it doesn't select keys from throughout the split.
On the other hand, IntervalSampler chooses keys at regular intervals through the split
and makes a better choice for sorted data. RandomSampler is a good general-purpose
sampler. If none of these suits your application (and remember that the point of
sampling is to produce partitions that are approximately equal in size), you can write your
own implementation of the Sampler interface.
One of the nice properties of InputSampler and TotalOrderPartitioner is that you are
free to choose the number of partitions. This choice is normally driven by the number
of reducer slots in your cluster (choose a number slightly fewer than the total, to allow
for failures). However, TotalOrderPartitioner will work only if the partition
boundaries are distinct: one problem with choosing a high number is that you may get
collisions if you have a small key space.
Here's how we run it:
% hadoop jar hadoop-examples.jar SortByTemperatureUsingTotalOrderPartitioner \
-D mapred.reduce.tasks=30 input/ncdc/all-seq output-totalsort
The program produces 30 output partitions, each of which is internally sorted; in
addition, for these partitions, all the keys in partition i are less than the keys in
partition i + 1.
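The property claimed here, each partition internally sorted and partition key ranges
non-overlapping, is exactly what makes concatenation a global sort, and it is cheap to
check. A small Python sketch:

```python
from itertools import chain

def is_totally_sorted(partitions):
    # Every partition must be internally sorted...
    for p in partitions:
        if any(a > b for a, b in zip(p, p[1:])):
            return False
    # ...and the last key of partition i must not exceed the first
    # key of partition i + 1
    for left, right in zip(partitions, partitions[1:]):
        if left and right and left[-1] > right[0]:
            return False
    return True

parts = [[1, 3, 5], [6, 6, 9], [12, 20]]
print(is_totally_sorted(parts))              # True
print(is_totally_sorted([[1, 9], [5, 12]]))  # False: key ranges overlap
print(list(chain(*parts)) == sorted(chain(*parts)))  # concatenation is sorted
```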
5. In some applications, it's common for some of the input to already be sorted, or at least partially
sorted. For example, the weather dataset is ordered by time, which may introduce certain biases,
making the RandomSampler a safer choice.
Secondary Sort
The MapReduce framework sorts the records by key before they reach the reducers.
For any particular key, however, the values are not sorted. The order in which the
values appear is not even stable from one run to the next, since they come from different
map tasks, which may finish at different times from run to run. Generally speaking, most
MapReduce programs are written so as not to depend on the order in which the values
appear to the reduce function. However, it is possible to impose an order on the values
by sorting and grouping the keys in a particular way.
To illustrate the idea, consider the MapReduce program for calculating the maximum
temperature for each year. If we arranged for the values (temperatures) to be sorted in
descending order, we wouldn't have to iterate through them to find the maximum;
we could take the first for each year and ignore the rest. (This approach isn't the most
efficient way to solve this particular problem, but it illustrates how secondary sort
works in general.)
To achieve this, we change our keys to be composite: a combination of year and
temperature. We want the sort order for keys to be by year (ascending) and then by
temperature (descending):
1900 35°C
1900 34°C
1900 34°C
1901 36°C
1901 35°C
If all we did was change the key, then this wouldn't help, since now records for the same
year would not (in general) go to the same reducer, since they have different keys. For
example, (1900, 35°C) and (1900, 34°C) could go to different reducers. By setting a
partitioner to partition by the year part of the key, we can guarantee that records for
the same year go to the same reducer. This still isn't enough to achieve our goal,
however. A partitioner ensures only that one reducer receives all the records for a year;
it doesn't change the fact that the reducer groups by key within the partition:
The final piece of the puzzle is the setting to control the grouping. If we group values
in the reducer by the year part of the key, then we will see all the records for the same
year in one reduce group. And since they are sorted by temperature in descending
order, the first is the maximum temperature:
To summarize, there is a recipe here to get the effect of sorting by value:
• Make the key a composite of the natural key and the natural value.
• The sort comparator should order by the composite key, that is, the natural key
and natural value.
• The partitioner and grouping comparator for the composite key should consider
only the natural key for partitioning and grouping.
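Stripped of the Hadoop machinery, the recipe amounts to this plain-Python
illustration (not MapReduce code): the sort key is the full (year, temperature)
composite, with the temperature negated for descending order, while grouping looks
only at the year:

```python
from itertools import groupby

records = [(1900, 35), (1901, 36), (1900, 34), (1901, 35), (1900, 34)]

# Sort comparator: order by the composite key, i.e. natural key
# ascending and natural value descending
sorted_records = sorted(records, key=lambda kv: (kv[0], -kv[1]))

# Grouping comparator: consider only the natural key (the year); the
# first record of each group then carries that year's maximum value
maxima = [next(group) for year, group in
          groupby(sorted_records, key=lambda kv: kv[0])]
print(maxima)  # [(1900, 35), (1901, 36)]
```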
Java code
Putting this all together results in the code in Example 8-9. This program uses the
plain-text input again.
Example 8-9. Application to find the maximum temperature by sorting temperatures in the key
public class MaxTemperatureUsingSecondarySort
    extends Configured implements Tool {

  static class MaxTemperatureMapper
      extends Mapper<LongWritable, Text, IntPair, NullWritable> {

    private NcdcRecordParser parser = new NcdcRecordParser();

    @Override
    protected void map(LongWritable key, Text value,
        Context context) throws IOException, InterruptedException {

      parser.parse(value);
      if (parser.isValidTemperature()) {
        context.write(new IntPair(parser.getYearInt(),
            parser.getAirTemperature()), NullWritable.get());
      }
    }
  }

  static class MaxTemperatureReducer
      extends Reducer<IntPair, NullWritable, IntPair, NullWritable> {

    @Override
    protected void reduce(IntPair key, Iterable<NullWritable> values,
        Context context) throws IOException, InterruptedException {

      context.write(key, NullWritable.get());
    }
  }

  public static class FirstPartitioner
      extends Partitioner<IntPair, NullWritable> {

    @Override
    public int getPartition(IntPair key, NullWritable value, int numPartitions) {
      // multiply by 127 to perform some mixing
      return Math.abs(key.getFirst() * 127) % numPartitions;
    }
  }

  public static class KeyComparator extends WritableComparator {
    protected KeyComparator() {
      super(IntPair.class, true);
    }
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
      IntPair ip1 = (IntPair) w1;
      IntPair ip2 = (IntPair) w2;
      int cmp = IntPair.compare(ip1.getFirst(), ip2.getFirst());
      if (cmp != 0) {
        return cmp;
      }
      return -IntPair.compare(ip1.getSecond(), ip2.getSecond()); //reverse
    }
  }

  public static class GroupComparator extends WritableComparator {
    protected GroupComparator() {
      super(IntPair.class, true);
    }
    @Override
    public int compare(WritableComparable w1, WritableComparable w2) {
      IntPair ip1 = (IntPair) w1;
      IntPair ip2 = (IntPair) w2;
      return IntPair.compare(ip1.getFirst(), ip2.getFirst());
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setPartitionerClass(FirstPartitioner.class);
    job.setSortComparatorClass(KeyComparator.class);
    job.setGroupingComparatorClass(GroupComparator.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(IntPair.class);
    job.setOutputValueClass(NullWritable.class);

    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MaxTemperatureUsingSecondarySort(), args);
    System.exit(exitCode);
  }
}
In the mapper, we create a key representing the year and temperature, using an IntPair
Writable implementation. (IntPair is like the TextPair class we developed in "Implementing
a Custom Writable" on page 105.) We don't need to carry any information in
the value, since we can get the first (maximum) temperature in the reducer from the
key, so we use a NullWritable. The reducer emits the first key, which, due to the
secondary sorting, is an IntPair for the year and its maximum temperature. IntPair's
toString() method creates a tab-separated string, so the output is a set of tab-separated
year-temperature pairs.
Many applications need to access all the sorted values, not just the first
value as we have provided here. To do this, you need to populate the
value fields, since in the reducer you can retrieve only the first key. This
necessitates some unavoidable duplication of information between key
and value.
We set the partitioner to partition by the first field of the key (the year), using a custom
partitioner called FirstPartitioner. To sort keys by year (ascending) and temperature
(descending), we use a custom sort comparator, using setSortComparatorClass(), that
extracts the fields and performs the appropriate comparisons. Similarly, to group keys
by year, we set a custom comparator, using setGroupingComparatorClass(), to extract
the first field of the key for comparison.
Running this program gives the maximum temperatures for each year:
% hadoop jar hadoop-examples.jar MaxTemperatureUsingSecondarySort input/ncdc/all \
> output-secondarysort
% hadoop fs -cat output-secondarysort/part-* | sort | head
1901 317
1902 244
1903 289
1904 256
1905 283
1906 294
1907 283
1908 289
1909 278
1910 294
6. For simplicity, these custom comparators as shown are not optimized; see "Implementing a
RawComparator for speed" on page 108 for the steps we would need to take to make them faster.
Streaming
To do a secondary sort in Streaming, we can take advantage of a couple of library
classes that Hadoop provides. Here's the driver that we can use to do a secondary sort:
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-D stream.num.map.output.key.fields=2 \
-D mapred.text.key.partitioner.options=-k1,1 \
-D mapred.output.key.comparator.class=\
org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
-D mapred.text.key.comparator.options="-k1n -k2nr" \
-input input/ncdc/all \
-output output_secondarysort_streaming \
-mapper ch08/src/main/python/secondary_sort_map.py \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-reducer ch08/src/main/python/secondary_sort_reduce.py \
-file ch08/src/main/python/secondary_sort_map.py \
-file ch08/src/main/python/secondary_sort_reduce.py

Our map function (Example 8-10) emits records with year and temperature fields. We
want to treat the combination of both of these fields as the key, so we set
stream.num.map.output.key.fields to 2. This means that values will be empty, just like
in the Java case.
Example 8-10. Map function for secondary sort in Python
#!/usr/bin/env python

import re
import sys

for line in sys.stdin:
  val = line.strip()
  (year, temp, q) = (val[15:19], int(val[87:92]), val[92:93])
  if temp == 9999:
    pass # skip records with a missing temperature reading
  elif re.match("[01459]", q):
    print "%s\t%s" % (year, temp)
However, we don't want to partition by the entire key, so we use the
KeyFieldBasedPartitioner partitioner, which allows us to partition by a part of the key. The
specification mapred.text.key.partitioner.options configures the partitioner. The value
-k1,1 instructs the partitioner to use only the first field of the key, where fields are
assumed to be separated by a string defined by the map.output.key.field.separator
property (a tab character by default).
Next, we want a comparator that sorts the year field in ascending order and the
temperature field in descending order, so that the reduce function can simply return the
first record in each group. Hadoop provides KeyFieldBasedComparator, which is ideal
for this purpose. The comparison order is defined by a specification that is like the one
used for GNU sort. It is set using the mapred.text.key.comparator.options property.
The value -k1n -k2nr used in this example means "sort by the first field in numerical
order, then by the second field in reverse numerical order." Like its partitioner cousin,
KeyFieldBasedPartitioner, it uses the separator defined by the
map.output.key.field.separator property to split a key into fields.
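The effect of the -k1n -k2nr specification can be mimicked with an ordinary sort key
(a sketch of the ordering it defines, not the KeyFieldBasedComparator implementation):

```python
lines = ["1901\t36", "1900\t35", "1900\t34", "1901\t35"]

def k1n_k2nr(line):
    # -k1n -k2nr: first field numeric ascending,
    # second field numeric descending
    year, temp = line.split("\t")
    return (int(year), -int(temp))

print(sorted(lines, key=k1n_k2nr))
# ['1900\t35', '1900\t34', '1901\t36', '1901\t35']
```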
In the Java version, we had to set the grouping comparator; however, in Streaming,
groups are not demarcated in any way, so in the reduce function we have to detect the
group boundaries ourselves by looking for when the year changes (Example 8-11).
Example 8-11. Reducer function for secondary sort in Python
#!/usr/bin/env python

import sys

last_group = None
for line in sys.stdin:
  val = line.strip()
  (year, temp) = val.split("\t")
  group = year
  if last_group != group:
    print val
    last_group = group
When we run the streaming program, we get the same output as the Java version.
Finally, note that KeyFieldBasedPartitioner and KeyFieldBasedComparator are not
confined to use in Streaming programs; they are applicable to Java MapReduce
programs, too.
Joins
MapReduce can perform joins between large datasets, but writing the code to do joins
from scratch is fairly involved. Rather than writing MapReduce programs, you might
consider using a higher-level framework such as Pig, Hive, or Cascading, in which join
operations are a core part of the implementation.
Let's briefly consider the problem we are trying to solve. We have two datasets (for
example, the weather stations database and the weather records) and we want to
reconcile the two. For example, we want to see each station's history, with the station's
metadata inlined in each output row. This is illustrated in Figure 8-2.
How we implement the join depends on how large the datasets are and how they are
partitioned. If one dataset is large (the weather records) but the other one is small
enough to be distributed to each node in the cluster (as the station metadata is), then
the join can be effected by a MapReduce job that brings the records for each station
together (a partial sort on station ID, for example). The mapper or reducer uses the
smaller dataset to look up the station metadata for a station ID, so it can be written out
with each record. See "Side Data Distribution" on page 287 for a discussion of this
approach, where we focus on the mechanics of distributing the data to tasktrackers.
If the join is performed by the mapper, it is called a map-side join, whereas if it is
performed by the reducer it is called a reduce-side join.
If both datasets are too large for either to be copied to each node in the cluster, then
we can still join them using MapReduce with a map-side or reduce-side join, depending
on how the data is structured. One common example of this case is a user database and
a log of some user activity (such as access logs). For a popular service, it is not feasible
to distribute the user database (or the logs) to all the MapReduce nodes.
Map-Side Joins
A map-side join between large inputs works by performing the join before the data
reaches the map function. For this to work, though, the inputs to each map must be
partitioned and sorted in a particular way. Each input dataset must be divided into the
same number of partitions, and it must be sorted by the same key (the join key) in each
source. All the records for a particular key must reside in the same partition. This may
sound like a strict requirement (and it is), but it actually fits the description of the
output of a MapReduce job.
Figure 8-2. Inner join of two datasets
A map-side join can be used to join the outputs of several jobs that had the same
number of reducers, the same keys, and output files that are not splittable (by being smaller
than an HDFS block, or by virtue of being gzip compressed, for example). In the context
of the weather example, if we ran a partial sort on the stations file by station ID, and
another, identical sort on the records, again by station ID, and with the same number
of reducers, then the two outputs would satisfy the conditions for running a map-side
join.
Use a CompositeInputFormat from the org.apache.hadoop.mapreduce.join package to
run a map-side join. The input sources and join type (inner or outer) for
CompositeInputFormat are configured through a join expression that is written according to a
simple grammar. The package documentation has details and examples.
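At its core, a map-side join over one pair of matching partitions is a merge of two
inputs that are already sorted on the join key. The following Python sketch shows that
merge (it is not the CompositeInputFormat API, and the reading values are placeholder
strings):

```python
def merge_join(left, right):
    # Both inputs must be sorted by join key; advance two cursors and
    # emit a joined record for every right-side match, which is what a
    # map-side join does within each pair of matching partitions
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        (lk, lv), (rk, rv) = left[i], right[j]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            out.append((lk, lv, rv))
            j += 1  # keep the left record while its matches last
    return out

stations = [("011990-99999", "SIHCCAJAVRI"), ("012650-99999", "TYNSET-HANSMOEN")]
readings = [("011990-99999", "0067..."), ("011990-99999", "0043..."),
            ("012650-99999", "0043...")]
print(merge_join(stations, readings))
```

This sketch assumes the left-hand keys (the station IDs) are unique, which holds for
the station metadata.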
The org.apache.hadoop.examples.Join example is a general-purpose command-line
program for running a map-side join, since it allows you to run a MapReduce job for
any specified mapper and reducer over multiple inputs that are joined with a given join
expression.
Reduce-Side Joins
A reduce-side join is more general than a map-side join, in that the input datasets don't
have to be structured in any particular way, but it is less efficient, as both datasets have
to go through the MapReduce shuffle. The basic idea is that the mapper tags each record
with its source and uses the join key as the map output key, so that the records with
the same key are brought together in the reducer. We use several ingredients to make
this work in practice:
Multiple inputs
The input sources for the datasets have different formats, in general, so it is very
convenient to use the MultipleInputs class (see "Multiple Inputs" on page 248) to
separate the logic for parsing and tagging each source.
Secondary sort
As described, the reducer will see the records from both sources that have the same
key, but they are not guaranteed to be in any particular order. However, to perform
the join, it is important to have the data from one source before another. For the
weather data join, the station record must be the first of the values seen for each
key, so the reducer can fill in the weather records with the station name and emit
them straightaway. Of course, it would be possible to receive the records in any
order if we buffered them in memory, but this should be avoided, since the number
of records in any group may be very large and exceed the amount of memory
available to the reducer.
We saw in "Secondary Sort" on page 276 how to impose an order on the values
for each key that the reducers see, so we use this technique here.
To tag each record, we use TextPair from Chapter 4 for the keys (to store the station
ID) and the tag. The only requirement for the tag values is that they sort in such a way
that the station records come before the weather records. This can be achieved by
tagging station records as 0 and weather records as 1. The mapper classes to do this are
shown in Examples 8-12 and 8-13.
Example 8-12. Mapper for tagging station records for a reduce-side join
public class JoinStationMapper
    extends Mapper<LongWritable, Text, TextPair, Text> {
  private NcdcStationMetadataParser parser = new NcdcStationMetadataParser();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (parser.parse(value)) {
      context.write(new TextPair(parser.getStationId(), "0"),
          new Text(parser.getStationName()));
    }
  }
}
7. The data_join package in the contrib directory implements reduce-side joins by buffering records in
memory, so it suffers from this limitation.
Example 8-13. Mapper for tagging weather records for a reduce-side join
public class JoinRecordMapper
    extends Mapper<LongWritable, Text, TextPair, Text> {
  private NcdcRecordParser parser = new NcdcRecordParser();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    parser.parse(value);
    context.write(new TextPair(parser.getStationId(), "1"), value);
  }
}
The reducer knows that it will receive the station record first, so it extracts its name
from the value and writes it out as a part of every output record (Example 8-14).
Example 8-14. Reducer for joining tagged station records with tagged weather records
public class JoinReducer extends Reducer<TextPair, Text, Text, Text> {

  @Override
  protected void reduce(TextPair key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    Iterator<Text> iter = values.iterator();
    Text stationName = new Text(iter.next());
    while (iter.hasNext()) {
      Text record = iter.next();
      Text outValue = new Text(stationName.toString() + "\t" + record.toString());
      context.write(key.getFirst(), outValue);
    }
  }
}
The code assumes that every station ID in the weather records has exactly one matching
record in the station dataset. If this were not the case, we would need to generalize the
code to put the tag into the value objects, by using another TextPair. The reduce()
method would then be able to tell which entries were station names and detect (and
handle) missing or duplicate entries, before processing the weather records.
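The whole tag-and-join flow can be simulated in a few lines of plain Python (again an
illustration, not Hadoop code; the reading values "rec1" and "rec2" are placeholders):

```python
from itertools import groupby

stations = [("011990-99999", "SIHCCAJAVRI")]
readings = [("011990-99999", "rec1"), ("011990-99999", "rec2")]

# Map phase: tag station records "0" and weather records "1" so that,
# within each join key, the station record sorts first
tagged = ([((sid, "0"), name) for sid, name in stations] +
          [((sid, "1"), rec) for sid, rec in readings])

# Shuffle: sort by the composite (station ID, tag) key, then group by
# station ID only, as KeyPartitioner and FirstComparator arrange
tagged.sort()
output = []
for sid, group in groupby(tagged, key=lambda kv: kv[0][0]):
    values = [v for _, v in group]
    station_name = values[0]  # the tag guarantees this is the station record
    for rec in values[1:]:
        output.append((sid, station_name + "\t" + rec))
print(output)
```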
Joins | 285
Because objects in the reducer's values iterator are reused (for efficiency
purposes), it is vital that the code makes a copy of the first Text object
from the values iterator:
Text stationName = new Text(iter.next());
If the copy is not made, then the stationName reference will refer to the
value just read when it is turned into a string, which is a bug.
Tying the job together is the driver class, shown in Example 8-15. The essential point is that we partition and group on the first part of the key, the station ID, which we do with a custom Partitioner (KeyPartitioner) and a custom group comparator, FirstComparator (from TextPair).
Example 8-15. Application to join weather records with station names
public class JoinRecordWithStationName extends Configured implements Tool {
  
  public static class KeyPartitioner extends Partitioner<TextPair, Text> {
    @Override
    public int getPartition(TextPair key, Text value, int numPartitions) {
      return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }
  
  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 3) {
      JobBuilder.printUsage(this, "<ncdc input> <station input> <output>");
      return -1;
    }
    
    Job job = new Job(getConf(), "Join weather records with station names");
    job.setJarByClass(getClass());
    
    Path ncdcInputPath = new Path(args[0]);
    Path stationInputPath = new Path(args[1]);
    Path outputPath = new Path(args[2]);
    
    MultipleInputs.addInputPath(job, ncdcInputPath,
        TextInputFormat.class, JoinRecordMapper.class);
    MultipleInputs.addInputPath(job, stationInputPath,
        TextInputFormat.class, JoinStationMapper.class);
    FileOutputFormat.setOutputPath(job, outputPath);
    
    job.setPartitionerClass(KeyPartitioner.class);
    job.setGroupingComparatorClass(TextPair.FirstComparator.class);
    
    job.setMapOutputKeyClass(TextPair.class);
    
    job.setReducerClass(JoinReducer.class);
    
    job.setOutputKeyClass(Text.class);
    
    return job.waitForCompletion(true) ? 0 : 1;
  }
  
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new JoinRecordWithStationName(), args);
    System.exit(exitCode);
  }
}
Running the program on the sample data yields the following output:
011990-99999 SIHCCAJAVRI 0067011990999991950051507004+68750...
011990-99999 SIHCCAJAVRI 0043011990999991950051512004+68750...
011990-99999 SIHCCAJAVRI 0043011990999991950051518004+68750...
012650-99999 TYNSET-HANSMOEN 0043012650999991949032412004+62300...
012650-99999 TYNSET-HANSMOEN 0043012650999991949032418004+62300...
Side Data Distribution
Side data can be defined as extra read-only data needed by a job to process the main dataset. The challenge is to make side data available to all the map or reduce tasks (which are spread across the cluster) in a convenient and efficient fashion.
In addition to the distribution mechanisms described in this section, it is possible to cache side data in memory in a static field, so that tasks of the same job that run in succession on the same tasktracker can share the data. "Task JVM Reuse" on page 216 describes how to enable this feature. If you take this approach, be aware of the amount of memory that you are using, as it might affect the memory needed by the shuffle (see "Shuffle and Sort" on page 205).
Using the Job Configuration
You can set arbitrary key-value pairs in the job configuration using the various setter methods on Configuration (or JobConf in the old MapReduce API). This is very useful if you need to pass a small piece of metadata to your tasks.
In the task you can retrieve the data from the configuration returned by Context's getConfiguration() method. (In the old API, it's a little more involved: override the configure() method in the Mapper or Reducer and use a getter method on the JobConf object passed in to retrieve the data. It's very common to store the data in an instance field so it can be used in the map() or reduce() method.)
Usually, a primitive type is sufficient to encode your metadata, but for arbitrary objects you can either handle the serialization yourself (if you have an existing mechanism for turning objects to strings and back), or you can use Hadoop's Stringifier class. DefaultStringifier uses Hadoop's serialization framework to serialize objects (see "Serialization" on page 94).
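If you do handle the serialization yourself, the idea is simply to encode the object as a configuration-safe string on the driver side and decode it back in the task. Here is a Python sketch of the concept (an analogy only; Hadoop's DefaultStringifier uses its own serialization framework, not pickle, and the property name is made up):

```python
import base64
import pickle

def stringify(obj):
    """Encode an arbitrary object as a plain ASCII string (config-safe)."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")

def destringify(s):
    """Decode a string produced by stringify() back into the object."""
    return pickle.loads(base64.b64decode(s.encode("ascii")))

# Driver side: stash a small piece of metadata in the "configuration".
conf = {"station.metadata": stringify({"011990-99999": "SIHCCAJAVRI"})}

# Task side: read it back from the configuration.
stations = destringify(conf["station.metadata"])
print(stations["011990-99999"])
```

The round trip is lossless, which is the only property the mechanism needs; anything beyond a few kilobytes should go through the distributed cache instead, as explained next.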
You shouldn't use this mechanism for transferring more than a few kilobytes of data, because it can put pressure on the memory usage in the Hadoop daemons, particularly in a system running hundreds of jobs. The job configuration is read by the jobtracker, the tasktracker, and the child JVM, and each time the configuration is read, all of its entries are read into memory, even if they are not used. User properties are not used by the jobtracker or the tasktracker, so they just waste time and memory.
Distributed Cache
Rather than serializing side data in the job configuration, it is preferable to distribute datasets using Hadoop's distributed cache mechanism. This provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run. To save network bandwidth, files are normally copied to any particular node once per job.
For tools that use GenericOptionsParser (this includes many of the programs in this book; see "GenericOptionsParser, Tool, and ToolRunner" on page 151), you can specify the files to be distributed as a comma-separated list of URIs as the argument to the -files option. Files can be on the local filesystem, on HDFS, or on another Hadoop-readable filesystem (such as S3). If no scheme is supplied, then the files are assumed to be local. (This is true even if the default filesystem is not the local filesystem.)
You can also copy archive files (JAR files, ZIP files, tar files, and gzipped tar files) to your tasks using the -archives option; these are unarchived on the task node. The -libjars option will add JAR files to the classpath of the mapper and reducer tasks. This is useful if you haven't bundled library JAR files in your job JAR file.
Streaming doesn't use the distributed cache for copying the streaming scripts across the cluster. You specify a file to be copied using the -file option (note the singular), which should be repeated for each file to be copied. Furthermore, files specified using the -file option must be file paths only, not URIs, so they must be accessible from the local filesystem of the client launching the Streaming job.
Streaming also accepts the -files and -archives options for copying files into the distributed cache for use by your Streaming scripts.
Let's see how to use the distributed cache to share a metadata file for station names. The command we will run is:
% hadoop jar hadoop-examples.jar MaxTemperatureByStationNameUsingDistributedCacheFile \
-files input/ncdc/metadata/stations-fixed-width.txt input/ncdc/all output
This command will copy the local file stations-fixed-width.txt (no scheme is supplied, so the path is automatically interpreted as a local file) to the task nodes, so we can use it to look up station names. The listing for MaxTemperatureByStationNameUsingDistributedCacheFile appears in Example 8-16.
Example 8-16. Application to find the maximum temperature by station, showing station names from a lookup table passed as a distributed cache file
public class MaxTemperatureByStationNameUsingDistributedCacheFile
  extends Configured implements Tool {
  
  static class StationTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
    
    private NcdcRecordParser parser = new NcdcRecordParser();
    
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      
      parser.parse(value);
      if (parser.isValidTemperature()) {
        context.write(new Text(parser.getStationId()),
            new IntWritable(parser.getAirTemperature()));
      }
    }
  }
  
  static class MaxTemperatureReducerWithStationLookup
    extends Reducer<Text, IntWritable, Text, IntWritable> {
    
    private NcdcStationMetadata metadata;
    
    @Override
    protected void setup(Context context)
        throws IOException, InterruptedException {
      metadata = new NcdcStationMetadata();
      metadata.initialize(new File("stations-fixed-width.txt"));
    }
    
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      
      String stationName = metadata.getStationName(key.toString());
      
      int maxValue = Integer.MIN_VALUE;
      for (IntWritable value : values) {
        maxValue = Math.max(maxValue, value.get());
      }
      context.write(new Text(stationName), new IntWritable(maxValue));
    }
  }
  
  @Override
  public int run(String[] args) throws Exception {
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }
    
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    
    job.setMapperClass(StationTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducerWithStationLookup.class);
    
    return job.waitForCompletion(true) ? 0 : 1;
  }
  
  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(
        new MaxTemperatureByStationNameUsingDistributedCacheFile(), args);
    System.exit(exitCode);
  }
}
The program finds the maximum temperature by weather station, so the mapper (StationTemperatureMapper) simply emits (station ID, temperature) pairs. For the combiner, we reuse MaxTemperatureReducer (from Chapters 2 and 5) to pick the maximum temperature for any given group of map outputs on the map side. The reducer (MaxTemperatureReducerWithStationLookup) is different from the combiner, since in addition to finding the maximum temperature, it uses the cache file to look up the station name.
We use the reducer's setup() method to retrieve the cache file using its original name, relative to the working directory of the task.
You can use the distributed cache for copying files that do not fit in memory. MapFiles are very useful in this regard, since they serve as an on-disk lookup format (see "MapFile" on page 139). Because MapFiles are a collection of files with a defined directory structure, you should put them into an archive format (JAR, ZIP, tar, or gzipped tar) and add them to the cache using the -archives option.
Here's a snippet of the output, showing some maximum temperatures for a few weather stations:
How it works
When you launch a job, Hadoop copies the files specified by the -files, -archives, and -libjars options to the jobtracker's filesystem (normally HDFS). Then, before a task is run, the tasktracker copies the files from the jobtracker's filesystem to a local disk (the cache) so the task can access the files. The files are said to be localized at this point. From the task's point of view, the files are just there (and it doesn't care that they came from HDFS). In addition, files specified by -libjars are added to the task's classpath before it is launched.
The tasktracker also maintains a reference count for the number of tasks using each file in the cache. Before the task has run, the file's reference count is incremented by one; then after the task has run, the count is decreased by one. Only when the count reaches zero is the file eligible for deletion, since no tasks are using it. Files are deleted to make room for a new file when the cache exceeds a certain size, 10 GB by default. The cache size may be changed by setting the configuration property local.cache.size, which is measured in bytes.
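The bookkeeping described above can be sketched as a toy model (illustrative only; the real tasktracker's logic is more involved):

```python
# Toy model of the tasktracker's cache bookkeeping: a reference count
# per cached file; a file becomes eligible for deletion only when its
# count is zero and the cache is over capacity.
class CacheEntry:
    def __init__(self, size):
        self.size = size
        self.refs = 0

class LocalCache:
    def __init__(self, capacity):
        self.capacity = capacity   # e.g. 10 GB by default in Hadoop
        self.entries = {}

    def task_started(self, name, size):
        entry = self.entries.setdefault(name, CacheEntry(size))
        entry.refs += 1            # incremented before the task runs

    def task_finished(self, name):
        self.entries[name].refs -= 1  # decreased after the task runs

    def evict_if_needed(self):
        total = sum(e.size for e in self.entries.values())
        for name, entry in list(self.entries.items()):
            if total <= self.capacity:
                break
            if entry.refs == 0:    # only unreferenced files may be deleted
                total -= entry.size
                del self.entries[name]

cache = LocalCache(capacity=100)
cache.task_started("stations.txt", 80)   # still in use by a task
cache.task_started("lookup.jar", 40)
cache.task_finished("lookup.jar")        # no longer referenced
cache.evict_if_needed()                  # over capacity: only lookup.jar can go
print(sorted(cache.entries))
```

The key invariant is the one the text states: a file with a nonzero reference count is never deleted, no matter how full the cache is.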
Although this design doesn't guarantee that subsequent tasks from the same job running on the same tasktracker will find the file in the cache, it is very likely that they will, since tasks from a job are usually scheduled to run at around the same time, so there isn't the opportunity for enough other jobs to run and cause the original task's file to be deleted from the cache.
Files are localized under the ${mapred.local.dir}/taskTracker/archive directory on the tasktrackers. Applications don't have to know this, however, since the files are symbolically linked from the task's working directory.
The distributed cache API
Most applications don't need to use the distributed cache API, because they can use the cache via GenericOptionsParser, as we saw in Example 8-16. However, some applications may need to use more advanced features of the distributed cache, and for this they can use its API directly. The API is in two parts: methods for putting data into the cache (found in Job) and methods for retrieving data from the cache (found in JobContext).
Here are the pertinent methods in Job for putting data into the cache:
public void addCacheFile(URI uri)
public void addCacheArchive(URI uri)
public void setCacheFiles(URI[] files)
public void setCacheArchives(URI[] archives)
public void addFileToClassPath(Path file)
public void addArchiveToClassPath(Path archive)
public void createSymlink()
8. If you are using the old MapReduce API, the same methods can be found in
Recall that there are two types of object that can be placed in the cache: files and archives. Files are left intact on the task node, while archives are unarchived on the task node. For each type of object, there are three methods: an addCacheXXXX() method to add the file or archive to the distributed cache, a setCacheXXXXs() method to set the entire list of files or archives to be added to the cache in a single call (replacing those set in any previous calls), and an addXXXXToClassPath() method to add the file or archive to the MapReduce task's classpath. Table 8-7 compares these API methods to the GenericOptionsParser options described in Table 5-1.
Table 8-7. Distributed cache API

Job API method: addCacheFile(URI uri), setCacheFiles(URI[] files)
GenericOptionsParser equivalent: -files
Description: Add files to the distributed cache to be copied to the task node.

Job API method: addCacheArchive(URI uri), setCacheArchives(URI[] files)
GenericOptionsParser equivalent: -archives
Description: Add archives to the distributed cache to be copied to the task node and unarchived there.

Job API method: addFileToClassPath(Path file)
GenericOptionsParser equivalent: -libjars
Description: Add files to the distributed cache to be added to the MapReduce task's classpath. The files are not unarchived, so this is a useful way to add JAR files to the classpath.

Job API method: addArchiveToClassPath(Path archive)
GenericOptionsParser equivalent: none
Description: Add archives to the distributed cache to be unarchived and added to the MapReduce task's classpath. This can be useful when you want to add a directory of files to the classpath, since you can create an archive containing the files, although you can equally well create a JAR file and use addFileToClassPath().
The URIs referenced in the add() or set() methods must be files in a shared filesystem that exist when the job is run. On the other hand, the files specified as a GenericOptionsParser option (e.g., -files) may refer to a local file, in which case they get copied to the default shared filesystem (normally HDFS) on your behalf.
This is the key difference between using the Java API directly and using GenericOptionsParser: the Java API does not copy the file specified in the add() or set() method to the shared filesystem, whereas the GenericOptionsParser does.
The remaining distributed cache API method on Job is createSymlink(), which creates symbolic links for all the files for the current job when they are localized on the task node. The symbolic link name is set by the fragment identifier of the file's URI. For example, the file specified by the URI hdfs://namenode/foo/bar#myfile is symlinked as myfile in the task's working directory. (There's an example of using this API in Example 8-8.) If there is no fragment identifier, then no symbolic link is created. Files added to the distributed cache using GenericOptionsParser are automatically symlinked.
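The fragment-identifier rule is easy to see with a standard URI parser; a quick Python illustration (not Hadoop code):

```python
# The symlink name is the URI's fragment identifier: the part after "#".
from urllib.parse import urlparse

def symlink_name(uri):
    fragment = urlparse(uri).fragment
    return fragment or None   # no fragment means no symlink is created

print(symlink_name("hdfs://namenode/foo/bar#myfile"))
print(symlink_name("hdfs://namenode/foo/bar"))
```

The first URI yields the link name myfile; the second has no fragment, so no symbolic link would be created.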
Symbolic links are not created for files in the distributed cache when using the local job runner, so for this reason you may choose to use the getLocalCacheFiles() and getLocalCacheArchives() methods (discussed below) if you want your jobs to work both locally and on a cluster.
The second part of the distributed cache API is found on JobContext, and it is used from the map or reduce task code when you want to access files from the distributed cache:
public Path[] getLocalCacheFiles() throws IOException;
public Path[] getLocalCacheArchives() throws IOException;
public Path[] getFileClassPaths();
public Path[] getArchiveClassPaths();
If the files from the distributed cache have symbolic links in the task's working directory, then you can access the localized file directly by name, as we did in Example 8-16. It's also possible to get a reference to files and archives in the cache using the getLocalCacheFiles() and getLocalCacheArchives() methods. In the case of archives, the paths returned are to the directory containing the unarchived files. (For completeness, you can also retrieve the files and archives added to the task classpath via getFileClassPaths() and getArchiveClassPaths().)
Note that files are returned as local Path objects. To read the files you can use a Hadoop local FileSystem instance, retrieved using its getLocal() method. Alternatively, you can use the java.io.File API, as shown in this updated setup() method for MaxTemperatureReducerWithStationLookup:
@Override
protected void setup(Context context)
    throws IOException, InterruptedException {
  metadata = new NcdcStationMetadata();
  Path[] localPaths = context.getLocalCacheFiles();
  if (localPaths.length == 0) {
    throw new FileNotFoundException("Distributed cache file not found.");
  }
  File localFile = new File(localPaths[0].toUri());
  metadata.initialize(localFile);
}
MapReduce Library Classes
Hadoop comes with a library of mappers and reducers for commonly used functions. They are listed with brief descriptions in Table 8-8. For further information on how to use them, please consult their Java documentation.
Table 8-8. MapReduce library classes

ChainMapper, ChainReducer
Run a chain of mappers in a single mapper, and a reducer followed by a chain of mappers in a single reducer. (Symbolically: M+RM*, where M is a mapper and R is a reducer.) This can substantially reduce the amount of disk I/O incurred compared to running multiple MapReduce jobs.

FieldSelectionMapReduce (old API); FieldSelectionMapper and FieldSelectionReducer (new API)
A mapper and a reducer that can select fields (like the Unix cut command) from the input keys and values and emit them as output keys and values.

IntSumReducer, LongSumReducer
Reducers that sum integer values to produce a total for every key.

InverseMapper
A mapper that swaps keys and values.

MultithreadedMapRunner (old API); MultithreadedMapper (new API)
A mapper (or map runner in the old API) that runs mappers concurrently in separate threads. Useful for mappers that are not CPU-bound.

TokenCounterMapper
A mapper that tokenizes the input value into words (using Java's StringTokenizer) and emits each word along with a count of one.

RegexMapper
A mapper that finds matches of a regular expression in the input value and emits the matches along with a count of one.
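To get a feel for how small these library classes are, here is a Streaming-style Python sketch of what TokenCounterMapper does (toy code, not the Hadoop class itself): split the input value into words and emit each word with a count of one.

```python
# Sketch of TokenCounterMapper's behavior: tokenize the value on
# whitespace and emit (word, 1) for each token.
def token_counter_map(value):
    return [(word, 1) for word in value.split()]

for key, count in token_counter_map("the quick brown fox"):
    print(f"{key}\t{count}")
```

A downstream sum reducer would then total the counts per word, which is exactly the classic word-count pipeline.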
Setting Up a Hadoop Cluster
This chapter explains how to set up Hadoop to run on a cluster of machines. Running HDFS and MapReduce on a single machine is great for learning about these systems, but to do useful work they need to run on multiple nodes.
There are a few options when it comes to getting a Hadoop cluster, from building your own to running on rented hardware, or using an offering that provides Hadoop as a service in the cloud. This chapter and the next give you enough information to set up and operate your own cluster, but even if you are using a Hadoop service in which a lot of the routine maintenance is done for you, these chapters still offer valuable information about how Hadoop works from an operations point of view.
Cluster Specification
Hadoop is designed to run on commodity hardware. That means that you are not tied to expensive, proprietary offerings from a single vendor; rather, you can choose standardized, commonly available hardware from any of a large range of vendors to build your cluster.
"Commodity" does not mean "low-end." Low-end machines often have cheap components, which have higher failure rates than more expensive (but still commodity-class) machines. When you are operating tens, hundreds, or thousands of machines, cheap components turn out to be a false economy, as the higher failure rate incurs a greater maintenance cost. On the other hand, large database-class machines are not recommended either, since they don't score well on the price/performance curve. And even though you would need fewer of them to build a cluster of comparable performance to one built of mid-range commodity hardware, when one did fail it would have a bigger impact on the cluster, since a larger proportion of the cluster hardware would be unavailable.
Hardware specifications rapidly become obsolete, but for the sake of illustration, a typical choice of machine for running a Hadoop datanode and tasktracker in mid-2010 would have the following specifications:
2 quad-core 2-2.5 GHz CPUs
16-24 GB ECC RAM
4 × 1 TB SATA disks
Gigabit Ethernet
While the hardware specification for your cluster will assuredly be different, Hadoop is designed to use multiple cores and disks, so it will be able to take full advantage of more powerful hardware.
Why Not Use RAID?
HDFS clusters do not benefit from using RAID (Redundant Array of Independent Disks) for datanode storage (although RAID is recommended for the namenode's disks, to protect against corruption of its metadata). The redundancy that RAID provides is not needed, since HDFS handles it by replication between nodes.
Furthermore, RAID striping (RAID 0), which is commonly used to increase performance, turns out to be slower than the JBOD (Just a Bunch Of Disks) configuration used by HDFS, which round-robins HDFS blocks between all disks. The reason for this is that RAID 0 read and write operations are limited by the speed of the slowest disk in the RAID array. In JBOD, disk operations are independent, so the average speed of operations is greater than that of the slowest disk. Disk performance often shows considerable variation in practice, even for disks of the same model. In some benchmarking carried out on a Yahoo! cluster (http://markmail.org/message/xmzc45zi25htr7ry), JBOD performed 10% faster than RAID 0 in one test (Gridmix), and 30% better in another (HDFS write throughput).
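The effect is easy to see with a toy throughput model (the disk speeds below are illustrative, not benchmark figures): RAID 0 is bounded by its slowest member, while JBOD lets each disk run at its own speed.

```python
# Four disks of the same model with realistic speed variation (MB/s).
disk_mb_per_s = [95, 100, 105, 110]

# RAID 0: every stripe waits for the slowest disk in the array.
raid0 = min(disk_mb_per_s) * len(disk_mb_per_s)

# JBOD: disk operations are independent, so throughputs simply add.
jbod = sum(disk_mb_per_s)

print(raid0, jbod)
```

Even with modest variation between disks, the JBOD total comes out ahead, which matches the direction of the Yahoo! benchmark results cited above.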
Finally, if a disk fails in a JBOD configuration, HDFS can continue to operate without the failed disk, whereas with RAID, failure of a single disk causes the whole array (and hence the node) to become unavailable.
The bulk of Hadoop is written in Java, and can therefore run on any platform with a JVM, although there are enough parts that harbor Unix assumptions (the control scripts, for example) to make it unwise to run on a non-Unix platform in production.
1. ECC memory is strongly recommended, as several Hadoop users have reported seeing many checksum errors when using non-ECC memory on Hadoop clusters.
In fact, Windows operating systems are not supported production platforms (although they can be used with Cygwin as a development platform; see Appendix A).
How large should your cluster be? There isn't an exact answer to this question, but the beauty of Hadoop is that you can start with a small cluster (say, 10 nodes) and grow it as your storage and computational needs grow. In many ways, a better question is this: how fast does my cluster need to grow? You can get a good feel for this by considering storage capacity.
For example, if your data grows by 1 TB a week, and you have three-way HDFS replication, then you need an additional 3 TB of raw storage per week. Allow some room for intermediate files and logfiles (around 30%, say), and this works out at about one machine (2010 vintage) per week, on average. In practice, you wouldn't buy a new machine each week and add it to the cluster. The value of doing a back-of-the-envelope calculation like this is that it gives you a feel for how big your cluster should be: in this example, a cluster that holds two years of data needs 100 machines.
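The arithmetic behind this estimate can be checked in a few lines (assuming a 2010-vintage machine with 4 × 1 TB disks, as specified earlier):

```python
# Back-of-the-envelope cluster sizing from the text.
weekly_data_tb = 1          # new data per week
replication = 3             # three-way HDFS replication
overhead = 0.30             # headroom for intermediate files and logfiles
machine_capacity_tb = 4     # 4 x 1 TB disks per machine

raw_per_week = weekly_data_tb * replication * (1 + overhead)   # ~3.9 TB/week
machines_per_week = raw_per_week / machine_capacity_tb         # ~1 machine/week
two_years = round(machines_per_week * 104)                     # 104 weeks

print(two_years)   # roughly 100 machines
```

The result lands at about 100 machines, matching the figure in the text; the point of the exercise is the order of magnitude, not the exact count.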
For a small cluster (on the order of 10 nodes), it is usually acceptable to run the namenode and the jobtracker on a single master machine (as long as at least one copy of the namenode's metadata is stored on a remote filesystem). As the cluster and the number of files stored in HDFS grow, the namenode needs more memory, so the namenode and jobtracker should be moved onto separate machines.
The secondary namenode can be run on the same machine as the namenode, but again for reasons of memory usage (the secondary has the same memory requirements as the primary), it is best to run it on a separate piece of hardware, especially for larger clusters. (This topic is discussed in more detail in "Master node scenarios" on page 304.)
Machines running the namenodes should typically run on 64-bit hardware to avoid the 3 GB limit on Java heap size in 32-bit architectures.
Network Topology
A common Hadoop cluster architecture consists of a two-level network topology, as illustrated in Figure 9-1. Typically there are 30 to 40 servers per rack, with a 1 Gb switch for the rack (only three are shown in the diagram), and an uplink to a core switch or router (which is normally 1 Gb or better). The salient point is that the aggregate bandwidth between nodes on the same rack is much greater than that between nodes on different racks.
2. The traditional advice says other machines in the cluster (jobtracker, datanodes/tasktrackers) should be 32-bit to avoid the memory overhead of larger pointers. Sun's Java 6 update 14 features "compressed ordinary object pointers," which eliminates much of this overhead, so there's now no real downside to running on 64-bit hardware.
Rack awareness
To get maximum performance out of Hadoop, it is important to configure Hadoop so that it knows the topology of your network. If your cluster runs on a single rack, then there is nothing more to do, since this is the default. However, for multirack clusters, you need to map nodes to racks. By doing this, Hadoop will prefer within-rack transfers (where there is more bandwidth available) to off-rack transfers when placing MapReduce tasks on nodes. HDFS will also be able to place replicas more intelligently to trade off performance and resilience.
Network locations such as nodes and racks are represented in a tree, which reflects the network "distance" between locations. The namenode uses the network location when determining where to place block replicas (see "Network Topology and Hadoop" on page 71); the MapReduce scheduler uses network location to determine where the closest replica is as input to a map task.
The Hadoop configuration must specify a map between node addresses and network locations. The map is described by a Java interface, DNSToSwitchMapping, whose signature is:
public interface DNSToSwitchMapping {
  public List<String> resolve(List<String> names);
}
Figure 9-1. Typical two-level network architecture for a Hadoop cluster
The names parameter is a list of IP addresses, and the return value is a list of corresponding network location strings. The topology.node.switch.mapping.impl configuration property defines an implementation of the DNSToSwitchMapping interface that the namenode and the jobtracker use to resolve worker node network locations.
For the network in our example, we would map node1, node2, and node3 to /rack1, and node4, node5, and node6 to /rack2.
Most installations don't need to implement the interface themselves, however, since the default implementation is ScriptBasedMapping, which runs a user-defined script to determine the mapping. The script's location is controlled by the property topology.script.file.name. The script must accept a variable number of arguments that are the hostnames or IP addresses to be mapped, and it must emit the corresponding network locations to standard output, separated by whitespace. The Hadoop wiki has an example at http://wiki.apache.org/hadoop/topology_rack_awareness_scripts.
If no script location is specified, the default behavior is to map all nodes to a single network location, called /default-rack.
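A minimal topology script might look like the following Python sketch (the subnet-to-rack table is hypothetical; adapt it to your own network): Hadoop invokes the script with hostnames or IP addresses as arguments, and it prints the corresponding network locations to standard output.

```python
#!/usr/bin/env python
# Sketch of a user-defined topology script for ScriptBasedMapping.
# The subnet-to-rack mapping below is made up for illustration.
import sys

RACK_BY_SUBNET = {
    "10.1.1": "/rack1",
    "10.1.2": "/rack2",
}

def resolve(addr):
    """Map one IP address to a network location string."""
    subnet = ".".join(addr.split(".")[:3])
    return RACK_BY_SUBNET.get(subnet, "/default-rack")

if __name__ == "__main__":
    # One location per argument, whitespace-separated, on standard output.
    print(" ".join(resolve(arg) for arg in sys.argv[1:]))
```

Note that unknown addresses fall back to /default-rack, mirroring Hadoop's own default behavior when no script is configured.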
Cluster Setup and Installation
Your hardware has arrived. The next steps are to get it racked up and install the software needed to run Hadoop.
There are various ways to install and configure Hadoop. This chapter describes how to do it from scratch using the Apache Hadoop distribution, and will give you the background to cover the things you need to think about when setting up Hadoop. Alternatively, if you would like to use RPMs or Debian packages for managing your Hadoop installation, then you might want to start with Cloudera's Distribution, described in Appendix B.
To ease the burden of installing and maintaining the same software on each node, it is normal to use an automated installation method like Red Hat Linux's Kickstart or Debian's Fully Automatic Installation. These tools allow you to automate the operating system installation by recording the answers to questions that are asked during the installation process (such as the disk partition layout), as well as which packages to install. Crucially, they also provide hooks to run scripts at the end of the process, which are invaluable for doing final system tweaks and customization that is not covered by the standard installer.
The following sections describe the customizations that are needed to run Hadoop. These should all be added to the installation script.
Installing Java
Java 6 or later is required to run Hadoop. The latest stable Sun JDK is the preferred option, although Java distributions from other vendors may work, too. The following command confirms that Java was installed correctly:
% java -version
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) 64-Bit Server VM (build 11.2-b01, mixed mode)
Creating a Hadoop User
It's good practice to create a dedicated Hadoop user account to separate the Hadoop installation from other services running on the same machine.
For small clusters, some administrators choose to make this user's home directory an NFS-mounted drive, to aid with SSH key distribution (see the following discussion). The NFS server is typically outside the Hadoop cluster. If you use NFS, it is worth considering autofs, which allows you to mount the NFS filesystem on demand, when the system accesses it. Autofs provides some protection against the NFS server failing and allows you to use replicated filesystems for failover. There are other NFS gotchas to watch out for, such as synchronizing UIDs and GIDs. For help setting up NFS on Linux, refer to the HOWTO at http://nfs.sourceforge.net/nfs-howto/index.html.
Installing Hadoop
Download Hadoop from the Apache Hadoop releases page (http://hadoop.apache.org/core/releases.html), and unpack the contents of the distribution in a sensible location, such as /usr/local (/opt is another standard choice). Note that Hadoop is not installed in the hadoop user's home directory, as that may be an NFS-mounted directory:
% cd /usr/local
% sudo tar xzf hadoop-x.y.z.tar.gz
We also need to change the owner of the Hadoop files to be the hadoop user and group:
% sudo chown -R hadoop:hadoop hadoop-x.y.z
Some administrators like to install HDFS and MapReduce in separate locations on the same system. At the time of this writing, only HDFS and MapReduce from the same Hadoop release are compatible with one another; however, in future releases, the compatibility requirements will be loosened. When this happens, having independent installations makes sense, as it gives more upgrade options (for more, see "Upgrades" on page 360). For example, it is convenient to be able to upgrade MapReduce (perhaps to patch a bug) while leaving HDFS running.
Note that separate installations of HDFS and MapReduce can still share configuration by using the --config option (when starting daemons) to refer to a common configuration directory. They can also log to the same directory, as the logfiles they produce are named in such a way as to avoid clashes.
Testing the Installation
Once you've created an installation script, you are ready to test it by installing it on the machines in your cluster. This will probably take a few iterations as you discover kinks in the install. When it's working, you can proceed to configure Hadoop and give it a test run. This process is documented in the following sections.
SSH Configuration
The Hadoop control scripts (but not the daemons) rely on SSH to perform cluster-wide
operations. For example, there is a script for stopping and starting all the daemons in
the cluster. Note that the control scripts are optional; cluster-wide operations can be
performed by other mechanisms, too (such as a distributed shell).
To work seamlessly, SSH needs to be set up to allow password-less login for the
hadoop user from machines in the cluster. The simplest way to achieve this is to generate
a public/private key pair, and place it in an NFS location that is shared across the cluster.
First, generate an RSA key pair by typing the following in the hadoop user account:
% ssh-keygen -t rsa -f ~/.ssh/id_rsa
Even though we want password-less logins, keys without passphrases are not consid-
ered good practice (it's OK to have an empty passphrase when running a local pseudo-
distributed cluster, as described in Appendix A), so we specify a passphrase when
prompted for one. We shall use ssh-agent to avoid the need to enter a passphrase for
each connection.
The private key is in the file specified by the -f option, ~/.ssh/id_rsa, and the public key
is stored in a file with the same name with .pub appended, ~/.ssh/id_rsa.pub.
Next we need to make sure that the public key is in the ~/.ssh/authorized_keys file on
all the machines in the cluster that we want to connect to. If the hadoop user's home
directory is an NFS filesystem, as described earlier, then the keys can be shared across
the cluster by typing:
% cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
If the home directory is not shared using NFS, then the public keys will need to be
shared by some other means.
Test that you can SSH from the master to a worker machine by making sure ssh-
agent is running, and then run ssh-add to store your passphrase. You should be able
to ssh to a worker without entering the passphrase again.
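Under the NFS-shared-home assumption above, the whole key setup can be sketched in a handful of commands. The passphrase and the worker hostname in the comments are placeholders, and a scratch directory stands in for ~/.ssh so the sketch is safe to replay:

```shell
# Generate a passphrase-protected RSA key pair and authorize it.
# KEYDIR stands in for the hadoop user's ~/.ssh directory.
KEYDIR=$(mktemp -d)
ssh-keygen -q -t rsa -N 'my-passphrase' -f "$KEYDIR/id_rsa"
cat "$KEYDIR/id_rsa.pub" >> "$KEYDIR/authorized_keys"
# With the key in place, load it into ssh-agent once:
#   eval "$(ssh-agent)" && ssh-add "$KEYDIR/id_rsa"
# after which "ssh worker001" should not prompt for the passphrase.
ls "$KEYDIR"
```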
Hadoop Configuration
There are a handful of files for controlling the configuration of a Hadoop installation;
the most important ones are listed in Table 9-1. This section covers MapReduce 1,
which employs the jobtracker and tasktracker daemons. Running MapReduce 2 is
substantially different, and is covered in "YARN Configuration" on page 318.
Table 9-1. Hadoop configuration files

Filename | Format | Description
hadoop-env.sh | Bash script | Environment variables that are used in the scripts to run Hadoop.
core-site.xml | Hadoop configuration XML | Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.
hdfs-site.xml | Hadoop configuration XML | Configuration settings for HDFS daemons: the namenode, the secondary namenode, and the datanodes.
mapred-site.xml | Hadoop configuration XML | Configuration settings for MapReduce daemons: the jobtracker, and the tasktrackers.
masters | Plain text | A list of machines (one per line) that each run a secondary namenode.
slaves | Plain text | A list of machines (one per line) that each run a datanode and a tasktracker.
hadoop-metrics.properties | Java properties | Properties for controlling how metrics are published in Hadoop (see "Metrics" on page 350).
log4j.properties | Java properties | Properties for system logfiles, the namenode audit log, and the task log for the tasktracker child process ("Hadoop Logs" on page 173).
These files are all found in the conf directory of the Hadoop distribution. The config-
uration directory can be relocated to another part of the filesystem (outside the Hadoop
installation, which makes upgrades marginally easier) as long as daemons are started
with the --config option specifying the location of this directory on the local filesystem.

3. See its man page for instructions on how to start ssh-agent.
Configuration Management
Hadoop does not have a single, global location for configuration information. Instead,
each Hadoop node in the cluster has its own set of configuration files, and it is up to
administrators to ensure that they are kept in sync across the system. Hadoop provides
a rudimentary facility for synchronizing configuration using rsync (see upcoming dis-
cussion); alternatively, there are parallel shell tools that can help do this, like dsh or
pdsh.
Hadoop is designed so that it is possible to have a single set of configuration files that
are used for all master and worker machines. The great advantage of this is simplicity,
both conceptually (since there is only one configuration to deal with) and operationally
(as the Hadoop scripts are sufficient to manage a single configuration setup).
For some clusters, the one-size-fits-all configuration model breaks down. For example,
if you expand the cluster with new machines that have a different hardware specifica-
tion from the existing ones, then you need a different configuration for the new machines
to take advantage of their extra resources.
In these cases, you need to have the concept of a class of machine, and maintain a
separate configuration for each class. Hadoop doesn't provide tools to do this, but there
are several excellent tools for doing precisely this type of configuration management,
such as Chef, Puppet, cfengine, and bcfg2.
For a cluster of any size, it can be a challenge to keep all of the machines in sync: consider
what happens if a machine is unavailable when you push out an update. Who en-
sures it gets the update when it becomes available? This is a big problem and can lead
to divergent installations, so even if you use the Hadoop control scripts for managing
Hadoop, it may be a good idea to use configuration management tools for maintaining
the cluster. These tools are also excellent for doing regular maintenance, such as patch-
ing security holes and updating system packages.
Control scripts
Hadoop comes with scripts for running commands, and starting and stopping daemons
across the whole cluster. To use these scripts (which can be found in the bin directory),
you need to tell Hadoop which machines are in the cluster. There are two files for this
purpose, called masters and slaves, each of which contains a list of the machine host-
names or IP addresses, one per line. The masters file is actually a misleading name, in
that it determines which machine or machines should run a secondary namenode. The
slaves file lists the machines that the datanodes and tasktrackers should run on. Both
the masters and slaves files reside in the configuration directory, although the slaves file
may be placed elsewhere (and given another name) by changing the HADOOP_SLAVES
setting in hadoop-env.sh. Also, these files do not need to be distributed to worker nodes,
since they are used only by the control scripts running on the namenode or jobtracker.
You don't need to specify which machine (or machines) the namenode and jobtracker
run on in the masters file, as this is determined by the machine the scripts are run on.
(In fact, specifying these in the masters file would cause a secondary namenode to run
there, which isn't always what you want.) For example, the start-dfs.sh script, which
starts all the HDFS daemons in the cluster, runs the namenode on the machine the
script is run on. In slightly more detail, it:
1. Starts a namenode on the local machine (the machine that the script is run on)
2. Starts a datanode on each machine listed in the slaves file
3. Starts a secondary namenode on each machine listed in the masters file
There is a similar script called start-mapred.sh, which starts all the MapReduce dae-
mons in the cluster. More specifically, it:
1. Starts a jobtracker on the local machine
2. Starts a tasktracker on each machine listed in the slaves file
Note that masters is not used by the MapReduce control scripts.
Also provided are stop-dfs.sh and stop-mapred.sh scripts to stop the daemons started
by the corresponding start script.
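As a concrete sketch, suppose a small cluster with hypothetical hostnames master1 (the namenode), master2 (the secondary namenode), and three workers. The two files would then look like this (written here into a scratch directory; in practice they live in the conf directory):

```shell
CONF=$(mktemp -d)   # stands in for the Hadoop conf directory
printf 'master2\n' > "$CONF/masters"
printf 'worker001\nworker002\nworker003\n' > "$CONF/slaves"
# Running start-dfs.sh on master1 would then start the namenode locally,
# a datanode on each of the three workers, and a secondary namenode on master2.
cat "$CONF/masters" "$CONF/slaves"
```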
These scripts start and stop Hadoop daemons using the hadoop-daemon.sh script. If
you use the aforementioned scripts, you shouldn't call hadoop-daemon.sh directly. But
if you need to control Hadoop daemons from another system or from your own scripts,
then the hadoop-daemon.sh script is a good integration point. Likewise, hadoop-
daemons.sh (with an "s") is handy for starting the same daemon on a set of hosts.
Master node scenarios
Depending on the size of the cluster, there are various configurations for running the
master daemons: the namenode, secondary namenode, and jobtracker. On a small
cluster (a few tens of nodes), it is convenient to put them on a single machine; however,
as the cluster gets larger, there are good reasons to separate them.
The namenode has high memory requirements, as it holds file and block metadata for
the entire namespace in memory. The secondary namenode, while idle most of the time,
has a comparable memory footprint to the primary when it creates a checkpoint. (This
is explained in detail in "The filesystem image and edit log" on page 338.) For filesys-
tems with a large number of files, there may not be enough physical memory on one
machine to run both the primary and secondary namenode.
The secondary namenode keeps a copy of the latest checkpoint of the filesystem met-
adata that it creates. Keeping this (stale) backup on a different node from the namenode
allows recovery in the event of loss (or corruption) of all the namenode's metadata files.
(This is discussed further in Chapter 10.)
On a busy cluster running lots of MapReduce jobs, the jobtracker uses considerable
memory and CPU resources, so it should run on a dedicated node.
Whether the master daemons run on one or more nodes, the following instructions
apply:
• Run the HDFS control scripts from the namenode machine. The masters file should
contain the address of the secondary namenode.
• Run the MapReduce control scripts from the jobtracker machine.
When the namenode and jobtracker are on separate nodes, their slaves files need to be
kept in sync, since each node in the cluster should run a datanode and a tasktracker.
Environment Settings
In this section, we consider how to set the variables in hadoop-env.sh.
By default, Hadoop allocates 1,000 MB (1 GB) of memory to each daemon it runs. This
is controlled by the HADOOP_HEAPSIZE setting in hadoop-env.sh. In addition, the task-
tracker launches separate child JVMs to run map and reduce tasks in, so we need to
factor these into the total memory footprint of a worker machine.
The maximum number of map tasks that can run on a tasktracker at one time is con-
trolled by the mapred.tasktracker.map.tasks.maximum property, which defaults to two
tasks. There is a corresponding property for reduce tasks, mapred.task
tracker.reduce.tasks.maximum, which also defaults to two tasks. The tasktracker is said
to have two map slots and two reduce slots.
The memory given to each child JVM running a task can be changed by setting the
mapred.child.java.opts property. The default setting is -Xmx200m, which gives each task
200 MB of memory. (Incidentally, you can provide extra JVM options here, too. For
example, you might enable verbose GC logging to debug GC.) The default configura-
tion therefore uses 2,800 MB of memory for a worker machine (see Table 9-2).
Table 9-2. Worker node memory calculation

JVM | Default memory used (MB) | Memory used for 8 processors, 400 MB per child (MB)
Datanode | 1,000 | 1,000
Tasktracker | 1,000 | 1,000
Tasktracker child map task | 2 × 200 | 7 × 400
Tasktracker child reduce task | 2 × 200 | 7 × 400
Total | 2,800 | 7,600
The number of tasks that can be run simultaneously on a tasktracker is related to the
number of processors available on the machine. Because MapReduce jobs are normally
I/O-bound, it makes sense to have more tasks than processors to get better
utilization. The amount of oversubscription depends on the CPU utilization of the jobs
you run, but a good rule of thumb is to have a factor of between one and two more
tasks (counting both map and reduce tasks) than processors.
For example, if you had 8 processors and you wanted to run 2 processes on each pro-
cessor, then you could set each of mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum to 7 (not 8, since the datanode and the
tasktracker each take one slot). If you also increased the memory available to each child
task to 400 MB, then the total memory usage would be 7,600 MB (see Table 9-2).
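The arithmetic behind the last column of Table 9-2 can be checked directly:

```shell
# Two fixed daemons (datanode, tasktracker) at 1,000 MB each, plus
# 7 map slots and 7 reduce slots at 400 MB per child JVM.
daemons=$((2 * 1000))
children=$(( (7 + 7) * 400 ))
echo $((daemons + children))   # prints 7600 (MB)
```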
Whether this Java memory allocation will fit into 8 GB of physical memory depends
on the other processes that are running on the machine. If you are running Streaming
or Pipes programs, this allocation will probably be inappropriate (and the memory
allocated to the child should be dialed down), since it doesn't allow enough memory
for users' (Streaming or Pipes) processes to run. The thing to avoid is processes being
swapped out, as this leads to severe performance degradation. The precise memory
settings are necessarily very cluster-dependent and can be optimized over time with
experience gained from monitoring the memory usage across the cluster. Tools like
Ganglia ("GangliaContext" on page 352) are good for gathering this information. See
"Task memory limits" on page 316 for more on how to enforce task memory limits.
Hadoop also provides settings to control how much memory is used for MapReduce
operations. These can be set on a per-job basis and are covered in the section on "Shuffle
and Sort" on page 205.
For the master nodes, each of the namenode, secondary namenode, and jobtracker
daemons uses 1,000 MB by default, a total of 3,000 MB.
How much memory does a namenode need?
A namenode can eat up memory, since a reference to every block of every file is main-
tained in memory. It's difficult to give a precise formula, since memory usage depends
on the number of blocks per file, the filename length, and the number of directories in
the filesystem; plus, it can change from one Hadoop release to another.
The default of 1,000 MB of namenode memory is normally enough for a few million
files, but as a rule of thumb for sizing purposes, you can conservatively allow 1,000 MB
per million blocks of storage.
For example, a 200-node cluster with 4 TB of disk space per node, a block size of 128
MB, and a replication factor of 3 has room for about 2 million blocks (or more): 200 ×
4,000,000 MB / (128 MB × 3). So in this case, setting the namenode memory to 2,000
MB would be a good starting point.
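The block-count estimate can be reproduced with a quick calculation:

```shell
# 200 nodes with 4,000,000 MB of raw disk each, divided by the space one
# block occupies across the cluster (128 MB times replication factor 3).
raw_mb=$((200 * 4000000))
blocks=$((raw_mb / (128 * 3)))
echo "$blocks"   # prints 2083333, i.e. roughly 2 million blocks
```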
You can increase the namenode's memory without changing the memory allocated to
other Hadoop daemons by setting HADOOP_NAMENODE_OPTS in hadoop-env.sh to include a
JVM option for setting the memory size. HADOOP_NAMENODE_OPTS allows you to pass extra
options to the namenode's JVM. So, for example, if using a Sun JVM, -Xmx2000m would
specify that 2,000 MB of memory should be allocated to the namenode.
If you change the namenode's memory allocation, don't forget to do the same for the
secondary namenode (using the HADOOP_SECONDARYNAMENODE_OPTS variable), since its
memory requirements are comparable to the primary namenode's. You will probably
also want to run the secondary namenode on a different machine, in this case.
There are corresponding environment variables for the other Hadoop daemons, so you
can customize their memory allocations, if desired. See hadoop-env.sh for details.
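For instance, a hadoop-env.sh fragment along these lines (the 2,000 MB figure is the example from the text) raises the heap for both namenode daemons while leaving the others at the default:

```shell
# hadoop-env.sh: extra JVM options for the namenode and secondary namenode
export HADOOP_NAMENODE_OPTS="-Xmx2000m"
export HADOOP_SECONDARYNAMENODE_OPTS="-Xmx2000m"
```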
The location of the Java implementation to use is determined by the JAVA_HOME setting
in hadoop-env.sh, or from the JAVA_HOME shell environment variable if it is not set in hadoop-
env.sh. It's a good idea to set the value in hadoop-env.sh, so that it is clearly defined in
one place and to ensure that the whole cluster is using the same version of Java.
System logfiles
System logfiles produced by Hadoop are stored in $HADOOP_INSTALL/logs by default.
This can be changed using the HADOOP_LOG_DIR setting in hadoop-env.sh. It's a good idea
to change this so that logfiles are kept out of the directory that Hadoop is installed in,
since this keeps logfiles in one place even after the installation directory changes after
an upgrade. A common choice is /var/log/hadoop, set by including the following line in
hadoop-env.sh:
export HADOOP_LOG_DIR=/var/log/hadoop
The log directory will be created if it doesn't already exist (if not, confirm that the
Hadoop user has permission to create it). Each Hadoop daemon running on a machine
produces two logfiles. The first is the log output written via log4j. This file, which ends
in .log, should be the first port of call when diagnosing problems, since most application
log messages are written here. The standard Hadoop log4j configuration uses a Daily
Rolling File Appender to rotate logfiles. Old logfiles are never deleted, so you should
arrange for them to be periodically deleted or archived, so as to not run out of disk
space on the local node.
The second logfile is the combined standard output and standard error log. This logfile,
which ends in .out, usually contains little or no output, since Hadoop uses log4j for
logging. It is rotated only when the daemon is restarted, and only the last five logs are
retained. Old logfiles are suffixed with a number between 1 and 5, with 5 being the
oldest file.
Logfile names (of both types) are a combination of the name of the user running the
daemon, the daemon name, and the machine hostname. For example, hadoop-tom-
datanode-sturges.local.log.2008-07-01 is the name of a logfile after it has been rotated.
This naming structure makes it possible to archive logs from all machines in the cluster
in a single directory, if needed, since the filenames are unique.
The username in the logfile name is actually the default for the HADOOP_IDENT_STRING
setting in hadoop-env.sh. If you wish to give the Hadoop instance a different identity
for the purposes of naming the logfiles, change HADOOP_IDENT_STRING to be the identifier
you want.
SSH settings
The control scripts allow you to run commands on (remote) worker nodes from the
master node using SSH. It can be useful to customize the SSH settings, for various
reasons. For example, you may want to reduce the connection timeout (using the
ConnectTimeout option) so the control scripts don't hang around waiting to see whether
a dead node is going to respond. Obviously, this can be taken too far. If the timeout is
too low, then busy nodes will be skipped, which is bad.
Another useful SSH setting is StrictHostKeyChecking, which can be set to no to auto-
matically add new host keys to the known hosts files. The default, ask, is to prompt
the user to confirm that they have verified the key fingerprint, which is not a suitable setting
in a large cluster environment.
To pass extra options to SSH, define the HADOOP_SSH_OPTS environment variable in
hadoop-env.sh. See the ssh and ssh_config manual pages for more SSH settings.
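Putting the two settings together, a hadoop-env.sh fragment might read as follows (the timeout value is illustrative, and you should weigh StrictHostKeyChecking=no against your security requirements):

```shell
# hadoop-env.sh: options passed to ssh by the control scripts
export HADOOP_SSH_OPTS="-o ConnectTimeout=10 -o StrictHostKeyChecking=no"
```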
The Hadoop control scripts can distribute configuration files to all nodes of the cluster
using rsync. This is not enabled by default, but by defining the HADOOP_MASTER setting
in hadoop-env.sh, worker daemons will rsync the tree rooted at HADOOP_MASTER to the
local node's HADOOP_INSTALL whenever the daemon starts up.
What if you have two masters, a namenode and a jobtracker on separate machines?
You can pick one as the source and the other can rsync from it, along with all the
workers. In fact, you could use any machine, even one outside the Hadoop cluster, to
rsync from.
Because HADOOP_MASTER is unset by default, there is a bootstrapping problem: how do
we make sure hadoop-env.sh with HADOOP_MASTER set is present on worker nodes? For
small clusters, it is easy to write a small script to copy hadoop-env.sh from the master
to all of the worker nodes. For larger clusters, tools like dsh can do the copies in parallel.
Alternatively, a suitable hadoop-env.sh can be created as a part of the automated in-
stallation script (such as Kickstart).
When starting a large cluster with rsyncing enabled, the worker nodes can overwhelm
the master node with rsync requests, since the workers start at around the same time.
To avoid this, set the HADOOP_SLAVE_SLEEP setting to a small number of seconds, such
as 0.1, for one-tenth of a second. When running commands on all nodes of the cluster,
the master will sleep for this period between invoking the command on each worker
machine in turn.

4. For more discussion of the security implications of SSH host keys, consult the article "SSH Host Key
Protection" by Brian Hatch at http://www.securityfocus.com/infocus/1806.
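A hadoop-env.sh fragment enabling this might look as follows. The hostname master1 and the install path are placeholders, and the exact form expected for HADOOP_MASTER should be checked against the scripts shipped with your Hadoop version:

```shell
# hadoop-env.sh: workers rsync the tree rooted here at daemon start-up
export HADOOP_MASTER=master1:/usr/local/hadoop
# Stagger worker start-up by a tenth of a second to avoid an rsync stampede.
export HADOOP_SLAVE_SLEEP=0.1
```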
Important Hadoop Daemon Properties
Hadoop has a bewildering number of configuration properties. In this section, we
address the ones that you need to define (or at least understand why the default is
appropriate) for any real-world working cluster. These properties are set in the Hadoop
site files: core-site.xml, hdfs-site.xml, and mapred-site.xml. Typical examples of these
files are shown in Example 9-1, Example 9-2, and Example 9-3. Notice that most prop-
erties are marked as final, in order to prevent them from being overridden by job con-
figurations. You can learn more about how to write Hadoop's configuration files in
"The Configuration API" on page 146.
Example 9-1. A typical core-site.xml configuration file
<?xml version="1.0"?>
<!-- core-site.xml -->
Example 9-2. A typical hdfs-site.xml configuration file
<?xml version="1.0"?>
<!-- hdfs-site.xml -->

Example 9-3. A typical mapred-site.xml configuration file
<?xml version="1.0"?>
<!-- mapred-site.xml -->





<!-- Not marked as final so jobs can include JVM debugging options -->
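The bodies of Examples 9-1 through 9-3 did not survive in this copy. The following sketch reconstructs representative contents from the properties discussed in this section; the hostnames (namenode, jobtracker) and the directory paths are placeholders:

```xml
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode/</value>
    <final>true</final>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/disk1/hdfs/name,/remote/hdfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary</value>
    <final>true</final>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker:8021</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/disk1/mapred/local,/disk2/mapred/local</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/tmp/hadoop/mapred/system</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>7</value>
    <final>true</final>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
    <!-- Not marked as final so jobs can include JVM debugging options -->
  </property>
</configuration>
```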
To run HDFS, you need to designate one machine as a namenode. In this case, the
property fs.default.name is an HDFS filesystem URI, whose host is the namenode's
hostname or IP address, and whose port is the port that the namenode will listen on for RPCs.
If no port is specified, the default of 8020 is used.
The masters file that is used by the control scripts is not used by the
HDFS (or MapReduce) daemons to determine hostnames. In fact, be-
cause the masters file is only used by the scripts, you can ignore it if you
don't use them.
The fs.default.name property also doubles as specifying the default filesystem. The
default filesystem is used to resolve relative paths, which are handy to use since they
save typing (and avoid hardcoding knowledge of a particular namenode's address). For
example, with the default filesystem defined in Example 9-1, the relative URI /a/b is
resolved to hdfs://namenode/a/b.
If you are running HDFS, the fact that fs.default.name is used to specify
both the HDFS namenode and the default filesystem means HDFS has
to be the default filesystem in the server configuration. Bear in mind,
however, that it is possible to specify a different filesystem as the default
in the client configuration, for convenience.
For example, if you use both HDFS and S3 filesystems, then you have
a choice of specifying either as the default in the client configuration,
which allows you to refer to the default with a relative URI and the other
with an absolute URI.
There are a few other configuration properties you should set for HDFS: those that set
the storage directories for the namenode and for datanodes. The property
dfs.name.dir specifies a list of directories where the namenode stores persistent file-
system metadata (the edit log and the filesystem image). A copy of each of the metadata
files is stored in each directory for redundancy. It's common to configure
dfs.name.dir so that the namenode metadata is written to one or two local disks, and
a remote disk, such as an NFS-mounted directory. Such a setup guards against failure
of a local disk and failure of the entire namenode, since in both cases the files can be
recovered and used to start a new namenode. (The secondary namenode takes only
periodic checkpoints of the namenode, so it does not provide an up-to-date backup of
the namenode.)
You should also set the dfs.data.dir property, which specifies a list of directories for
a datanode to store its blocks. Unlike the namenode, which uses multiple directories
for redundancy, a datanode round-robins writes between its storage directories, so for
performance you should specify a storage directory for each local disk. Read perfor-
mance also benefits from having multiple disks for storage, because blocks will be
spread across them, and concurrent reads for distinct blocks will be correspondingly
spread across disks.
For maximum performance, you should mount storage disks with the
noatime option. This setting means that last accessed time information
is not written on file reads, which gives significant performance gains.
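For example, an /etc/fstab entry for a data disk mounted with noatime might look like this (the device and mount point are placeholders):

```
# /etc/fstab fragment: mount the data disk without access-time updates
# <device>   <mount>  <type>  <options>          <dump> <pass>
/dev/sdb1    /disk1   ext4    defaults,noatime   0      0
```

An already-mounted volume can be switched over without a reboot using mount -o remount,noatime on its mount point.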
Finally, you should configure where the secondary namenode stores its checkpoints of
the filesystem. The fs.checkpoint.dir property specifies a list of directories where the
checkpoints are kept. Like the storage directories for the namenode, which keep re-
dundant copies of the namenode metadata, the checkpointed filesystem image is stored
in each checkpoint directory for redundancy.
Table 9-3 summarizes the important configuration properties for HDFS.

Table 9-3. Important HDFS daemon properties

Property name | Type | Default value | Description
fs.default.name | URI | file:/// | The default filesystem. The URI defines the hostname and port that the namenode's RPC server runs on. The default port is 8020. This property is set in core-site.xml.
dfs.name.dir | comma-separated directory names | ${hadoop.tmp.dir}/dfs/name | The list of directories where the namenode stores its persistent metadata. The namenode stores a copy of the metadata in each directory in the list.
dfs.data.dir | comma-separated directory names | ${hadoop.tmp.dir}/dfs/data | A list of directories where the datanode stores blocks. Each block is stored in only one of these directories.
fs.checkpoint.dir | comma-separated directory names | ${hadoop.tmp.dir}/dfs/namesecondary | A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list.
Note that the storage directories for HDFS are under Hadoop's tempo-
rary directory by default (the hadoop.tmp.dir property, whose default
is /tmp/hadoop-${user.name}). Therefore, it is critical that these proper-
ties are set so that data is not lost by the system clearing out temporary
directories.
To run MapReduce, you need to designate one machine as a jobtracker, which on small
clusters may be the same machine as the namenode. To do this, set the
mapred.job.tracker property to the hostname or IP address and port that the jobtracker
will listen on. Note that this property is not a URI, but a host-port pair, separated by
a colon. The port number 8021 is a common choice.
During a MapReduce job, intermediate data and working files are written to temporary
local files. Since this data includes the potentially very large output of map tasks, you
need to ensure that the mapred.local.dir property, which controls the location of local
temporary storage, is configured to use disk partitions that are large enough. The
mapred.local.dir property takes a comma-separated list of directory names, and you
should use all available local disks to spread disk I/O. Typically, you will use the same
disks and partitions (but different directories) for MapReduce temporary data as you
use for datanode block storage, as governed by the dfs.data.dir property, discussed
earlier.
MapReduce uses a distributed filesystem to share files (such as the job JAR file) with
the tasktrackers that run the MapReduce tasks. The mapred.system.dir property is used
to specify a directory where these files can be stored. This directory is resolved relative
to the default filesystem (configured in fs.default.name), which is usually HDFS.
Finally, you should set the mapred.tasktracker.map.tasks.maximum and mapred.task
tracker.reduce.tasks.maximum properties to reflect the number of available cores on
the tasktracker machines, and mapred.child.java.opts to reflect the amount of memory
available for the tasktracker child JVMs. See the discussion in "Memory"
on page 305.
Table 9-4 summarizes the important configuration properties for MapReduce.

Table 9-4. Important MapReduce daemon properties

Property name | Type | Default value | Description
mapred.job.tracker | hostname and port | local | The hostname and port that the jobtracker's RPC server runs on. If set to the default value of local, then the jobtracker is run in-process on demand when you run a MapReduce job (you don't need to start the jobtracker in this case, and in fact you will get an error if you try to start it in this case).
mapred.local.dir | comma-separated directory names | ${hadoop.tmp.dir}/mapred/local | A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.
mapred.system.dir | URI | ${hadoop.tmp.dir}/mapred/system | The directory relative to fs.default.name where shared files are stored, during a job run.
mapred.tasktracker.map.tasks.maximum | int | 2 | The number of map tasks that may be run on a tasktracker at any one time.
mapred.tasktracker.reduce.tasks.maximum | int | 2 | The number of reduce tasks that may be run on a tasktracker at any one time.
mapred.child.java.opts | String | -Xmx200m | The JVM options used to launch the tasktracker child process that runs map and reduce tasks. This property can be set on a per-job basis, which can be useful for setting JVM properties for debugging, for example.
mapred.map.child.java.opts | String | -Xmx200m | The JVM options used for the child process that runs map tasks. From 0.21.
mapred.reduce.child.java.opts | String | -Xmx200m | The JVM options used for the child process that runs reduce tasks. From 0.21.
Hadoop Daemon Addresses and Ports
Hadoop daemons generally run both an RPC server (Table 9-5) for communication
between daemons and an HTTP server to provide web pages for human consumption
(Table 9-6). Each server is configured by setting the network address and port number
to listen on. By specifying the network address as, Hadoop will bind to all
addresses on the machine. Alternatively, you can specify a single address to bind to. A
port number of 0 instructs the server to start on a free port: this is generally discouraged,
since it is incompatible with setting cluster-wide firewall policies.
Table 9-5. RPC server properties

Property name | Default value | Description
fs.default.name | file:/// | When set to an HDFS URI, this property determines the namenode's RPC server address and port. The default port is 8020 if not specified.
dfs.datanode.ipc.address | | The datanode's RPC server address and port.
mapred.job.tracker | local | When set to a hostname and port, this property specifies the jobtracker's RPC server address and port. A commonly used port is 8021.
mapred.task.tracker.report.address | | The tasktracker's RPC server address and port. This is used by the tasktracker's child JVM to communicate with the tasktracker. Using any free port is acceptable in this case, as the server only binds to the loopback address. You should change this setting only if the machine has no loopback address.
In addition to an RPC server, datanodes run a TCP/IP server for block transfers. The
server address and port are set by the dfs.datanode.address property, which has a default
value of
Table 9-6. HTTP server properties

Property name | Default value | Description
mapred.job.tracker.http.address | | The jobtracker's HTTP server address and port.
mapred.task.tracker.http.address | | The tasktracker's HTTP server address and port.
dfs.http.address | | The namenode's HTTP server address and port.
dfs.datanode.http.address | | The datanode's HTTP server address and port.
dfs.secondary.http.address | | The secondary namenode's HTTP server address and port.
There are also settings for controlling which network interfaces the datanodes and
tasktrackers report as their IP addresses (for HTTP and RPC servers). The relevant
properties are dfs.datanode.dns.interface and mapred.tasktracker.dns.interface,
both of which are set to default, which will use the default network interface. You can
set this explicitly to report the address of a particular interface (eth0, for example).
Other Hadoop Properties
This section discusses some other properties that you might consider setting.
Cluster membership
To aid the addition and removal of nodes in the future, you can specify a file containing
a list of authorized machines that may join the cluster as datanodes or tasktrackers.
The file is specified using the dfs.hosts (for datanodes) and mapred.hosts (for
tasktrackers) properties, as well as the corresponding dfs.hosts.exclude and
mapred.hosts.exclude files used for decommissioning. See "Commissioning and
Decommissioning Nodes" on page 357 for further discussion.
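As a sketch, the hdfs-site.xml entries for datanodes might look like the following (the file paths are illustrative assumptions, not defaults; mapred.hosts and mapred.hosts.exclude are set analogously in mapred-site.xml):

```xml
<?xml version="1.0"?>
<!-- hdfs-site.xml: restrict which machines may join as datanodes -->
<configuration>
  <property>
    <name>dfs.hosts</name>
    <!-- file listing authorized datanode hostnames, one per line -->
    <value>/etc/hadoop/conf/include</value>
  </property>
  <property>
    <name>dfs.hosts.exclude</name>
    <!-- file listing datanodes to decommission -->
    <value>/etc/hadoop/conf/exclude</value>
  </property>
</configuration>
```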
Buffer size
Hadoop uses a buffer size of 4 KB (4,096 bytes) for its I/O operations. This is a
conservative setting, and with modern hardware and operating systems, you will likely see
performance benefits by increasing it; 128 KB (131,072 bytes) is a common choice. Set
this using the io.file.buffer.size property in core-site.xml.
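For example, a core-site.xml fragment setting a 128 KB buffer would be:

```xml
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>io.file.buffer.size</name>
    <value>131072</value> <!-- 128 KB -->
  </property>
</configuration>
```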
HDFS block size
The HDFS block size is 64 MB by default, but many clusters use 128 MB (134,217,728
bytes) or even 256 MB (268,435,456 bytes) to ease memory pressure on the namenode
and to give mappers more data to work on. Set this using the dfs.block.size property
in hdfs-site.xml.
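A minimal hdfs-site.xml fragment for a 128 MB block size:

```xml
<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.block.size</name>
    <value>134217728</value> <!-- 128 MB -->
  </property>
</configuration>
```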
Reserved storage space
By default, datanodes will try to use all of the space available in their storage directories.
If you want to reserve some space on the storage volumes for non-HDFS use, then you
can set dfs.datanode.du.reserved to the amount, in bytes, of space to reserve.
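As a sketch, reserving 10 GB per volume (an illustrative choice) in hdfs-site.xml:

```xml
<?xml version="1.0"?>
<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>10737418240</value> <!-- 10 GB, in bytes -->
  </property>
</configuration>
```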
Trash
Hadoop filesystems have a trash facility, in which deleted files are not actually deleted,
but rather are moved to a trash folder, where they remain for a minimum period before
being permanently deleted by the system. The minimum period in minutes that a file
will remain in the trash is set using the fs.trash.interval configuration property in
core-site.xml. By default, the trash interval is zero, which disables trash.
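For instance, a core-site.xml fragment enabling a one-day trash interval (1,440 minutes, an illustrative choice):

```xml
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.trash.interval</name>
    <value>1440</value> <!-- minutes; 0 disables trash -->
  </property>
</configuration>
```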
Like in many operating systems, Hadoop's trash facility is a user-level feature, meaning
that only files that are deleted using the filesystem shell are put in the trash. Files deleted
programmatically are deleted immediately. It is possible to use the trash
programmatically, however, by constructing a Trash instance, then calling its moveToTrash() method
with the Path of the file intended for deletion. The method returns a value indicating
success; a value of false means either that trash is not enabled or that the file is already
in the trash.
When trash is enabled, each user has her own trash directory called .Trash in her home
directory. File recovery is simple: you look for the file in a subdirectory of .Trash and
move it out of the trash subtree.
HDFS will automatically delete files in trash folders, but other filesystems will not, so
you have to arrange for this to be done periodically. You can expunge the trash, which
will delete files that have been in the trash longer than their minimum period, using
the filesystem shell:
% hadoop fs -expunge
The Trash class exposes an expunge() method that has the same effect.
Job scheduler
Particularly in a multiuser MapReduce setting, consider changing the default FIFO job
scheduler to one of the more fully featured alternatives. See "Job Scheduling"
on page 204.
Reduce slow start
By default, schedulers wait until 5% of the map tasks in a job have completed before
scheduling reduce tasks for the same job. For large jobs this can cause problems with
cluster utilization, since they take up reduce slots while waiting for the map tasks to
complete. Setting mapred.reduce.slowstart.completed.maps to a higher value, such as
0.80 (80%), can help improve throughput.
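A sketch of the corresponding mapred-site.xml setting:

```xml
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0.80</value> <!-- start reduces after 80% of maps complete -->
  </property>
</configuration>
```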
Task memory limits
On a shared cluster, it shouldn't be possible for one user's errant MapReduce program
to bring down nodes in the cluster. This can happen if the map or reduce task has a
memory leak, for example, because the machine on which the tasktracker is running
will run out of memory and may affect the other running processes.
Or consider the case where a user sets mapred.child.java.opts to a large value and
causes memory pressure on other running tasks, causing them to swap. Marking this
property as final on the cluster would prevent it being changed by users in their jobs,
but there are legitimate reasons to allow some jobs to use more memory, so this is not
always an acceptable solution. Furthermore, even locking down
mapred.child.java.opts does not solve the problem, since tasks can spawn new
processes which are not constrained in their memory usage. Streaming and Pipes jobs do
exactly that, for example.
To prevent cases like these, some way of enforcing a limit on a task's memory usage is
needed. Hadoop provides two mechanisms for this. The simplest is via the Linux
ulimit command, which can be done at the operating system level (in the limits.conf
file, typically found in /etc/security), or by setting mapred.child.ulimit in the Hadoop
configuration. The value is specified in kilobytes, and should be comfortably larger
than the memory of the JVM set by mapred.child.java.opts; otherwise, the child JVM
might not start.
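For example, a mapred-site.xml fragment pairing a 400 MB heap with an 800 MB ulimit (illustrative values, chosen so the limit is comfortably above the heap):

```xml
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value> <!-- task JVM heap -->
  </property>
  <property>
    <name>mapred.child.ulimit</name>
    <value>819200</value> <!-- 800 MB, in KB -->
  </property>
</configuration>
```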
The second mechanism is Hadoop's task memory monitoring feature. The idea is that
an administrator sets a range of allowed virtual memory limits for tasks on the cluster,
and users specify the maximum memory requirements for their jobs in the job
configuration. If a user doesn't set memory requirements for their job, then the defaults are
used (mapred.job.map.memory.mb and mapred.job.reduce.memory.mb).
This approach has a couple of advantages over the ulimit approach. First, it enforces
the memory usage of the whole task process tree, including spawned processes. Second,
it enables memory-aware scheduling, where tasks are scheduled on tasktrackers which
have enough free memory to run them. The Capacity Scheduler, for example, will
account for slot usage based on the memory settings, so that if a job's mapred.job.map.mem
ory.mb setting exceeds mapred.cluster.map.memory.mb then the scheduler will allocate
more than one slot on a tasktracker to run each map task for that job.
To enable task memory monitoring you need to set all six of the properties in
Table 9-7. The default values are all -1, which means the feature is disabled.
Table 9-7. MapReduce task memory monitoring properties

mapred.cluster.map.memory.mb
    Type: int; Default: -1
    The amount of virtual memory, in MB, that defines a map slot. Map tasks that
    require more than this amount of memory will use more than one map slot.

mapred.cluster.reduce.memory.mb
    Type: int; Default: -1
    The amount of virtual memory, in MB, that defines a reduce slot. Reduce tasks
    that require more than this amount of memory will use more than one reduce slot.
5. YARN uses a different memory model to the one described here, and the configuration options are
different. See "Memory" on page 321.
mapred.job.map.memory.mb
    Type: int; Default: -1
    The amount of virtual memory, in MB, that a map task requires to run. If a map
    task exceeds this limit, it may be terminated and marked as failed.

mapred.job.reduce.memory.mb
    Type: int; Default: -1
    The amount of virtual memory, in MB, that a reduce task requires to run. If a
    reduce task exceeds this limit, it may be terminated and marked as failed.

mapred.cluster.max.map.memory.mb
    Type: int; Default: -1
    The maximum limit that users can set mapred.job.map.memory.mb to.

mapred.cluster.max.reduce.memory.mb
    Type: int; Default: -1
    The maximum limit that users can set mapred.job.reduce.memory.mb to.
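Putting the six properties together, a mapred-site.xml fragment along these lines would enable monitoring (the memory values are illustrative choices, not recommendations):

```xml
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <!-- virtual memory that defines one map / reduce slot -->
  <property><name>mapred.cluster.map.memory.mb</name><value>2048</value></property>
  <property><name>mapred.cluster.reduce.memory.mb</name><value>2048</value></property>
  <!-- ceilings on what a user may request per task -->
  <property><name>mapred.cluster.max.map.memory.mb</name><value>4096</value></property>
  <property><name>mapred.cluster.max.reduce.memory.mb</name><value>4096</value></property>
  <!-- per-job defaults, used when a job sets no requirements -->
  <property><name>mapred.job.map.memory.mb</name><value>2048</value></property>
  <property><name>mapred.job.reduce.memory.mb</name><value>2048</value></property>
</configuration>
```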
User Account Creation
Once you have a Hadoop cluster up and running, you need to give users access to it.
This involves creating a home directory for each user and setting ownership permissions
on it:
% hadoop fs -mkdir /user/username
% hadoop fs -chown username:username /user/username
This is a good time to set space limits on the directory. The following sets a 1 TB limit
on the given user directory:
% hadoop dfsadmin -setSpaceQuota 1t /user/username
YARN Configuration
YARN is the next-generation architecture for running MapReduce (and is described in
"YARN (MapReduce 2)" on page 194). It has a different set of daemons and
configuration options to classic MapReduce (also called MapReduce 1), and in this section we
shall look at these differences and how to run MapReduce on YARN.
Under YARN you no longer run a jobtracker or tasktrackers. Instead, there is a single
resource manager running on the same machine as the HDFS namenode (for small
clusters) or on a dedicated machine, and node managers running on each worker node
in the cluster.
The YARN start-all.sh script (in the bin directory) starts the YARN daemons in the
cluster. This script will start a resource manager (on the machine the script is run on),
and a node manager on each machine listed in the slaves file.
YARN also has a job history server daemon that provides users with details of past job
runs, and a web app proxy server for providing a secure way for users to access the UI
provided by YARN applications. In the case of MapReduce, the web UI served by the
proxy provides information about the current job you are running, similar to the one
described in "The MapReduce Web UI" on page 164. By default the web app proxy
server runs in the same process as the resource manager, but it may be configured to
run as a standalone daemon.
YARN has its own set of configuration files, listed in Table 9-8; these are used in addition
to those in Table 9-1.
Table 9-8. YARN configuration files

yarn-env.sh
    Format: Bash script
    Environment variables that are used in the scripts to run YARN.

yarn-site.xml
    Format: Hadoop configuration XML
    Configuration settings for YARN daemons: the resource manager, the job history
    server, the web app proxy server, and the node managers.
Important YARN Daemon Properties
When running MapReduce on YARN, the mapred-site.xml file is still used for general
MapReduce properties, although the jobtracker- and tasktracker-related properties are
not used. None of the properties in Table 9-4 are applicable to YARN, except for
mapred.child.java.opts (and the related properties mapreduce.map.java.opts and map
reduce.reduce.java.opts, which apply only to map or reduce tasks, respectively). The
JVM options specified in this way are used to launch the YARN child process that runs
map or reduce tasks.
The configuration files in Example 9-4 show some of the important configuration
properties for running MapReduce on YARN.
Example 9-4. An example set of site configuration files for running MapReduce on YARN
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<!-- Not marked as final so jobs can include JVM debugging options -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
  </property>
</configuration>

<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>resourcemanager:8040</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/disk1/nm-local-dir,/disk2/nm-local-dir</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
</configuration>
The YARN resource manager address is controlled via yarn.resourceman
ager.address, which takes the form of a host-port pair. In a client configuration this
property is used to connect to the resource manager (using RPC), and in addition the
mapreduce.framework.name property must be set to yarn for the client to use YARN
rather than the local job runner.
Although YARN does not honor mapred.local.dir, it has an equivalent property called
yarn.nodemanager.local-dirs, which allows you to specify which local disks to store
intermediate data on. It is specified by a comma-separated list of local directory paths,
which are used in a round-robin fashion.
YARN doesn't have tasktrackers to serve map outputs to reduce tasks, so for this
function it relies on shuffle handlers, which are long-running auxiliary services running in
node managers. Since YARN is a general-purpose service, the shuffle handlers need to
be explicitly enabled in yarn-site.xml by setting the yarn.nodemanager.aux-serv
ices property to mapreduce.shuffle.
Table 9-9 summarizes the important configuration properties for YARN.
Table 9-9. Important YARN daemon properties

yarn.resourcemanager.address
    Type: hostname and port
    The hostname and port that the resource manager's RPC server runs on.

yarn.nodemanager.local-dirs
    Type: comma-separated directory names
    A list of directories where node managers allow containers to store intermediate
    data. The data is cleared out when the application ends.

yarn.nodemanager.aux-services
    Type: comma-separated service names
    A list of auxiliary services run by the node manager. A service is implemented by
    the class defined by the property yarn.nodemanager.aux-serv
    ices.service-name.class. By default no auxiliary services are specified.

yarn.nodemanager.resource.memory-mb
    Type: int; Default: 8192
    The amount of physical memory (in MB) which may be allocated to containers
    being run by the node manager.
yarn.nodemanager.vmem-pmem-ratio
    Type: float; Default: 2.1
    The ratio of virtual to physical memory for containers. Virtual memory usage may
    exceed the allocation by this ratio.
YARN treats memory in a more fine-grained manner than the slot-based model used
in the classic implementation of MapReduce. Rather than specifying a fixed maximum
number of map and reduce slots that may run on a tasktracker node at once, YARN
allows applications to request an arbitrary amount of memory (within limits) for a task.
In the YARN model, node managers allocate memory from a pool, so the number of
tasks that are running on a particular node depends on the sum of their memory
requirements, and not simply on a fixed number of slots.
The slot-based model can lead to cluster under-utilization, since the proportion of map
slots to reduce slots is fixed as a cluster-wide configuration. However, the number of
map versus reduce slots that are in demand changes over time: at the beginning of a job
only map slots are needed, while at the end of the job only reduce slots are needed. On
larger clusters with many concurrent jobs the variation in demand for a particular type
of slot may be less pronounced, but there is still wastage. YARN avoids this problem
by not distinguishing between the two types of slot.
The considerations for how much memory to dedicate to a node manager for running
containers are similar to those discussed in "Memory" on page 305. Each Hadoop
daemon uses 1,000 MB, so for a datanode and a node manager the total is 2,000 MB.
Set aside enough for other processes that are running on the machine, and the remainder
can be dedicated to the node manager's containers, by setting the configuration
property yarn.nodemanager.resource.memory-mb to the total allocation in MB. (The default
is 8,192 MB.)
The next step is to determine how to set memory options for individual jobs. There are
two controls: mapred.child.java.opts, which allows you to set the JVM heap size of the
map or reduce task; and mapreduce.map.memory.mb (or mapreduce.reduce.memory.mb),
which is used to specify how much memory you need for map (or reduce) task
containers. The latter setting is used by the scheduler when negotiating for resources in the
cluster, and by the node manager, which runs and monitors the task containers.
For example, suppose that mapred.child.java.opts is set to -Xmx800m, and mapre
duce.map.memory.mb is left at its default value of 1,024 MB. When a map task is run, the
node manager will allocate a 1,024 MB container (decreasing the size of its pool by that
amount for the duration of the task) and launch the task JVM configured with an 800
MB maximum heap size. Note that the JVM process will have a larger memory footprint
than the heap size, and the overhead will depend on such things as the native libraries
that are in use, the size of the permanent generation space, and so on. The important
thing is that the physical memory used by the JVM process, including any processes
that it spawns, such as Streaming or Pipes processes, does not exceed its allocation
(1,024 MB). If a container uses more memory than it has been allocated, then it may be
terminated by the node manager and marked as failed.
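The settings from this example can be sketched as a mapred-site.xml fragment (the values mirror the discussion above and are illustrative):

```xml
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx800m</value> <!-- task JVM heap -->
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>1024</value> <!-- map container allocation -->
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>1024</value> <!-- reduce container allocation -->
  </property>
</configuration>
```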
Schedulers may impose a minimum or maximum on memory allocations. For example,
for the capacity scheduler the default minimum is 1024 MB (set by yarn.schedu
ler.capacity.minimum-allocation-mb), and the default maximum is 10240 MB (set by
yarn.scheduler.capacity.maximum-allocation-mb).
There are also virtual memory constraints that a container must meet. If a container's
virtual memory usage exceeds a given multiple of the allocated physical memory, then
the node manager may terminate the process. The multiple is expressed by the
yarn.nodemanager.vmem-pmem-ratio property, which defaults to 2.1. In the example
above, the virtual memory threshold above which the task may be terminated is 2,150
MB, which is 2.1 × 1,024 MB.
When configuring memory parameters it's very useful to be able to monitor a task's
actual memory usage during a job run, and this is possible via MapReduce task counters.
The counters PHYSICAL_MEMORY_BYTES, VIRTUAL_MEMORY_BYTES, and COMMIT
TED_HEAP_BYTES (described in Table 8-2) provide snapshot values of memory usage and
are therefore suitable for observation during the course of a task attempt.
YARN Daemon Addresses and Ports
YARN daemons run one or more RPC servers and HTTP servers, details of which are
covered in Table 9-10 and Table 9-11.
Table 9-10. YARN RPC server properties

yarn.resourcemanager.address
    The resource manager's RPC server address and port. This is used by the client
    (typically outside the cluster) to communicate with the resource manager.

yarn.resourcemanager.admin.address
    The resource manager's admin RPC server address and port. This is used by the
    admin client (invoked with yarn rmadmin, typically run outside the cluster) to
    communicate with the resource manager.

yarn.resourcemanager.scheduler.address
    The resource manager scheduler's RPC server address and port. This is used by
    (in-cluster) application masters to communicate with the resource manager.

yarn.resourcemanager.resource-tracker.address
    The resource manager resource tracker's RPC server address and port. This is used
    by the (in-cluster) node managers to communicate with the resource manager.

yarn.nodemanager.address
    The node manager's RPC server address and port. This is used by (in-cluster)
    application masters to communicate with node managers.
yarn.nodemanager.localizer.address
    The node manager localizer's RPC server address and port.

mapreduce.jobhistory.address
    The job history server's RPC server address and port. This is used by the client
    (typically outside the cluster) to query job history. This property is set in
    mapred-site.xml.
Table 9-11. YARN HTTP server properties

yarn.resourcemanager.webapp.address
    The resource manager's HTTP server address and port.

yarn.nodemanager.webapp.address
    The node manager's HTTP server address and port.

yarn.web-proxy.address
    The web app proxy server's HTTP server address and port. If not set (the default),
    the web app proxy server will run in the resource manager process.

mapreduce.jobhistory.webapp.address
    The job history server's HTTP server address and port. This property is set in
    mapred-site.xml.

mapreduce.shuffle.port
    Default: 8080
    The shuffle handler's HTTP port number. This is used for serving map outputs,
    and is not a user-accessible web UI. This property is set in mapred-site.xml.
Security
Early versions of Hadoop assumed that HDFS and MapReduce clusters would be used
by a group of cooperating users within a secure environment. The measures for
restricting access were designed to prevent accidental data loss, rather than to prevent
unauthorized access to data. For example, the file permissions system in HDFS prevents
one user from accidentally wiping out the whole filesystem from a bug in a program,
or by mistakenly typing hadoop fs -rmr /, but it doesn't prevent a malicious user from
assuming root's identity (see "Setting User Identity" on page 150) to access or delete
any data in the cluster.
In security parlance, what was missing was a secure authentication mechanism to assure
Hadoop that the user seeking to perform an operation on the cluster is who they claim
to be and therefore trusted. HDFS file permissions provide only a mechanism for
authorization, which controls what a particular user can do to a particular file. For
example, a file may only be readable by a group of users, so anyone not in that group
is not authorized to read it. However, authorization is not enough by itself, since the
system is still open to abuse via spoofing by a malicious user who can gain network
access to the cluster.
It's common to restrict access to data that contains personally identifiable information
(such as an end user's full name or IP address) to a small set of users (of the cluster)
within the organization, who are authorized to access such information. Less sensitive
(or anonymized) data may be made available to a larger set of users. It is convenient to
host a mix of datasets with different security levels on the same cluster (not least because
it means the datasets with lower security levels can be shared). However, to meet
regulatory requirements for data protection, secure authentication must be in place for
shared clusters.
This is the situation that Yahoo! faced in 2009, which led a team of engineers there to
implement secure authentication for Hadoop. In their design, Hadoop itself does not
manage user credentials; instead, it relies on Kerberos, a mature open-source network
authentication protocol, to authenticate the user. In turn, Kerberos doesn't manage
permissions. Kerberos says that a user is who they say they are; it's Hadoop's job to
determine whether that user has permission to perform a given action. There's a lot to
Kerberos, so here we only cover enough to use it in the context of Hadoop, referring
readers who want more background to Kerberos: The Definitive Guide by Jason Garman
(O'Reilly, 2003).
Which Versions of Hadoop Support Kerberos Authentication?
Kerberos for authentication was first added in the 0.20.20x series of releases of Apache
Hadoop. See Table 1-2 for which recent release series support this feature.
Kerberos and Hadoop
At a high level, there are three steps that a client must take to access a service when
using Kerberos, each of which involves a message exchange with a server:
1. Authentication. The client authenticates itself to the Authentication Server and
receives a timestamped Ticket-Granting Ticket (TGT).
2. Authorization. The client uses the TGT to request a service ticket from the Ticket
Granting Server.
3. Service Request. The client uses the service ticket to authenticate itself to the server
that is providing the service the client is using. In the case of Hadoop, this might
be the namenode or the jobtracker.
Together, the Authentication Server and the Ticket Granting Server form the Key
Distribution Center (KDC). The process is shown graphically in Figure 9-2.
Figure 9-2. The three-step Kerberos ticket exchange protocol
The authorization and service request steps are not user-level actions: the client
performs these steps on the user's behalf. The authentication step, however, is normally
carried out explicitly by the user using the kinit command, which will prompt for a
password. However, this doesn't mean you need to enter your password every time
you run a job or access HDFS, since TGTs last for 10 hours by default (and can be
renewed for up to a week). It's common to automate authentication at operating system
login time, thereby providing single sign-on to Hadoop.
In cases where you don't want to be prompted for a password (for running an
unattended MapReduce job, for example), you can create a Kerberos keytab file using
the ktutil command. A keytab is a file that stores passwords and may be supplied to
kinit with the -t option.
An example
Let's look at an example of the process in action. The first step is to enable Kerberos
authentication by setting the hadoop.security.authentication property in core-
site.xml to kerberos. The default setting is simple, which signifies that the old
backwards-compatible (but insecure) behavior of using the operating system user name
to determine identity should be employed.
6. To use Kerberos authentication with Hadoop, you need to install, configure, and run a KDC (Hadoop
does not come with one). Your organization may already have a KDC you can use (an Active Directory
installation, for example); if not, you can set up an MIT Kerberos 5 KDC using the instructions in the
Linux Security Cookbook (O'Reilly, 2003).
We also need to enable service-level authorization by setting hadoop.security.author
ization to true in the same file. You may configure Access Control Lists (ACLs) in the
hadoop-policy.xml configuration file to control which users and groups have permission
to connect to each Hadoop service. Services are defined at the protocol level, so there
are ones for MapReduce job submission, namenode communication, and so on. By
default, all ACLs are set to *, which means that all users have permission to access each
service, but on a real cluster you should lock the ACLs down to only those users and
groups that should have access.
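A minimal core-site.xml sketch covering both settings:

```xml
<?xml version="1.0"?>
<!-- core-site.xml -->
<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value> <!-- default is "simple" -->
  </property>
  <property>
    <name>hadoop.security.authorization</name>
    <value>true</value> <!-- enable service-level ACLs -->
  </property>
</configuration>
```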
The format for an ACL is a comma-separated list of usernames, followed by whitespace,
followed by a comma-separated list of group names. For example, the ACL
preston,howard directors,inventors would authorize access to users named preston
or howard, or in groups directors or inventors.
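As a sketch, this ACL could be applied to the job submission protocol in hadoop-policy.xml (security.job.submission.protocol.acl is one of the per-protocol ACL properties):

```xml
<?xml version="1.0"?>
<!-- hadoop-policy.xml -->
<configuration>
  <property>
    <name>security.job.submission.protocol.acl</name>
    <!-- users, then a space, then groups -->
    <value>preston,howard directors,inventors</value>
  </property>
</configuration>
```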
With Kerberos authentication turned on, let's see what happens when we try to copy
a local file to HDFS:
% hadoop fs -put quangle.txt .
10/07/03 15:44:58 WARN ipc.Client: Exception encountered while connecting to the
server: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSEx
ception: No valid credentials provided (Mechanism level: Failed to find any Ker
beros tgt)]
Bad connection to FS. command aborted. exception: Call to localhost/127.0.0.1:80
20 failed on local exception: java.io.IOException: javax.security.sasl.SaslExcep
tion: GSS initiate failed [Caused by GSSException: No valid credentials provided
(Mechanism level: Failed to find any Kerberos tgt)]
The operation fails, since we don't have a Kerberos ticket. We can get one by
authenticating to the KDC, using kinit:
% kinit
Password for hadoop-user@LOCALDOMAIN: password
% hadoop fs -put quangle.txt .
% hadoop fs -stat %n quangle.txt
And we see that the file is successfully written to HDFS. Notice that even though we
carried out two filesystem commands, we only needed to call kinit once, since the
Kerberos ticket is valid for 10 hours (use the klist command to see the expiry time of
your tickets and kdestroy to invalidate your tickets). After we get a ticket, everything
works just as normal.
Delegation Tokens
In a distributed system like HDFS or MapReduce, there are many client-server
interactions, each of which must be authenticated. For example, an HDFS read operation
will involve multiple calls to the namenode and calls to one or more datanodes. Instead
of using the three-step Kerberos ticket exchange protocol to authenticate each call,
which would present a high load on the KDC on a busy cluster, Hadoop uses delegation
tokens to allow later authenticated access without having to contact the KDC again.
Delegation tokens are created and used transparently by Hadoop on behalf of users, so
there's no action you need to take as a user beyond using kinit to sign in; however, it's
useful to have a basic idea of how they are used.
A delegation token is generated by the server (the namenode in this case), and can be
thought of as a shared secret between the client and the server. On the first RPC call to
the namenode, the client has no delegation token, so it uses Kerberos to authenticate,
and as a part of the response it gets a delegation token from the namenode. In
subsequent calls, it presents the delegation token, which the namenode can verify (since it
generated it using a secret key), and hence the client is authenticated to the server.
When it wants to perform operations on HDFS blocks, the client uses a special kind of
delegation token, called a block access token, that the namenode passes to the client in
response to a metadata request. The client uses the block access token to authenticate
itself to datanodes. This is possible only because the namenode shares its secret key
used to generate the block access token with datanodes (which it sends in heartbeat
messages), so that they can verify block access tokens. Thus, an HDFS block may only
be accessed by a client with a valid block access token from a namenode. This closes
the security hole in unsecured Hadoop where only the block ID was needed to gain
access to a block. This property is enabled by setting dfs.block.access.token.enable
to true.
In MapReduce, job resources and metadata (such as JAR files, input splits, and
configuration files) are shared in HDFS for the jobtracker to access, and user code runs on the
tasktrackers and accesses files on HDFS (the process is explained in "Anatomy of a
MapReduce Job Run" on page 187). Delegation tokens are used by the jobtracker and
tasktrackers to access HDFS during the course of the job. When the job has finished,
the delegation tokens are invalidated.
Delegation tokens are automatically obtained for the default HDFS instance, but if your
job needs to access other HDFS clusters, then you can have the delegation tokens for
these loaded by setting the mapreduce.job.hdfs-servers job property to a comma-
separated list of HDFS URIs.
Other Security Enhancements
Security has been tightened throughout HDFS and MapReduce to protect against
unauthorized access to resources.
The more notable changes are listed here:
• Tasks can be run using the operating system account for the user who submitted
the job, rather than the user running the tasktracker. This means that the operating
system is used to isolate running tasks, so they can't send signals to each other (to
kill another user's tasks, for example), and so local information, such as task data,
is kept private via local file system permissions.
7. At the time of writing, other projects like HBase and Hive had not been integrated with this security model.
This feature is enabled by setting mapred.task.tracker.task-controller to
org.apache.hadoop.mapred.LinuxTaskController. In addition, administrators
need to ensure that each user is given an account on every node in the cluster
(typically using LDAP).
• When tasks are run as the user who submitted the job, the distributed cache
("Distributed Cache" on page 288) is secure: files that are world-readable are put
in a shared cache (the insecure default); otherwise, they go in a private cache, only
readable by the owner.
• Users can view and modify only their own jobs, not others'. This is enabled by
setting mapred.acls.enabled to true. There are two job configuration properties,
mapreduce.job.acl-view-job and mapreduce.job.acl-modify-job, which may be set
to a comma-separated list of users to control who may view or modify a particular
job.
• The shuffle is secure, preventing a malicious user from requesting another user's
map outputs. However, the shuffle is not encrypted, so it is subject to malicious
eavesdropping.
• When appropriately configured, it's no longer possible for a malicious user to run
a rogue secondary namenode, datanode, or tasktracker that can join the cluster
and potentially compromise data stored in the cluster. This is enforced by requiring
daemons to authenticate with the master node they are connecting to.
To enable this feature, you first need to configure Hadoop to use a keytab
previously generated with the ktutil command. For a datanode, for example, you would
set the dfs.datanode.keytab.file property to the keytab filename and dfs.data
node.kerberos.principal to the username to use for the datanode. Finally, the ACL
for the DataNodeProtocol (which is used by datanodes to communicate with the
namenode) must be set in hadoop-policy.xml, by restricting security.datanode.pro
tocol.acl to the datanode's username.
• A datanode may be run on a privileged port (one lower than 1024), so a client may
be reasonably sure that it was started securely.
• A task may only communicate with its parent tasktracker, thus preventing an
attacker from obtaining MapReduce data from another user's job.
One area that hasn't yet been addressed in the security work is encryption: neither RPC
nor block transfers are encrypted. HDFS blocks are not stored in an encrypted form
either. These features are planned for a future release; in fact, encrypting the data
stored in HDFS could be carried out in existing versions of Hadoop by the application
itself (by writing an encryption CompressionCodec, for example).
8. LinuxTaskController uses a setuid executable called task-controller found in the bin directory. You should
ensure that this binary is owned by root and has the setuid bit set (with chmod +s).
Benchmarking a Hadoop Cluster
Is the cluster set up correctly? The best way to answer this question is empirically: run
some jobs and confirm that you get the expected results. Benchmarks make good tests,
as you also get numbers that you can compare with other clusters as a sanity check on
whether your new cluster is performing roughly as expected. And you can tune a cluster
using benchmark results to squeeze the best performance out of it. This is often done
with monitoring systems in place ("Monitoring" on page 349), so you can see how
resources are being used across the cluster.
To get the best results, you should run benchmarks on a cluster that is not being used
by others. In practice, this is just before it is put into service and users start relying on
it. Once users have periodically scheduled jobs on a cluster, it is generally impossible
to find a time when the cluster is not being used (unless you arrange downtime with
users), so you should run benchmarks to your satisfaction before this happens.
Experience has shown that most hardware failures for new systems are hard drive
failures. By running I/O-intensive benchmarks (such as the ones described next), you
can "burn in" the cluster before it goes live.
Hadoop Benchmarks
Hadoop comes with several benchmarks that you can run very easily with minimal
setup cost. Benchmarks are packaged in the test JAR file, and you can get a list of them,
with descriptions, by invoking the JAR file with no arguments:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar
Most of the benchmarks show usage instructions when invoked with no arguments.
For example:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO
Usage: TestFDSIO -read | -write | -clean [-nrFiles N] [-fileSize MB] [-resFile
resultFileName] [-bufferSize Bytes]
Benchmarking HDFS with TestDFSIO
TestDFSIO tests the I/O performance of HDFS. It does this by using a MapReduce job
as a convenient way to read or write files in parallel. Each file is read or written in a
separate map task, and the output of the map is used for collecting statistics relating
to the file just processed. The statistics are accumulated in the reduce to produce a
summary.
The following command writes 10 files of 1,000 MB each:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10
-fileSize 1000
At the end of the run, the results are written to the console and also recorded in a local
file (which is appended to, so you can rerun the benchmark and not lose old results):
% cat TestDFSIO_results.log
----- TestDFSIO ----- : write
Date & time: Sun Apr 12 07:14:09 EDT 2009
Number of files: 10
Total MBytes processed: 10000
Throughput mb/sec: 7.796340865378244
Average IO rate mb/sec: 7.8862199783325195
IO rate std deviation: 0.9101254683525547
Test exec time sec: 163.387
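Note that "Throughput mb/sec" and "Average IO rate mb/sec" are computed differently: the first is an aggregate figure (total megabytes over the sum of task times), while the second is the mean of each file's individual rate, so the two can disagree. The exact formulas live in the TestDFSIO source; the following is only an illustration of the distinction, with made-up per-file numbers:

```python
# Illustrative only: (size in MB, time in seconds) for three hypothetical files.
files = [(1000, 120.0), (1000, 150.0), (1000, 100.0)]

total_mb = sum(size for size, _ in files)
total_time = sum(t for _, t in files)

throughput = total_mb / total_time            # aggregate rate
rates = [size / t for size, t in files]
average_io_rate = sum(rates) / len(rates)     # mean of per-file rates

print(round(throughput, 2), round(average_io_rate, 2))  # 8.11 8.33
```

The gap between the two numbers grows with the variance of the per-file rates, which is why the report also prints a standard deviation.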
The files are written under the /benchmarks/TestDFSIO directory by default (this can
be changed by setting the test.build.data system property), in a directory called io_data.
To run a read benchmark, use the -read argument. Note that these files must already
exist (having been written by TestDFSIO -write):
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -read -nrFiles 10
-fileSize 1000
Here are the results for a real run:
----- TestDFSIO ----- : read
Date & time: Sun Apr 12 07:24:28 EDT 2009
Number of files: 10
Total MBytes processed: 10000
Throughput mb/sec: 80.25553361904304
Average IO rate mb/sec: 98.6801528930664
IO rate std deviation: 36.63507598174921
Test exec time sec: 47.624
When you've finished benchmarking, you can delete all the generated files from HDFS
using the -clean argument:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -clean
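Because the results file is appended to across runs, it can be handy to pull out just the throughput figures when comparing runs. A small sketch of such post-processing (the log format is as shown above; this script is ours, not part of Hadoop):

```python
import re

# A condensed sample of the appended TestDFSIO_results.log format shown above.
log = """\
----- TestDFSIO ----- : write
           Throughput mb/sec: 7.796340865378244
----- TestDFSIO ----- : read
           Throughput mb/sec: 80.25553361904304
"""

# Pair each test kind (write/read) with its throughput figure.
kinds = re.findall(r"----- TestDFSIO ----- : (\w+)", log)
rates = [float(x) for x in re.findall(r"Throughput mb/sec: ([\d.]+)", log)]
results = dict(zip(kinds, rates))
print(results)  # {'write': 7.796340865378244, 'read': 80.25553361904304}
```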
Benchmarking MapReduce with Sort
Hadoop comes with a MapReduce program that does a partial sort of its input. It is
very useful for benchmarking the whole MapReduce system, as the full input dataset
is transferred through the shuffle. The three steps are: generate some random data,
perform the sort, then validate the results.
First, we generate some random data using RandomWriter. It runs a MapReduce job
with 10 maps per node, and each map generates (approximately) 10 GB of random
binary data, with keys and values of various sizes. You can change these values if you
like by setting the properties test.randomwriter.maps_per_host and
test.randomwrite.bytes_per_map. There are also settings for the size ranges of the keys
and values; see RandomWriter for details.
Here's how to invoke RandomWriter (found in the example JAR file, not the test one) to
write its output to a directory called random-data:
% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar randomwriter random-data
Next, we can run the Sort program:
% hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort random-data sorted-data
The overall execution time of the sort is the metric we are interested in, but it's
instructive to watch the job's progress via the web UI (http://jobtracker-host:50030/),
where you can get a feel for how long each phase of the job takes. Adjusting the
parameters mentioned in "Tuning a Job" on page 176 is a useful exercise, too.
As a final sanity check, we validate that the data in sorted-data is, in fact, correctly sorted:
% hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar testmapredsort -sortInput random-data \
-sortOutput sorted-data
This command runs the SortValidator program, which performs a series of checks on
the unsorted and sorted data to check whether the sort is accurate. It reports the
outcome to the console at the end of its run:
SUCCESS! Validated the MapReduce framework's 'sort' successfully.
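Conceptually, the validation boils down to two properties: every input record appears in the output, and the output keys are in non-decreasing order. A toy sketch of that idea (SortValidator itself does considerably more, including checksum-based record accounting across partitions):

```python
def validate_sort(input_records, output_records):
    """Check that output is a permutation of input and is sorted by key."""
    keys = [k for k, _ in output_records]
    is_sorted = all(a <= b for a, b in zip(keys, keys[1:]))
    same_records = sorted(input_records) == sorted(output_records)
    return is_sorted and same_records

unsorted = [(b"c", 1), (b"a", 2), (b"b", 3)]
print(validate_sort(unsorted, sorted(unsorted)))  # True
print(validate_sort(unsorted, unsorted))          # False: keys out of order
```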
Other benchmarks
There are many more Hadoop benchmarks, but the following are widely used:
• MRBench (invoked with mrbench) runs a small job a number of times. It acts as a good
counterpoint to sort, as it checks whether small job runs are responsive.
• NNBench (invoked with nnbench) is useful for load-testing namenode hardware.
• Gridmix is a suite of benchmarks designed to model a realistic cluster workload
by mimicking a variety of data-access patterns seen in practice. See the documentation
in the distribution for how to run Gridmix, and the blog post at http://developer.yahoo.net/blogs/hadoop/2010/01/gridmix3_emulating_production.html for
more background.
User Jobs
For tuning, it is best to include a few jobs that are representative of the jobs that your
users run, so your cluster is tuned for these and not just for the standard benchmarks.
If this is your first Hadoop cluster and you don't have any user jobs yet, then Gridmix
is a good substitute.
9. In a similar vein, PigMix is a set of benchmarks for Pig available at https://cwiki.apache.org/confluence/
When running your own jobs as benchmarks, you should select a dataset for your user
jobs that you use each time you run the benchmarks to allow comparisons between
runs. When you set up a new cluster, or upgrade a cluster, you will be able to use the
same dataset to compare the performance with previous runs.
Hadoop in the Cloud
Although many organizations choose to run Hadoop in-house, it is also popular to run
Hadoop in the cloud on rented hardware or as a service. For instance, Cloudera offers
tools for running Hadoop (see Appendix B) in a public or private cloud, and Amazon
has a Hadoop cloud service called Elastic MapReduce.
In this section, we look at running Hadoop on Amazon EC2, which is a great way to
try out your own Hadoop cluster on a low-commitment, trial basis.
Hadoop on Amazon EC2
Amazon Elastic Compute Cloud (EC2) is a computing service that allows customers
to rent computers (instances) on which they can run their own applications. A customer
can launch and terminate instances on demand, paying by the hour for active instances.
The Apache Whirr project (http://whirr.apache.org/) provides a Java API and a set of
scripts that make it easy to run Hadoop on EC2 and other cloud providers. The scripts
allow you to perform such operations as launching or terminating a cluster, or listing
the running instances in a cluster.
Running Hadoop on EC2 is especially appropriate for certain workflows. For example,
if you store data on Amazon S3, you can run a cluster on EC2 and run MapReduce
jobs that read the S3 data and write output back to S3, before shutting down the cluster.
If you're working with longer-lived clusters, you might copy S3 data onto HDFS running
on EC2 for more efficient processing, as HDFS can take advantage of data locality,
but S3 cannot (since S3 storage is not collocated with EC2 nodes).
First, install Whirr by downloading a recent release tarball and unpacking it on the
machine you want to launch the cluster from, as follows:
% tar xzf whirr-x.y.z.tar.gz
Whirr uses SSH to communicate with machines running in the cloud, so it's a good
idea to generate an SSH keypair for exclusive use with Whirr. Here we create an RSA
keypair with an empty passphrase, stored in a file called id_rsa_whirr in the current
user's .ssh directory:
10. There are also bash scripts in the src/contrib/ec2 subdirectory of the Hadoop distribution, but these are
deprecated in favor of Whirr.
% ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr
Do not confuse the Whirr SSH keypair with any certificates, private
keys, or SSH keypairs associated with your Amazon Web Services
account. Whirr is designed to work with many cloud providers, and it
must have access to both the public and private SSH key of a passphrase-less
keypair that's read from the local filesystem. In practice, it's simplest
to generate a new keypair for Whirr, as we did here.
We need to tell Whirr our cloud provider credentials. We can export them as
environment variables as follows, although you can alternatively specify them on the
command line, or in the configuration file for the service:
% export AWS_ACCESS_KEY_ID='...'
% export AWS_SECRET_ACCESS_KEY='...'
Launching a cluster
We are now ready to launch a cluster. Whirr comes with several recipe files for
launching common service configurations, and here we use the recipe to run Hadoop
on EC2:
% bin/whirr launch-cluster --config recipes/hadoop-ec2.properties \
--private-key-file ~/.ssh/id_rsa_whirr
The launch-cluster command provisions the cloud instances and starts the services
running on them, before returning control to the user.
Before we start using the cluster, let's look at Whirr configuration in more detail.
Configuration parameters are passed to Whirr commands as bundles in a configuration file
specified by the --config option, or individually using command-line arguments, like
the --private-key-file argument we used to indicate the location of the SSH private
key file.
The recipe file is actually just a Java properties file that defines a number of Whirr
properties. Let's step through the salient properties from hadoop-ec2.properties,
starting with the two properties that define the cluster and the services running on it:
whirr.cluster-name=hadoop
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,5 hadoop-datanode+hadoop-tasktracker
Every cluster has a name, specified by whirr.cluster-name, which serves to identify the
cluster so you can perform operations on it, like listing all running instances, or
terminating the cluster. The name must be unique within the cloud account that the cluster
is running in.
The whirr.instance-templates property defines the services that run on a cluster. An
instance template specifies a cardinality and a set of roles that run on each instance of
that type. Thus, we have one instance running in both the hadoop-namenode role and
the hadoop-jobtracker role. There are also five instances running a hadoop-datanode and
a hadoop-tasktracker. With whirr.instance-templates, you can define the precise
composition of your cluster; there are plenty of other services you can run in addition
to Hadoop, and you can discover them by running bin/whirr with no arguments.
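The template string has a simple shape: comma-separated groups, each a count followed by plus-separated roles. A sketch of how such a string breaks down (this little parser is ours for illustration, not Whirr's):

```python
def parse_instance_templates(spec):
    """Parse '1 role-a+role-b,5 role-c' into (count, [roles]) pairs."""
    templates = []
    for group in spec.split(","):
        count, roles = group.strip().split(" ", 1)
        templates.append((int(count), roles.split("+")))
    return templates

spec = "1 hadoop-namenode+hadoop-jobtracker,5 hadoop-datanode+hadoop-tasktracker"
print(parse_instance_templates(spec))
# [(1, ['hadoop-namenode', 'hadoop-jobtracker']),
#  (5, ['hadoop-datanode', 'hadoop-tasktracker'])]
```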
The next group of properties specifies cloud credentials:
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
The whirr.provider property defines the cloud provider, here EC2 (other supported
providers are listed in the Whirr documentation). The whirr.identity and
whirr.credential properties are the cloud-specific credentials; roughly speaking, the username
and password, although the terminology varies from provider to provider.
The final three parameters offer control over the cluster hardware (instance capabilities,
like memory, disk, CPU, and network speed), the machine image (operating system), and
geographic location (data center). These are all provider-dependent, but if you omit
them, Whirr will try to pick good defaults.
Properties in the file are prefixed with whirr., but if they are passed as arguments on
the command line, the prefix is dropped. So, for example, you could set the cluster
name by adding --cluster-name hadoop on the command line (and this would take
precedence over any value set in the properties file). Conversely, we could have set the
private key file in the properties file by adding a line like
whirr.private-key-file=~/.ssh/id_rsa_whirr
There are also properties for specifying the version of Hadoop to run on the cluster,
and for setting Hadoop configuration properties across the cluster (details are in the
recipe file).
Running a proxy
To use the cluster, network traffic from the client needs to be proxied through the
master node of the cluster using an SSH tunnel, which we can set up using the following
command:
% . ~/.whirr/hadoop/hadoop-proxy.sh
You should keep the proxy running as long as the cluster is running. When you have
finished with the cluster, stop the proxy with Ctrl-C.
Running a MapReduce job
You can run MapReduce jobs either from within the cluster or from an external
machine. Here we show how to run a job from the machine we launched the cluster on.
Note that this requires that the same version of Hadoop has been installed locally as is
running on the cluster.
When we launched the cluster, Hadoop site configuration files were created in the
directory ~/.whirr/hadoop. We can use this to connect to the cluster by setting the
HADOOP_CONF_DIR environment variable as follows:
% export HADOOP_CONF_DIR=~/.whirr/hadoop
The cluster's filesystem is empty, so before we run a job, we need to populate it with
data. Doing a parallel copy from S3 (see "Hadoop Filesystems" on page 54 for more on
the S3 filesystems in Hadoop) using Hadoop's distcp tool is an efficient way to transfer
data into HDFS:
% hadoop distcp \
-Dfs.s3n.awsAccessKeyId='...' \
-Dfs.s3n.awsSecretAccessKey='...' \
s3n://hadoopbook/ncdc/all input/ncdc/all
The permissions on the files in the hadoopbook S3 bucket only allow
copying to the US East EC2 region. This means you should run the
distcp command from within that region; the easiest way to achieve
that is to log into the master node (its address is printed to the console
during launch) with:
% ssh -i ~/.ssh/id_rsa_whirr master_host
After the data has been copied, we can run a job in the usual way:
% hadoop jar hadoop-examples.jar MaxTemperatureWithCombiner \
/user/$USER/input/ncdc/all /user/$USER/output
Alternatively, we could have specified the output to be on S3, as follows:
% hadoop jar hadoop-examples.jar MaxTemperatureWithCombiner \
/user/$USER/input/ncdc/all s3n://mybucket/output
You can track the progress of the job using the jobtracker's web UI, found at
http://master_host:50030/. To access web pages running on worker nodes, you need to set up a
proxy auto-config (PAC) file in your browser. See the Whirr documentation for details
on how to do this.
Shutting down a cluster
To shut down the cluster, issue the destroy-cluster command:
% bin/whirr destroy-cluster --config recipes/hadoop-ec2.properties
This will terminate all the running instances in the cluster and delete all the data stored
in the cluster.
Administering Hadoop
The previous chapter was devoted to setting up a Hadoop cluster. In this chapter, we
look at the procedures to keep a cluster running smoothly.
Persistent Data Structures
As an administrator, it is invaluable to have a basic understanding of how the components
of HDFS (the namenode, the secondary namenode, and the datanodes)
organize their persistent data on disk. Knowing which files are which can help you
diagnose problems or spot that something is awry.
Namenode directory structure
A newly formatted namenode creates the following directory structure:
${dfs.name.dir}/current/VERSION
${dfs.name.dir}/current/edits
${dfs.name.dir}/current/fsimage
${dfs.name.dir}/current/fstime
Recall from Chapter 9 that the dfs.name.dir property is a list of directories, with the
same contents mirrored in each directory. This mechanism provides resilience,
particularly if one of the directories is an NFS mount, as is recommended.
The VERSION file is a Java properties file that contains information about the version
of HDFS that is running. Here are the contents of a typical file:
#Tue Mar 10 19:21:36 GMT 2009
namespaceID=134368441
cTime=0
storageType=NAME_NODE
layoutVersion=-18
The layoutVersion is a negative integer that defines the version of HDFS's persistent
data structures. This version number has no relation to the release number of the
Hadoop distribution. Whenever the layout changes, the version number is decremented
(for example, the version after -18 is -19). When this happens, HDFS needs to be
upgraded, since a newer namenode (or datanode) will not operate if its storage layout
is an older version. Upgrading HDFS is covered in "Upgrades" on page 360.
The namespaceID is a unique identifier for the filesystem, which is created when the
filesystem is first formatted. The namenode uses it to identify new datanodes, since
they will not know the namespaceID until they have registered with the namenode.
The cTime property marks the creation time of the namenode's storage. For newly
formatted storage, the value is always zero, but it is updated to a timestamp whenever the
filesystem is upgraded.
The storageType indicates that this storage directory contains data structures for a namenode.
The other files in the namenode's storage directory are edits, fsimage, and fstime. These
are all binary files, which use Hadoop Writable objects as their serialization format (see
"Serialization" on page 94). To understand what these files are for, we need to dig into
the workings of the namenode a little more.
The filesystem image and edit log
When a filesystem client performs a write operation (such as creating or moving a file),
it is first recorded in the edit log. The namenode also has an in-memory representation
of the filesystem metadata, which it updates after the edit log has been modified. The
in-memory metadata is used to serve read requests.
The edit log is flushed and synced after every write before a success code is returned to
the client. For namenodes that write to multiple directories, the write must be flushed
and synced to every copy before returning successfully. This ensures that no operation
is lost due to machine failure.
The fsimage file is a persistent checkpoint of the filesystem metadata. However, it is
not updated for every filesystem write operation, since writing out the fsimage file,
which can grow to be gigabytes in size, would be very slow. This does not compromise
resilience, however, because if the namenode fails, then the latest state of its metadata
can be reconstructed by loading the fsimage from disk into memory, then applying each
of the operations in the edit log. In fact, this is precisely what the namenode does when
it starts up (see "Safe Mode" on page 342).
The fsimage file contains a serialized form of all the directory and file
inodes in the filesystem. Each inode is an internal representation of a
file or directory's metadata and contains such information as the file's
replication level, modification and access times, access permissions,
block size, and the blocks a file is made up of. For directories, the
modification time, permissions, and quota metadata are stored.
The fsimage file does not record the datanodes on which the blocks are
stored. Instead, the namenode keeps this mapping in memory, which it
constructs by asking the datanodes for their block lists when they join
the cluster and periodically afterward to ensure the namenode's block
mapping is up-to-date.
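The in-memory mapping built from those block reports can be pictured as a dictionary from block ID to the set of datanodes holding a replica. A toy sketch of that bookkeeping (not Hadoop's actual data structures):

```python
from collections import defaultdict

block_map = defaultdict(set)  # block id -> set of datanode names

def process_block_report(datanode, block_ids):
    """Fold one datanode's block report into the in-memory map."""
    for block_id in block_ids:
        block_map[block_id].add(datanode)

# Two datanodes check in with overlapping block lists.
process_block_report("dn1", ["blk_1", "blk_2"])
process_block_report("dn2", ["blk_2", "blk_3"])
print(sorted(block_map["blk_2"]))  # ['dn1', 'dn2']
```

Because this map lives only in memory, it must be rebuilt from fresh block reports every time the namenode starts, which is exactly what safe mode (described later) waits for.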
As described, the edits file would grow without bound. Though this state of affairs
would have no impact on the system while the namenode is running, if the namenode
were restarted, it would take a long time to apply each of the operations in its (very
long) edit log. During this time, the filesystem would be offline, which is generally
undesirable.
The solution is to run the secondary namenode, whose purpose is to produce
checkpoints of the primary's in-memory filesystem metadata. The checkpointing process
proceeds as follows (and is shown schematically in Figure 10-1):
1. The secondary asks the primary to roll its edits file, so new edits go to a new file.
2. The secondary retrieves fsimage and edits from the primary (using HTTP GET).
3. The secondary loads fsimage into memory, applies each operation from edits, then
creates a new consolidated fsimage file.
4. The secondary sends the new fsimage back to the primary (using HTTP POST).
5. The primary replaces the old fsimage with the new one from the secondary, and
the old edits file with the new one it started in step 1. It also updates the fstime file
to record the time that the checkpoint was taken.
At the end of the process, the primary has an up-to-date fsimage file and a shorter
edits file (it is not necessarily empty, as it may have received some edits while the
checkpoint was being taken). It is possible for an administrator to run this process
manually while the namenode is in safe mode, using the hadoop dfsadmin
-saveNamespace command.
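The heart of step 3, loading fsimage and replaying edits to get a consolidated image, can be illustrated with a toy model in which the image is a dictionary of paths and the edit log is a list of operations (the real files are binary Writable streams, as noted above, and support many more operation types):

```python
def apply_edits(fsimage, edits):
    """Replay edit-log operations over a checkpointed image (toy model)."""
    image = dict(fsimage)  # don't mutate the loaded checkpoint
    for op, path, value in edits:
        if op == "create":
            image[path] = value
        elif op == "delete":
            image.pop(path, None)
    return image

fsimage = {"/user/tom/a": "meta-a"}
edits = [("create", "/user/tom/b", "meta-b"), ("delete", "/user/tom/a", None)]
print(apply_edits(fsimage, edits))  # {'/user/tom/b': 'meta-b'}
```

Writing the merged dictionary back out corresponds to the new consolidated fsimage; the edit log can then start again from empty.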
1. From Hadoop version 0.22.0 onwards, you can start a namenode with the -checkpoint option so that it
runs the checkpointing process against another (primary) namenode. This is functionally equivalent to
running a secondary namenode, but at the time of writing offers no advantages over the secondary
namenode (and indeed the secondary namenode is the most tried and tested option). When running in
a high-availability environment ("HDFS High-Availability" on page 50), it will be possible for the standby
node to do checkpointing.
This procedure makes it clear why the secondary has similar memory requirements to
the primary (since it loads the fsimage into memory), which is the reason that the
secondary needs a dedicated machine on large clusters.
The schedule for checkpointing is controlled by two configuration parameters. The
secondary namenode checkpoints every hour (fs.checkpoint.period in seconds) or
sooner if the edit log has reached 64 MB (fs.checkpoint.size in bytes), which it checks
every five minutes.
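In other words, every five minutes the secondary evaluates two conditions and checkpoints if either holds: the period has elapsed, or the edit log has grown past the size threshold. A sketch of that decision using the defaults just described (the property names are from the text; the function is ours):

```python
FS_CHECKPOINT_PERIOD = 3600            # seconds (fs.checkpoint.period default)
FS_CHECKPOINT_SIZE = 64 * 1024 * 1024  # bytes (fs.checkpoint.size default)

def should_checkpoint(seconds_since_last, edit_log_bytes):
    """Decide whether a checkpoint is due, as checked every five minutes."""
    return (seconds_since_last >= FS_CHECKPOINT_PERIOD
            or edit_log_bytes >= FS_CHECKPOINT_SIZE)

print(should_checkpoint(300, 10 * 1024 * 1024))  # False: young log, under 64 MB
print(should_checkpoint(300, 70 * 1024 * 1024))  # True: log has passed 64 MB
print(should_checkpoint(4000, 1024))             # True: over an hour has elapsed
```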
Secondary namenode directory structure
A useful side effect of the checkpointing process is that the secondary has a checkpoint
at the end of the process, which can be found in a subdirectory called
previous.checkpoint. This can be used as a source for making (stale) backups of the namenode's
metadata.
Figure 10-1. The checkpointing process
The layout of this directory and of the secondary's current directory is identical to the
namenode's. This is by design, since in the event of total namenode failure (when there
are no recoverable backups, even from NFS), it allows recovery from a secondary
namenode. This can be achieved either by copying the relevant storage directory to a
new namenode, or, if the secondary is taking over as the new primary namenode, by
using the -importCheckpoint option when starting the namenode daemon. The
-importCheckpoint option will load the namenode metadata from the latest checkpoint
in the directory defined by the fs.checkpoint.dir property, but only if there is no
metadata in the dfs.name.dir directory, so there is no risk of overwriting precious
metadata.
Datanode directory structure
Unlike namenodes, datanodes do not need to be explicitly formatted, since they create
their storage directories automatically on startup. Here are the key files and directories:
${dfs.data.dir}/current/VERSION
${dfs.data.dir}/current/blk_<id_1>
${dfs.data.dir}/current/blk_<id_1>.meta
${dfs.data.dir}/current/blk_<id_2>
${dfs.data.dir}/current/blk_<id_2>.meta
${dfs.data.dir}/current/subdir0/
${dfs.data.dir}/current/subdir1/
A datanode's VERSION file is very similar to the namenode's:
#Tue Mar 10 21:32:31 GMT 2009
namespaceID=134368441
storageID=DS-547717739-172.16.85.1-50010-1236720751627
cTime=0
storageType=DATA_NODE
layoutVersion=-18
The namespaceID, cTime, and layoutVersion are all the same as the values in the
namenode (in fact, the namespaceID is retrieved from the namenode when the datanode first
connects). The storageID is unique to the datanode (it is the same across all storage
directories) and is used by the namenode to uniquely identify the datanode. The
storageType identifies this directory as a datanode storage directory.
The other files in the datanode's current storage directory are the files with the blk_
prefix. There are two types: the HDFS blocks themselves (which just consist of the file's
raw bytes) and the metadata for a block (with a .meta suffix). A block file just consists
of the raw bytes of a portion of the file being stored; the metadata file is made up of a
header with version and type information, followed by a series of checksums for
sections of the block.
When the number of blocks in a directory grows to a certain size, the datanode creates
a new subdirectory in which to place new blocks and their accompanying metadata. It
creates a new subdirectory every time the number of blocks in a directory reaches 64
(set by the dfs.datanode.numblocks configuration property). The effect is to have a tree
with high fan-out, so even for systems with a very large number of blocks, the directories
will only be a few levels deep. By taking this measure, the datanode ensures that there
is a manageable number of files per directory, which avoids the problems that most
operating systems encounter when there are a large number of files (tens or hundreds
of thousands) in a single directory.
If the configuration property dfs.data.dir specifies multiple directories (on different
drives), blocks are written to each in a round-robin fashion. Note that blocks are not
replicated on each drive on a single datanode; block replication is across distinct
datanodes.
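Round-robin placement simply cycles through the configured directories for successive blocks. A sketch of the idea (the directory paths are made up; Hadoop's actual volume-choosing code lives inside the datanode):

```python
import itertools

data_dirs = ["/disk1/dfs/data", "/disk2/dfs/data", "/disk3/dfs/data"]
next_dir = itertools.cycle(data_dirs)

# Each new block goes to the next directory in turn; no block is
# written to more than one local drive.
placements = [next(next_dir) for _ in range(4)]
print(placements)
# ['/disk1/dfs/data', '/disk2/dfs/data', '/disk3/dfs/data', '/disk1/dfs/data']
```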
Safe Mode
When the namenode starts, the first thing it does is load its image file (fsimage) into
memory and apply the edits from the edit log (edits). Once it has reconstructed a
consistent in-memory image of the filesystem metadata, it creates a new fsimage file
(effectively doing the checkpoint itself, without recourse to the secondary namenode)
and an empty edit log. Only at this point does the namenode start listening for RPC
and HTTP requests. However, the namenode is running in safe mode, which means
that it offers only a read-only view of the filesystem to clients.
Strictly speaking, in safe mode, only filesystem operations that access
the filesystem metadata (like producing a directory listing) are guaranteed
to work. Reading a file will work only if the blocks are available on
the current set of datanodes in the cluster; and file modifications (writes,
deletes, or renames) will always fail.
Recall that the locations of blocks in the system are not persisted by the namenode;
this information resides with the datanodes, in the form of a list of the blocks each is
storing. During normal operation of the system, the namenode has a map of block
locations stored in memory. Safe mode is needed to give the datanodes time to check
in to the namenode with their block lists, so the namenode can be informed of enough
block locations to run the filesystem effectively. If the namenode didn't wait for enough
datanodes to check in, it would start the process of replicating blocks to new
datanodes, which would be unnecessary in most cases (since it only needed to wait for
the extra datanodes to check in), and would put a great strain on the cluster's resources.
Indeed, while in safe mode, the namenode does not issue any block replication or
deletion instructions to datanodes.
Safe mode is exited when the minimal replication condition is reached, plus an extension
time of 30 seconds. The minimal replication condition is when 99.9% of the blocks in
the whole filesystem meet their minimum replication level (which defaults to one, and
is set by dfs.replication.min; see Table 10-1).
When you are starting a newly formatted HDFS cluster, the namenode does not go into
safe mode, since there are no blocks in the system.
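The exit condition can be stated as a fraction: count the blocks with at least dfs.replication.min reported replicas, and leave safe mode (after the extension period) once that fraction reaches dfs.safemode.threshold.pct. A sketch of that check using the default values (this function is an illustration, not Hadoop's implementation):

```python
DFS_REPLICATION_MIN = 1                # dfs.replication.min default
DFS_SAFEMODE_THRESHOLD_PCT = 0.999     # dfs.safemode.threshold.pct default

def minimal_replication_reached(reported_replicas_per_block):
    """True once enough blocks meet their minimum replication level."""
    blocks = list(reported_replicas_per_block)
    if not blocks:  # newly formatted cluster: no blocks, so no waiting
        return True
    ok = sum(1 for n in blocks if n >= DFS_REPLICATION_MIN)
    return ok / len(blocks) >= DFS_SAFEMODE_THRESHOLD_PCT

print(minimal_replication_reached([]))         # True: no blocks at all
print(minimal_replication_reached([3, 3, 1]))  # True: every block reported
print(minimal_replication_reached([3, 3, 0]))  # False: only 2 of 3 reported
```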
Table 10-1. Safe mode properties
Property name Type Default value Description
dfs.replication.min int 1 The minimum number of replicas that have to be written
for a write to be successful.
dfs.safemode.threshold.pct float 0.999 The proportion of blocks in the system that must
meet the minimum replication level defined by
dfs.replication.min before the namenode will
exit safe mode. Setting this value to 0 or less forces the
namenode not to start in safe mode. Setting this value
to more than 1 means the namenode never exits safe
mode.
dfs.safemode.extension int 30,000 The time, in milliseconds, to extend safe mode by after
the minimum replication condition defined by
dfs.safemode.threshold.pct has been satisfied.
For small clusters (tens of nodes), it can be set to 0.
Entering and leaving safe mode
To see whether the namenode is in safe mode, you can use the dfsadmin command:
% hadoop dfsadmin -safemode get
Safe mode is ON
The front page of the HDFS web UI provides another indication of whether the
namenode is in safe mode.
Sometimes you want to wait for the namenode to exit safe mode before carrying out a
command, particularly in scripts. The wait option achieves this:
hadoop dfsadmin -safemode wait
# command to read or write a file
An administrator has the ability to make the namenode enter or leave safe mode at any
time. It is sometimes necessary to do this when carrying out maintenance on the cluster
or after upgrading a cluster to confirm that data is still readable. To enter safe mode,
use the following command:
% hadoop dfsadmin -safemode enter
Safe mode is ON
You can use this command when the namenode is still in safe mode while starting up
to ensure that it never leaves safe mode. Another way of making sure that the namenode
stays in safe mode indefinitely is to set the property dfs.safemode.threshold.pct to a
value over one.
You can make the namenode leave safe mode by using:
% hadoop dfsadmin -safemode leave
Safe mode is OFF
Audit Logging
HDFS has the ability to log all filesystem access requests, a feature that some
organizations require for auditing purposes. Audit logging is implemented using log4j logging
at the INFO level, and in the default configuration it is disabled, as the log threshold is
set to WARN in log4j.properties:
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=WARN
You can enable audit logging by replacing WARN with INFO, and the result will be a log
line written to the namenode's log for every HDFS event. Here's an example for a list
status request on /user/tom:
2009-03-13 07:11:22,982 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.
audit: ugi=tom,staff,admin ip=/ cmd=listStatus src=/user/tom dst=null
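Each audit line is a series of key=value fields, which makes it straightforward to post-process. A sketch that splits one such line into a dictionary (field names as in the example above; this parser is ours for illustration):

```python
def parse_audit_fields(line):
    """Extract key=value audit fields from a namenode audit log line."""
    fields = {}
    for token in line.split():
        if "=" in token:
            key, _, value = token.partition("=")
            fields[key] = value
    return fields

line = "audit: ugi=tom,staff,admin cmd=listStatus src=/user/tom dst=null"
print(parse_audit_fields(line))
# {'ugi': 'tom,staff,admin', 'cmd': 'listStatus', 'src': '/user/tom', 'dst': 'null'}
```

Feeding such dictionaries into a log aggregator makes it easy to answer questions like "which users ran the most listStatus requests?"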
It is a good idea to configure log4j so that the audit log is written to a separate file and isn't mixed up with the namenode's other log entries. An example of how to do this can be found on the Hadoop wiki at http://wiki.apache.org/hadoop/HowToConfigure.
Tools
dfsadmin
The dfsadmin tool is a multipurpose tool for finding information about the state of HDFS, as well as performing administration operations on HDFS. It is invoked as hadoop dfsadmin and requires superuser privileges.
Some of the available commands to dfsadmin are described in Table 10-2. Use the -help command to get more information.
344 | Chapter 10: Administering Hadoop
Table 10-2. dfsadmin commands
Command Description
-help Shows help for a given command, or all commands if no command is specified.
-report Shows filesystem statistics (similar to those shown in the web UI) and information on connected datanodes.
-metasave Dumps information to a file in Hadoop’s log directory about blocks that are being replicated or
deleted, and a list of connected datanodes.
-safemode Changes or queries the state of safe mode. See “Safe Mode” on page 342.
-saveNamespace Saves the current in-memory filesystem image to a new fsimage file and resets the edits file. This
operation may be performed only in safe mode.
-refreshNodes Updates the set of datanodes that are permitted to connect to the namenode. See “Commissioning
and Decommissioning Nodes” on page 357.
-upgradeProgress Gets information on the progress of an HDFS upgrade or forces an upgrade to proceed. See
“Upgrades” on page 360.
-finalizeUpgrade Removes the previous version of the datanodes’ and namenode’s storage directories. Used after
an upgrade has been applied and the cluster is running successfully on the new version. See
“Upgrades” on page 360.
-setQuota Sets directory quotas. Directory quotas set a limit on the number of names (files or directories) in
the directory tree. Directory quotas are useful for preventing users from creating large numbers
of small files, a measure that helps preserve the namenode’s memory (recall that accounting
information for every file, directory, and block in the filesystem is stored in memory).
-clrQuota Clears specified directory quotas.
-setSpaceQuota Sets space quotas on directories. Space quotas set a limit on the size of files that may be stored in
a directory tree. They are useful for giving users a limited amount of storage.
-clrSpaceQuota Clears specified space quotas.
-refreshServiceAcl Refreshes the namenode’s service-level authorization policy file.
Filesystem check (fsck)
Hadoop provides an fsck utility for checking the health of files in HDFS. The tool looks for blocks that are missing from all datanodes, as well as under- or over-replicated blocks. Here is an example of checking the whole filesystem for a small cluster:
% hadoop fsck /
......................Status: HEALTHY
Total size: 511799225 B
Total dirs: 10
Total files: 22
Total blocks (validated): 22 (avg. block size 23263601 B)
Minimally replicated blocks: 22 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 4
Number of racks: 1
The filesystem under path '/' is HEALTHY
fsck recursively walks the filesystem namespace, starting at the given path (here the filesystem root), and checks the files it finds. It prints a dot for every file it checks. To check a file, fsck retrieves the metadata for the file's blocks and looks for problems or inconsistencies. Note that fsck retrieves all of its information from the namenode; it does not communicate with any datanodes to actually retrieve any block data.
Most of the output from fsck is self-explanatory, but here are some of the conditions it looks for:
Over-replicated blocks
These are blocks that exceed their target replication for the file they belong to. Over-replication is not normally a problem, and HDFS will automatically delete excess replicas.
Under-replicated blocks
These are blocks that do not meet their target replication for the file they belong to. HDFS will automatically create new replicas of under-replicated blocks until they meet the target replication. You can get information about the blocks being replicated (or waiting to be replicated) using hadoop dfsadmin -metasave.
Misreplicated blocks
These are blocks that do not satisfy the block replica placement policy (see “Replica Placement” on page 74). For example, for a replication level of three in a multirack cluster, if all three replicas of a block are on the same rack, the block is misreplicated, since the replicas should be spread across at least two racks for resilience. HDFS will automatically re-replicate misreplicated blocks so that they satisfy the rack placement policy.
Corrupt blocks
These are blocks whose replicas are all corrupt. Blocks with at least one noncorrupt replica are not reported as corrupt; the namenode will replicate the noncorrupt replica until the target replication is met.
Missing replicas
These are blocks with no replicas anywhere in the cluster.
Corrupt or missing blocks are the biggest cause for concern, as it means data has been lost. By default, fsck leaves files with corrupt or missing blocks, but you can tell it to perform one of the following actions on them:
• Move the affected files to the /lost+found directory in HDFS, using the -move option. Files are broken into chains of contiguous blocks to aid any salvaging efforts you may attempt.
• Delete the affected files, using the -delete option. Files cannot be recovered after being deleted.
Finding the blocks for a file
The fsck tool provides an easy way to find out which blocks are in any particular file. For example:
% hadoop fsck /user/tom/part-00007 -files -blocks -racks
/user/tom/part-00007 25582428 bytes, 1 block(s): OK
0. blk_-3724870485760122836_1035 len=25582428 repl=3 [/default-rack/,
/default-rack/, /default-rack/]
This says that the file /user/tom/part-00007 is made up of one block and shows the datanodes where the block's replicas are located. The fsck options used are as follows:
• The -files option shows a line with the filename, size, number of blocks, and its health (whether there are any missing blocks).
• The -blocks option shows information about each block in the file, one line per block.
• The -racks option displays the rack location and the datanode addresses for each block.
Running hadoop fsck without any arguments displays full usage instructions.
Datanode block scanner
Every datanode runs a block scanner, which periodically verifies all the blocks stored on the datanode. This allows bad blocks to be detected and fixed before they are read by clients. The DataBlockScanner maintains a list of blocks to verify and scans them one by one for checksum errors. The scanner employs a throttling mechanism to preserve disk bandwidth on the datanode.
Blocks are periodically verified every three weeks to guard against disk errors over time (this is controlled by the dfs.datanode.scan.period.hours property, which defaults to 504 hours). Corrupt blocks are reported to the namenode to be fixed.
You can get a block verification report for a datanode by visiting the datanode's web interface at http://datanode:50075/blockScannerReport. Here's an example of a report, which should be self-explanatory:
Total Blocks : 21131
Verified in last hour : 70
Verified in last day : 1767
Verified in last week : 7360
Verified in last four weeks : 20057
Verified in SCAN_PERIOD : 20057
Not yet verified : 1074
Verified since restart : 35912
Scans since restart : 6541
Scan errors since restart : 0
Transient scan errors : 0
Current scan rate limit KBps : 1024
Progress this period : 109%
Time left in cur period : 53.08%
By specifying the listblocks parameter, http://datanode:50075/blockScannerReport?listblocks, the report is preceded by a list of all the blocks on the datanode, along with their latest verification status. Here is a snippet of the block list (lines are split to fit the page):
blk_6035596358209321442 : status : ok type : none scan time : 0
not yet verified
blk_3065580480714947643 : status : ok type : remote scan time : 1215755306400
2008-07-11 05:48:26,400
blk_8729669677359108508 : status : ok type : local scan time : 1215755727345
2008-07-11 05:55:27,345
The first column is the block ID, followed by some key-value pairs. The status can be one of failed or ok, according to whether the last scan of the block detected a checksum error. The type of scan is local if it was performed by the background thread, remote if it was performed by a client or a remote datanode, or none if a scan of this block has yet to be made. The last piece of information is the scan time, which is displayed as the number of milliseconds since midnight on 1 January 1970, and also as a more readable value.
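The raw scan times are easy to convert yourself; a small Python sketch (the readable timestamps in the sample report above happen to correspond to UTC, though a datanode would format them in its own local time zone):

```python
from datetime import datetime, timezone

def readable_scan_time(scan_time_ms):
    # The scan time is milliseconds since the Unix epoch; render it as a
    # human-readable timestamp in UTC.
    dt = datetime.fromtimestamp(scan_time_ms / 1000.0, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S")

print(readable_scan_time(1215755306400))  # 2008-07-11 05:48:26
```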
balancer
Over time, the distribution of blocks across datanodes can become unbalanced. An unbalanced cluster can affect locality for MapReduce, and it puts a greater strain on the highly utilized datanodes, so it's best avoided.
The balancer program is a Hadoop daemon that redistributes blocks by moving them from overutilized datanodes to underutilized datanodes, while adhering to the block replica placement policy that makes data loss unlikely by placing block replicas on different racks (see “Replica Placement” on page 74). It moves blocks until the cluster is deemed to be balanced, which means that the utilization of every datanode (ratio of used space on the node to total capacity of the node) differs from the utilization of the cluster (ratio of used space on the cluster to total capacity of the cluster) by no more than a given threshold percentage. You can start the balancer with:
% start-balancer.sh
The -threshold argument specifies the threshold percentage that defines what it means for the cluster to be balanced. The flag is optional; if omitted, the threshold is 10%. At any one time, only one balancer may be running on the cluster.
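The stopping condition described above can be sketched in a few lines of Python. This is a simplified model of the check, not the actual Balancer code, which additionally honors the rack placement policy while moving blocks:

```python
def is_balanced(used, capacity, threshold=10.0):
    """Each datanode's utilization (used/capacity, as a percentage) must
    lie within `threshold` percentage points of the cluster-wide
    utilization. `used` and `capacity` are parallel per-datanode lists
    of byte counts."""
    cluster_util = 100.0 * sum(used) / sum(capacity)
    return all(abs(100.0 * u / c - cluster_util) <= threshold
               for u, c in zip(used, capacity))

# Two equal-capacity nodes at 50% and 70%: cluster utilization is 60%,
# both nodes are within 10 points of it, so the cluster is balanced.
print(is_balanced([50, 70], [100, 100]))  # True
print(is_balanced([20, 90], [100, 100]))  # False
```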
The balancer runs until the cluster is balanced, it cannot move any more blocks, or it loses contact with the namenode. It produces a logfile in the standard log directory, where it writes a line for every iteration of redistribution that it carries out. Here is the output from a short run on a small cluster:
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved
Mar 18, 2009 5:23:42 PM 0 0 KB 219.21 MB 150.29 MB
Mar 18, 2009 5:27:14 PM 1 195.24 MB 22.45 MB 150.29 MB
The cluster is balanced. Exiting...
Balancing took 6.072933333333333 minutes
The balancer is designed to run in the background without unduly taxing the cluster or interfering with other clients using the cluster. It limits the bandwidth that it uses to copy a block from one node to another. The default is a modest 1 MB/s, but this can be changed by setting the dfs.balance.bandwidthPerSec property in hdfs-site.xml, specified in bytes.
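For example, to raise the limit to 10 MB/s you might add something like the following to hdfs-site.xml (the value shown is an illustrative choice, converted to bytes):

```xml
<property>
  <name>dfs.balance.bandwidthPerSec</name>
  <!-- 10 MB/s, expressed in bytes -->
  <value>10485760</value>
</property>
```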
Monitoring
Monitoring is an important part of system administration. In this section, we look at the monitoring facilities in Hadoop and how they can hook into external monitoring systems.
The purpose of monitoring is to detect when the cluster is not providing the expected level of service. The master daemons are the most important to monitor: the namenodes (primary and secondary) and the jobtracker. Failure of datanodes and tasktrackers is to be expected, particularly on larger clusters, so you should provide extra capacity so that the cluster can tolerate having a small percentage of dead nodes at any time.
In addition to the facilities described next, some administrators run test jobs on a periodic basis as a test of the cluster's health.
There is a lot of work going on to add more monitoring capabilities to Hadoop, which is not covered here. For example, Chukwa2 is a data collection and monitoring system built on HDFS and MapReduce, and it excels at mining log data for finding large-scale trends.
Logging
All Hadoop daemons produce logfiles that can be very useful for finding out what is happening in the system. “System logfiles” on page 307 explains how to configure these files.
Setting log levels
When debugging a problem, it is very convenient to be able to change the log level temporarily for a particular component in the system.
2. http://hadoop.apache.org/chukwa
Hadoop daemons have a web page for changing the log level for any log4j log name, which can be found at /logLevel in the daemon's web UI. By convention, log names in Hadoop correspond to the classname doing the logging, although there are exceptions to this rule, so you should consult the source code to find log names.
For example, to enable debug logging for the JobTracker class, we would visit the jobtracker's web UI at http://jobtracker-host:50030/logLevel and set the log name org.apache.hadoop.mapred.JobTracker to level DEBUG.
The same thing can be achieved from the command line as follows:
% hadoop daemonlog -setlevel jobtracker-host:50030 \
org.apache.hadoop.mapred.JobTracker DEBUG
Log levels changed in this way are reset when the daemon restarts, which is usually what you want. However, to make a persistent change to a log level, simply change the log4j.properties file in the configuration directory. In this case, the line to add is:
log4j.logger.org.apache.hadoop.mapred.JobTracker=DEBUG
Getting stack traces
Hadoop daemons expose a web page (/stacks in the web UI) that produces a thread dump for all running threads in the daemon's JVM. For example, you can get a thread dump for a jobtracker from http://jobtracker-host:50030/stacks.
Metrics
The HDFS and MapReduce daemons collect information about events and measurements that are collectively known as metrics. For example, datanodes collect the following metrics (and many more): the number of bytes written, the number of blocks replicated, and the number of read requests from clients (both local and remote).
Metrics belong to a context, and Hadoop currently uses “dfs”, “mapred”, “rpc”, and “jvm” contexts. Hadoop daemons usually collect metrics under several contexts. For example, datanodes collect metrics for the “dfs”, “rpc”, and “jvm” contexts.
How Do Metrics Differ from Counters?
The main difference is their scope: metrics are collected by Hadoop daemons, whereas counters (see “Counters” on page 257) are collected for MapReduce tasks and aggregated for the whole job. They have different audiences, too: broadly speaking, metrics are for administrators, and counters are for MapReduce users.
The way they are collected and aggregated is also different. Counters are a MapReduce feature, and the MapReduce system ensures that counter values are propagated from the tasktrackers where they are produced, back to the jobtracker, and finally back to the client running the MapReduce job. (Counters are propagated via RPC heartbeats; see “Progress and Status Updates” on page 192.) Both the tasktrackers and the jobtracker perform aggregation.
The collection mechanism for metrics is decoupled from the component that receives the updates, and there are various pluggable outputs, including local files, Ganglia, and JMX. The daemon collecting the metrics performs aggregation on them before they are sent to the output.
A context defines the unit of publication; you can choose to publish the “dfs” context but not the “jvm” context, for instance. Metrics are configured in the conf/hadoop-metrics.properties file, and, by default, all contexts are configured so they do not publish their metrics. This is the contents of the default configuration file (minus the comments):
dfs.class=org.apache.hadoop.metrics.spi.NullContext
mapred.class=org.apache.hadoop.metrics.spi.NullContext
jvm.class=org.apache.hadoop.metrics.spi.NullContext
rpc.class=org.apache.hadoop.metrics.spi.NullContext
Each line in this file configures a different context and specifies the class that handles the metrics for that context. The class must be an implementation of the MetricsContext interface; and, as the name suggests, the NullContext class neither publishes nor updates metrics.
The other implementations of MetricsContext are covered in the following sections.
You can view raw metrics gathered by a particular Hadoop daemon by connecting to its /metrics web page. This is handy for debugging. For example, you can view jobtracker metrics in plain text at http://jobtracker-host:50030/metrics. To retrieve metrics in JSON format, you would use http://jobtracker-host:50030/metrics?format=json.
FileContext
FileContext writes metrics to a local file. It exposes two configuration properties: fileName, which specifies the absolute name of the file to write to, and period, for the time interval (in seconds) between file updates. Both properties are optional; if not set, the metrics will be written to standard output every five seconds.
Configuration properties apply to a context name and are specified by appending the property name to the context name (separated by a dot). For example, to dump the “jvm” context to a file, we alter its configuration to be the following:
jvm.class=org.apache.hadoop.metrics.file.FileContext
jvm.fileName=/tmp/jvm_metrics.log
In the first line, we have changed the “jvm” context to use a FileContext, and in the second, we have set the “jvm” context's fileName property to be a temporary file. Here are two lines of output from the logfile, split over several lines to fit the page:
3. The term “context” is (perhaps unfortunately) overloaded here, since it can refer to either a collection of metrics (the “dfs” context, for example) or the class that publishes metrics (the NullContext, for example).
jvm.metrics: hostName=ip-10-250-59-159, processName=NameNode, sessionId=,
gcCount=46, gcTimeMillis=394, logError=0, logFatal=0, logInfo=59, logWarn=1,
memHeapCommittedM=4.9375, memHeapUsedM=2.5322647, memNonHeapCommittedM=18.25,
memNonHeapUsedM=11.330269, threadsBlocked=0, threadsNew=0, threadsRunnable=6,
threadsTerminated=0, threadsTimedWaiting=8, threadsWaiting=13
jvm.metrics: hostName=ip-10-250-59-159, processName=SecondaryNameNode, sessionId=,
gcCount=36, gcTimeMillis=261, logError=0, logFatal=0, logInfo=18, logWarn=4,
memHeapCommittedM=5.4414062, memHeapUsedM=4.46756, memNonHeapCommittedM=18.25,
memNonHeapUsedM=10.624519, threadsBlocked=0, threadsNew=0, threadsRunnable=5,
threadsTerminated=0, threadsTimedWaiting=4, threadsWaiting=2
FileContext can be useful on a local system for debugging purposes, but it is unsuitable on a larger cluster, since the output files are spread across the cluster, which makes analyzing them difficult.
GangliaContext
Ganglia (http://ganglia.info/) is an open source distributed monitoring system for very large clusters. It is designed to impose very low resource overheads on each node in the cluster. Ganglia itself collects metrics, such as CPU and memory usage; by using GangliaContext, you can inject Hadoop metrics into Ganglia.
GangliaContext has one required property, servers, which takes a space- and/or comma-separated list of Ganglia server host-port pairs. Further details on configuring this context can be found on the Hadoop wiki.
For a flavor of the kind of information you can get out of Ganglia, see Figure 10-2, which shows how the number of tasks in the jobtracker's queue varies over time.
Figure 10-2. Ganglia plot of number of tasks in the jobtracker queue
NullContextWithUpdateThread
Both FileContext and GangliaContext push metrics to an external system. However, some monitoring systems (notably JMX) need to pull metrics from Hadoop. NullContextWithUpdateThread is designed for this. Like NullContext, it doesn't publish any metrics, but in addition it runs a timer that periodically updates the metrics stored in memory. This ensures that the metrics are up-to-date when they are fetched by another system.
All implementations of MetricsContext, except NullContext, perform this updating function (and they all expose a period property that defaults to five seconds), so you need to use NullContextWithUpdateThread only if you are not collecting metrics using another output. If you were using GangliaContext, for example, then it would ensure the metrics are updated, so you would be able to use JMX in addition with no further configuration of the metrics system. JMX is discussed in more detail shortly.
CompositeContext
CompositeContext allows you to output the same set of metrics to multiple contexts, such as a FileContext and a GangliaContext. The configuration is slightly tricky and is best shown by an example:
jvm.class=org.apache.hadoop.metrics.spi.CompositeContext
jvm.arity=2
jvm.sub1.class=org.apache.hadoop.metrics.file.FileContext
jvm.sub2.class=org.apache.hadoop.metrics.ganglia.GangliaContext
The arity property is used to specify the number of subcontexts; in this case, there are two. The property names for each subcontext are modified to have a part specifying the subcontext number, hence jvm.sub1.class and jvm.sub2.class.
Java Management Extensions
Java Management Extensions (JMX) is a standard Java API for monitoring and managing applications. Hadoop includes several managed beans (MBeans), which expose Hadoop metrics to JMX-aware applications. There are MBeans that expose the metrics in the “dfs” and “rpc” contexts, but none for the “mapred” context (at the time of this writing) or the “jvm” context (as the JVM itself exposes a richer set of JVM metrics). These MBeans are listed in Table 10-3.
Table 10-3. Hadoop MBeans
MBean class Daemons Metrics
NameNodeActivityMBean Namenode Namenode activity metrics, such as the
number of create file operations
FSNamesystemMBean Namenode Namenode status metrics, such as the
number of connected datanodes
DataNodeActivityMBean Datanode Datanode activity metrics, such as number of bytes read
FSDatasetMBean Datanode Datanode storage metrics, such as
capacity and free storage space
RpcActivityMBean All daemons that use RPC:
namenode, datanode,
jobtracker, tasktracker
RPC statistics, such as average processing time
The JDK comes with a tool called JConsole for viewing MBeans in a running JVM. It's useful for browsing Hadoop metrics, as demonstrated in Figure 10-3.
Figure 10-3. JConsole view of a locally running namenode, showing metrics for the filesystem state
Although you can see Hadoop metrics via JMX using the default metrics configuration, they will not be updated unless you change the MetricsContext implementation to something other than NullContext. For example, NullContextWithUpdateThread is appropriate if JMX is the only way you will be monitoring metrics.
Many third-party monitoring and alerting systems (such as Nagios or Hyperic) can query MBeans, making JMX the natural way to monitor your Hadoop cluster from an existing monitoring system. You will need to enable remote access to JMX, however, and choose a level of security that is appropriate for your cluster. The options here include password authentication, SSL connections, and SSL client-authentication. See the official Java documentation4 for an in-depth guide on configuring these options.
All the options for enabling remote access to JMX involve setting Java system properties, which we do for Hadoop by editing the conf/hadoop-env.sh file. The following configuration settings show how to enable password-authenticated remote access to JMX on the namenode (with SSL disabled). The process is very similar for other Hadoop daemons:
4. http://java.sun.com/javase/6/docs/technotes/guides/management/agent.html
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.password.file=$HADOOP_CONF_DIR/jmxremote.password
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.port=8004 $HADOOP_NAMENODE_OPTS"
The jmxremote.password file lists the usernames and their passwords in plain text; the JMX documentation has further details on the format of this file.
With this configuration, we can use JConsole to browse MBeans on a remote namenode. Alternatively, we can use one of the many JMX tools to retrieve MBean attribute values. Here is an example of using the “jmxquery” command-line tool (and Nagios plug-in, available from http://code.google.com/p/jmxquery/) to retrieve the number of under-replicated blocks:
% ./check_jmx -U service:jmx:rmi:///jndi/rmi://namenode-host:8004/jmxrmi -O \
hadoop:service=NameNode,name=FSNamesystemState -A UnderReplicatedBlocks \
-w 100 -c 1000 -username monitorRole -password secret
JMX OK - UnderReplicatedBlocks is 0
This command establishes a JMX RMI connection to the host namenode-host on port 8004 and authenticates using the given username and password. It reads the attribute UnderReplicatedBlocks of the object named hadoop:service=NameNode,name=FSNamesystemState and prints out its value on the console. The -w and -c options specify warning and critical levels for the value: the appropriate values of these are normally determined after operating a cluster for a while.
It's common to use Ganglia in conjunction with an alerting system like Nagios for monitoring a Hadoop cluster. Ganglia is good at efficiently collecting a large number of metrics and graphing them, whereas Nagios and similar systems are good at sending alerts when a critical threshold is reached in any of a smaller set of metrics.
Maintenance
Routine Administration Procedures
Metadata backups
If the namenode's persistent metadata is lost or damaged, the entire filesystem is rendered unusable, so it is critical that backups are made of these files. You should keep multiple copies of different ages (one hour, one day, one week, and one month, say) to protect against corruption, either in the copies themselves or in the live files running on the namenode.
5. It's convenient to use JConsole to find the object names of the MBeans that you want to monitor. Note that MBeans for datanode metrics contain a random identifier in Hadoop 0.20, which makes it difficult to monitor them in anything but an ad hoc way. This was fixed in Hadoop 0.21.0.
A straightforward way to make backups is to write a script to periodically archive the secondary namenode's previous.checkpoint subdirectory (under the directory defined by the fs.checkpoint.dir property) to an offsite location. The script should additionally test the integrity of the copy. This can be done by starting a local namenode daemon and verifying that it has successfully read the fsimage and edits files into memory (by scanning the namenode log for the appropriate success message, for example).
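A minimal sketch of the archiving step in Python; the paths and archive naming are assumptions, and a real script would additionally copy the archive offsite and run the integrity check described above:

```python
import os
import shutil
import tempfile
import time
from pathlib import Path

def archive_checkpoint(checkpoint_dir, backup_dir):
    """Archive a copy of the namenode metadata directory as a timestamped
    tarball under backup_dir, returning the archive's path. In practice,
    checkpoint_dir would be <fs.checkpoint.dir>/previous.checkpoint and
    backup_dir an offsite mount; the names here are illustrative."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base = backup_dir / f"namenode-meta-{stamp}"
    # shutil.make_archive appends the ".tar.gz" suffix itself
    return shutil.make_archive(str(base), "gztar", root_dir=str(checkpoint_dir))

# Demonstrate on a throwaway directory standing in for previous.checkpoint:
demo_src = tempfile.mkdtemp()
Path(demo_src, "fsimage").write_bytes(b"fake image data")
demo_out = archive_checkpoint(demo_src, tempfile.mkdtemp())
print(demo_out)
```

Run from cron, a script like this gives you the rolling set of copies of different ages recommended above.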
Data backups
Although HDFS is designed to store data reliably, data loss can occur, just like in any storage system, and thus a backup strategy is essential. With the large data volumes that Hadoop can store, deciding what data to back up and where to store it is a challenge. The key here is to prioritize your data. The highest priority is the data that cannot be regenerated and that is critical to the business; however, data that is straightforward to regenerate, or essentially disposable because it is of limited business value, is the lowest priority, and you may choose not to make backups of this category of data.
Do not make the mistake of thinking that HDFS replication is a substitute for making backups. Bugs in HDFS can cause replicas to be lost, and so can hardware failures. Although Hadoop is expressly designed so that hardware failure is very unlikely to result in data loss, the possibility can never be completely ruled out, particularly when combined with software bugs or human error.
When it comes to backups, think of HDFS in the same way as you would RAID. Although the data will survive the loss of an individual RAID disk, it may not if the RAID controller fails, or is buggy (perhaps overwriting some data), or the entire array is damaged.
It's common to have a policy for user directories in HDFS. For example, they may have space quotas and be backed up nightly. Whatever the policy, make sure your users know what it is, so they know what to expect.
The distcp tool is ideal for making backups to other HDFS clusters (preferably running on a different version of the software, to guard against loss due to bugs in HDFS) or other Hadoop filesystems (such as S3 or KFS), since it can copy files in parallel. Alternatively, you can employ an entirely different storage system for backups, using one of the ways to export data from HDFS described in “Hadoop Filesystems” on page 54.
6. Hadoop 0.23.0 comes with an Offline Image Viewer and Offline Edits Viewer, which can be used to check the integrity of the image and edits files. Note that both viewers support older formats of these files, so you can use them to diagnose problems in these files generated by previous releases of Hadoop. Type hdfs oiv and hdfs oev to invoke these tools.
Filesystem check (fsck)
It is advisable to run HDFS's fsck tool regularly (for example, daily) on the whole filesystem to proactively look for missing or corrupt blocks. See “Filesystem check (fsck)” on page 345.
Filesystem balancer
Run the balancer tool (see “balancer” on page 348) regularly to keep the filesystem datanodes evenly balanced.
Commissioning and Decommissioning Nodes
As an administrator of a Hadoop cluster, you will need to add or remove nodes from time to time. For example, to grow the storage available to a cluster, you commission new nodes. Conversely, sometimes you may wish to shrink a cluster, and to do so, you decommission nodes. It can sometimes be necessary to decommission a node if it is misbehaving, perhaps because it is failing more often than it should or its performance is noticeably slow.
Nodes normally run both a datanode and a tasktracker, and both are typically commissioned or decommissioned in tandem.
Commissioning new nodes
Although commissioning a new node can be as simple as configuring the hdfs-site.xml file to point to the namenode, configuring the mapred-site.xml file to point to the jobtracker, and starting the datanode and tasktracker daemons, it is generally best to have a list of authorized nodes.
It is a potential security risk to allow any machine to connect to the namenode and act as a datanode, since the machine may gain access to data that it is not authorized to see. Furthermore, since such a machine is not a real datanode, it is not under your control and may stop at any time, causing potential data loss. (Imagine what would happen if a number of such nodes were connected and a block of data was present only on the “alien” nodes.) This scenario is a risk even inside a firewall, due to misconfiguration, so datanodes (and tasktrackers) should be explicitly managed on all production clusters.
Datanodes that are permitted to connect to the namenode are specified in a file whose name is specified by the dfs.hosts property. The file resides on the namenode's local filesystem, and it contains a line for each datanode, specified by network address (as reported by the datanode; you can see what this is by looking at the namenode's web UI). If you need to specify multiple network addresses for a datanode, put them on one line, separated by whitespace.
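For example, an include file might look like this (the hostnames and address are hypothetical; the second line shows a datanode with two addresses):

```
datanode1.example.com
datanode2.example.com 10.1.2.3
```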
Similarly, tasktrackers that may connect to the jobtracker are specified in a file whose name is specified by the mapred.hosts property. In most cases, there is one shared file, referred to as the include file, that both dfs.hosts and mapred.hosts refer to, since nodes in the cluster run both datanode and tasktracker daemons.
The file (or files) specified by the dfs.hosts and mapred.hosts properties is different from the slaves file. The former is used by the namenode and jobtracker to determine which worker nodes may connect. The slaves file is used by the Hadoop control scripts to perform cluster-wide operations, such as cluster restarts. It is never used by the Hadoop daemons.
To add new nodes to the cluster:
1. Add the network addresses of the new nodes to the include file.
2. Update the namenode with the new set of permitted datanodes using this command:
% hadoop dfsadmin -refreshNodes
3. Update the jobtracker with the new set of permitted tasktrackers using:
% hadoop mradmin -refreshNodes
4. Update the slaves file with the new nodes, so that they are included in future operations performed by the Hadoop control scripts.
5. Start the new datanodes and tasktrackers.
6. Check that the new datanodes and tasktrackers appear in the web UI.
HDFS will not move blocks from old datanodes to new datanodes to balance the cluster. To do this, you should run the balancer described in “balancer” on page 348.
Decommissioning old nodes
Although HDFS is designed to tolerate datanode failures, this does not mean you can
just terminate datanodes en masse with no ill effect. With a replication level of three,
for example, the chances are very high that you will lose data by simultaneously shutting
down three datanodes if they are on different racks. The way to decommission
datanodes is to inform the namenode of the nodes that you wish to take out of circu-
lation, so that it can replicate the blocks to other datanodes before the datanodes are
shut down.
With tasktrackers, Hadoop is more forgiving. If you shut down a tasktracker that is
running tasks, the jobtracker will notice the failure and reschedule the tasks on other
tasktrackers.
The decommissioning process is controlled by an exclude file, which for HDFS is set
by the dfs.hosts.exclude property and for MapReduce by the mapred.hosts.exclude
property. It is often the case that these properties refer to the same file. The exclude file
lists the nodes that are not permitted to connect to the cluster.
358 | Chapter 10: Administering Hadoop
The rules for whether a tasktracker may connect to the jobtracker are simple: a task-
tracker may connect only if it appears in the include file and does not appear in the
exclude file. An unspecified or empty include file is taken to mean that all nodes are in
the include file.
For HDFS, the rules are slightly different. If a datanode appears in both the include and
the exclude file, then it may connect, but only to be decommissioned. Table 10-4 sum-
marizes the different combinations for datanodes. As for tasktrackers, an unspecified
or empty include file means all nodes are included.
Table 10-4. HDFS include and exclude file precedence
Node appears in include file Node appears in exclude file Interpretation
No No Node may not connect.
No Yes Node may not connect.
Yes No Node may connect.
Yes Yes Node may connect and will be decommissioned.
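These rules can be expressed compactly. The following Python sketch (not Hadoop source code, just an illustration of Table 10-4) treats an empty include file as admitting every node:

```python
# Illustration (not Hadoop code) of the include/exclude precedence rules
# in Table 10-4. `include` and `exclude` are sets of node addresses read
# from the include and exclude files.

def datanode_state(node, include, exclude):
    """What the namenode does when `node` tries to connect."""
    in_include = not include or node in include  # empty include file = all nodes
    in_exclude = node in exclude
    if not in_include:
        return "may not connect"
    if in_exclude:
        return "connect and decommission"
    return "may connect"

include, exclude = {"dn1", "dn2"}, {"dn2", "dn3"}
print(datanode_state("dn1", include, exclude))  # may connect
print(datanode_state("dn2", include, exclude))  # connect and decommission
print(datanode_state("dn3", include, exclude))  # may not connect
```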
To remove nodes from the cluster:
1. Add the network addresses of the nodes to be decommissioned to the exclude file.
Do not update the include file at this point.
2. Update the namenode with the new set of permitted datanodes using this command:
% hadoop dfsadmin -refreshNodes
3. Update the jobtracker with the new set of permitted tasktrackers using:
% hadoop mradmin -refreshNodes
4. Go to the web UI and check whether the admin state has changed to "Decommis-
sion In Progress" for the datanodes being decommissioned. They will start copying
their blocks to other datanodes in the cluster.
5. When all the datanodes report their state as "Decommissioned," then all the blocks
have been replicated. Shut down the decommissioned nodes.
6. Remove the nodes from the include file, and run:
% hadoop dfsadmin -refreshNodes
% hadoop mradmin -refreshNodes
7. Remove the nodes from the slaves file.
Upgrades
Upgrading an HDFS and MapReduce cluster requires careful planning. The most im-
portant consideration is the HDFS upgrade. If the layout version of the filesystem has
changed, then the upgrade will automatically migrate the filesystem data and metadata
to a format that is compatible with the new version. As with any procedure that involves
data migration, there is a risk of data loss, so you should be sure that both your data
and metadata are backed up (see "Routine Administration Procedures" on page 355).
Part of the planning process should include a trial run on a small test cluster with a
copy of data that you can afford to lose. A trial run will allow you to familiarize yourself
with the process, customize it to your particular cluster configuration and toolset, and
iron out any snags before running the upgrade procedure on a production cluster. A
test cluster also has the benefit of being available to test client upgrades on. You can
read about general compatibility concerns for clients in "Compatibility" on page 15.
Upgrading a cluster when the filesystem layout has not changed is fairly
straightforward: install the new versions of HDFS and MapReduce on the cluster (and
on clients at the same time), shut down the old daemons, update configuration files,
then start up the new daemons and switch clients to use the new libraries. This process
is reversible, so rolling back an upgrade is also straightforward.
After every successful upgrade, you should perform a couple of final cleanup steps:
• Remove the old installation and configuration files from the cluster.
• Fix any deprecation warnings in your code and configuration.
HDFS data and metadata upgrades
If you use the procedure just described to upgrade to a new version of HDFS and it
expects a different layout version, then the namenode will refuse to run. A message like
the following will appear in its log:
File system image contains an old layout version -16.
An upgrade to version -18 is required.
Please restart NameNode with -upgrade option.
The most reliable way of finding out whether you need to upgrade the filesystem is by
performing a trial on a test cluster.
An upgrade of HDFS makes a copy of the previous version's metadata and data. Doing
an upgrade does not double the storage requirements of the cluster, as the datanodes
use hard links to keep two references (for the current and previous version) to the same
block of data. This design makes it straightforward to roll back to the previous version
of the filesystem, should you need to. You should understand that any changes made
to the data on the upgraded system will be lost after the rollback completes.
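The hard-link trick is easy to see with the standard library. The following Python sketch (illustrative only; the block filenames are made up) shows two directory entries sharing one inode, so the data itself is stored once:

```python
import os
import tempfile

# Demonstration of why hard links avoid doubling storage: two directory
# entries point at the same underlying data, much as HDFS datanodes keep
# "current" and "previous" references to one block during an upgrade.
with tempfile.TemporaryDirectory() as d:
    current = os.path.join(d, "current_blk_0001")    # made-up names
    previous = os.path.join(d, "previous_blk_0001")

    with open(current, "wb") as f:
        f.write(b"block data" * 1024)

    os.link(current, previous)            # a hard link, not a copy

    st = os.stat(current)
    assert st.st_nlink == 2               # two names, one inode
    assert os.stat(previous).st_ino == st.st_ino
    print("links:", st.st_nlink)
```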
You can keep only the previous version of the filesystem: you can't roll back several
versions. Therefore, to carry out another upgrade to HDFS data and metadata, you will
need to delete the previous version, a process called finalizing the upgrade. Once an
upgrade is finalized, there is no procedure for rolling back to a previous version.
In general, you can skip releases when upgrading (for example, you can upgrade from
release 0.18.3 to 0.20.0 without having to upgrade to a 0.19.x release first), but in some
cases, you may have to go through intermediate releases. The release notes make it clear
when this is required.
You should only attempt to upgrade a healthy filesystem. Before running the upgrade,
do a full fsck (see "Filesystem check (fsck)" on page 345). As an extra precaution, you
can keep a copy of the fsck output that lists all the files and blocks in the system, so
you can compare it with the output of running fsck after the upgrade.
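One way to compare the before and after listings is with a small script. The sketch below (Python, using a simplified made-up "path: blocks" line format rather than real fsck output) shows the idea:

```python
# Sketch: compare two listings (file -> block list) captured before and
# after an upgrade, reporting files whose blocks changed or disappeared.
# The line format here is a simplified stand-in, not actual fsck output.

def parse_listing(text):
    files = {}
    for line in text.strip().splitlines():
        path, _, blocks = line.partition(":")
        files[path.strip()] = blocks.split()
    return files

def diff_listings(before, after):
    b, a = parse_listing(before), parse_listing(after)
    changed = {p for p in b if b[p] != a.get(p)}   # includes missing files
    missing = set(b) - set(a)
    return changed, missing

before = "/user/a.txt: blk_1 blk_2\n/user/b.txt: blk_3"
after = "/user/a.txt: blk_1 blk_2\n/user/b.txt: blk_3"
print(diff_listings(before, after))   # (set(), set()) when nothing changed
```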
It's also worth clearing out temporary files before doing the upgrade, both from the
MapReduce system directory on HDFS and local temporary files.
With these preliminaries out of the way, here is the high-level procedure for upgrading
a cluster when the filesystem layout needs to be migrated:
1. Make sure that any previous upgrade is finalized before proceeding with another
upgrade.
2. Shut down MapReduce and kill any orphaned task processes on the tasktrackers.
3. Shut down HDFS and back up the namenode directories.
4. Install new versions of Hadoop HDFS and MapReduce on the cluster and on
clients.
5. Start HDFS with the -upgrade option.
6. Wait until the upgrade is complete.
7. Perform some sanity checks on HDFS.
8. Start MapReduce.
9. Roll back or finalize the upgrade (optional).
While running the upgrade procedure, it is a good idea to remove the Hadoop scripts
from your PATH environment variable. This forces you to be explicit about which version
of the scripts you are running. It can be convenient to define two environment variables
for the new installation directories; in the following instructions, we have defined
OLD_HADOOP_INSTALL and NEW_HADOOP_INSTALL.
To perform the upgrade, run the following command (this is step 5 in
the high-level upgrade procedure):
% $NEW_HADOOP_INSTALL/bin/start-dfs.sh -upgrade
This causes the namenode to upgrade its metadata, placing the previous version in a
new directory called previous. Similarly, datanodes upgrade their storage directories,
preserving the old copy in a directory called previous.
The upgrade process is not instantaneous, but you can
check the progress of an upgrade using dfsadmin; upgrade events also appear in the
daemons' logfiles (step 6):
% $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -upgradeProgress status
Upgrade for version -18 has been completed.
Upgrade is not finalized.
This shows that the upgrade is complete. At this stage, you should run
some sanity checks (step 7) on the filesystem (check files and blocks using fsck, basic
file operations). You might choose to put HDFS into safe mode while you are running
some of these checks (the ones that are read-only) to prevent others from making
changes.
If you find that the new version is not working correctly,
you may choose to roll back to the previous version (step 9). This is only possible if
you have not finalized the upgrade.
A rollback reverts the filesystem state to before the upgrade was per-
formed, so any changes made in the meantime will be lost. In other
words, it rolls back to the previous state of the filesystem, rather than
downgrading the current state of the filesystem to a former version.
First, shut down the new daemons:
% $NEW_HADOOP_INSTALL/bin/stop-dfs.sh
Then start up the old version of HDFS with the -rollback option:
% $OLD_HADOOP_INSTALL/bin/start-dfs.sh -rollback
This command gets the namenode and datanodes to replace their current storage
directories with their previous copies. The filesystem will be returned to its previous
state.
When you are happy with the new version of HDFS, you
can finalize the upgrade (step 9) to remove the previous storage directories.
After an upgrade has been finalized, there is no way to roll back to the
previous version.
This step is required before performing another upgrade:
% $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -finalizeUpgrade
% $NEW_HADOOP_INSTALL/bin/hadoop dfsadmin -upgradeProgress status
There are no upgrades in progress.
HDFS is now fully upgraded to the new version.
Pig
Pig raises the level of abstraction for processing large datasets. MapReduce allows you,
the programmer, to specify a map function followed by a reduce function, but working
out how to fit your data processing into this pattern, which often requires multiple
MapReduce stages, can be a challenge. With Pig, the data structures are much richer,
typically being multivalued and nested, and the set of transformations you can apply
to the data is much more powerful; they include joins, for example, which are not
for the faint of heart in MapReduce.
Pig is made up of two pieces:
• The language used to express data flows, called Pig Latin.
• The execution environment to run Pig Latin programs. There are currently two
environments: local execution in a single JVM and distributed execution on a Ha-
doop cluster.
A Pig Latin program is made up of a series of operations, or transformations, that are
applied to the input data to produce output. Taken as a whole, the operations describe
a data flow, which the Pig execution environment translates into an executable repre-
sentation and then runs. Under the covers, Pig turns the transformations into a series
of MapReduce jobs, but as a programmer you are mostly unaware of this, which allows
you to focus on the data rather than the nature of the execution.
Pig is a scripting language for exploring large datasets. One criticism of MapReduce is
that the development cycle is very long. Writing the mappers and reducers, compiling
and packaging the code, submitting the job(s), and retrieving the results is a time-
consuming business, and even with Streaming, which removes the compile and package
step, the experience is still involved. Pig's sweet spot is its ability to process terabytes
of data simply by issuing a half-dozen lines of Pig Latin from the console. Indeed, it
was created at Yahoo! to make it easier for researchers and engineers to mine the huge
datasets there. Pig is very supportive of a programmer writing a query, since it provides
several commands for introspecting the data structures in your program as it is written.
Even more useful, it can perform a sample run on a representative subset of your input
data, so you can see whether there are errors in the processing before unleashing it on
the full dataset.
Pig was designed to be extensible. Virtually all parts of the processing path are cus-
tomizable: loading, storing, filtering, grouping, and joining can all be altered by user-
defined functions (UDFs). These functions operate on Pig's nested data model, so they
can integrate very deeply with Pig's operators. As another benefit, UDFs tend to be
more reusable than the libraries developed for writing MapReduce programs.
Pig isn't suitable for all data processing tasks, however. Like MapReduce, it is designed
for batch processing of data. If you want to perform a query that touches only a small
amount of data in a large dataset, then Pig will not perform well, since it is set up to
scan the whole dataset, or at least large portions of it.
In some cases, Pig doesn't perform as well as programs written in MapReduce. How-
ever, the gap is narrowing with each release, as the Pig team implements sophisticated
algorithms for implementing Pig's relational operators. It's fair to say that unless you
are willing to invest a lot of effort optimizing Java MapReduce code, writing queries in
Pig Latin will save you time.
Installing and Running Pig
Pig runs as a client-side application. Even if you want to run Pig on a Hadoop cluster,
there is nothing extra to install on the cluster: Pig launches jobs and interacts with
HDFS (or other Hadoop filesystems) from your workstation.
Installation is straightforward. Java 6 is a prerequisite (and on Windows, you will need
Cygwin). Download a stable release from http://pig.apache.org/releases.html, and un-
pack the tarball in a suitable place on your workstation:
% tar xzf pig-x.y.z.tar.gz
It's convenient to add Pig's binary directory to your command-line path. For example:
% export PIG_INSTALL=/home/tom/pig-x.y.z
% export PATH=$PATH:$PIG_INSTALL/bin
You also need to set the JAVA_HOME environment variable to point to a suitable Java
installation. Try typing pig -help to get usage instructions.
Execution Types
Pig has two execution types or modes: local mode and MapReduce mode.
366 | Chapter 11: Pig
Local mode
In local mode, Pig runs in a single JVM and accesses the local filesystem. This mode is
suitable only for small datasets and when trying out Pig.
The execution type is set using the -x or -exectype option. To run in local mode, set
the option to local:
% pig -x local
This starts Grunt, the Pig interactive shell, which is discussed in more detail shortly.
MapReduce mode
In MapReduce mode, Pig translates queries into MapReduce jobs and runs them on a
Hadoop cluster. The cluster may be a pseudo- or fully distributed cluster. MapReduce
mode (with a fully distributed cluster) is what you use when you want to run Pig on
large datasets.
To use MapReduce mode, you first need to check that the version of Pig you down-
loaded is compatible with the version of Hadoop you are using. Pig releases will only
work against particular versions of Hadoop; this is documented in the release notes.
Pig honors the HADOOP_HOME environment variable for finding which Hadoop client to
run. However, if it is not set, Pig will use a bundled copy of the Hadoop libraries. Note
that these may not match the version of Hadoop running on your cluster, so it is best
to explicitly set HADOOP_HOME.
Next, you need to point Pig at the cluster's namenode and jobtracker. If the installation
of Hadoop at HADOOP_HOME is already configured for this, then there is nothing more to
do. Otherwise, you can set HADOOP_CONF_DIR to a directory containing the Hadoop site
file (or files) that define fs.default.name and mapred.job.tracker.
Alternatively, you can set these two properties in the pig.properties file in Pig's conf
directory (or the directory specified by PIG_CONF_DIR). Here's an example for a pseudo-
distributed setup (the values match the connection log shown below):
fs.default.name=hdfs://localhost/
mapred.job.tracker=localhost:8021
Once you have configured Pig to connect to a Hadoop cluster, you can launch Pig,
setting the -x option to mapreduce, or omitting it entirely, as MapReduce mode is the
default:
% pig
2012-01-18 20:23:05,764 [main] INFO org.apache.pig.Main - Logging error message
s to: /private/tmp/pig_1326946985762.log
2012-01-18 20:23:06,009 [main] INFO org.apache.pig.backend.hadoop.executionengi
ne.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost/
2012-01-18 20:23:06,274 [main] INFO org.apache.pig.backend.hadoop.executionengi
ne.HExecutionEngine - Connecting to map-reduce job tracker at: localhost:8021
Installing and Running Pig | 367
As you can see from the output, Pig reports the filesystem and jobtracker that it has
connected to.
Running Pig Programs
There are three ways of executing Pig programs, all of which work in both local and
MapReduce mode:
Script
Pig can run a script file that contains Pig commands. For example, pig
script.pig runs the commands in the local file script.pig. Alternatively, for very
short scripts, you can use the -e option to run a script specified as a string on the
command line.
Grunt
Grunt is an interactive shell for running Pig commands. Grunt is started when no
file is specified for Pig to run and the -e option is not used. It is also possible to
run Pig scripts from within Grunt using run and exec.
Embedded
You can run Pig programs from Java using the PigServer class, much like you can
use JDBC to run SQL programs from Java. For programmatic access to Grunt, use
PigRunner.
Grunt has line-editing facilities like those found in GNU Readline (used in the bash
shell and many other command-line applications). For instance, the Ctrl-E key com-
bination will move the cursor to the end of the line. Grunt remembers command his-
tory, too, and you can recall lines in the history buffer using Ctrl-P or Ctrl-N (for
previous and next) or, equivalently, the up or down cursor keys.
Another handy feature is Grunt's completion mechanism, which will try to complete
Pig Latin keywords and functions when you press the Tab key. For example, consider
the following incomplete line:
grunt> a = foreach b ge
If you press the Tab key at this point, ge will expand to generate, a Pig Latin keyword:
grunt> a = foreach b generate
You can customize the completion tokens by creating a file named autocomplete and
placing it on Pig's classpath (such as in the conf directory in Pig's install directory), or
in the directory you invoked Grunt from. The file should have one token per line, and
tokens must not contain any whitespace. Matching is case-sensitive. It can be very
handy to add commonly used file paths (especially because Pig does not perform file-
name completion) or the names of any user-defined functions you have created.
1. History is stored in a file called .pig_history in your home directory.
You can get a list of commands using the help command. When you've finished your
Grunt session, you can exit with the quit command.
Pig Latin Editors
PigPen is an Eclipse plug-in that provides an environment for developing Pig programs.
It includes a Pig script text editor, an example generator (equivalent to the ILLUS-
TRATE command), and a button for running the script on a Hadoop cluster. There is
also an operator graph window, which shows a script in graph form, for visualizing the
data flow. For full installation and usage instructions, please refer to the Pig wiki.
There are also Pig Latin syntax highlighters for other editors, including Vim and Text-
Mate. Details are available on the Pig wiki.
An Example
Let's look at a simple example by writing the program to calculate the maximum
recorded temperature by year for the weather dataset in Pig Latin (just like we did using
MapReduce in Chapter 2). The complete program is only a few lines long:
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
  MAX(filtered_records.temperature);
DUMP max_temp;
To explore what's going on, we'll use Pig's Grunt interpreter, which allows us to enter
lines and interact with the program to understand what it's doing. Start up Grunt in
local mode, then enter the first line of the Pig script:
grunt> records = LOAD 'input/ncdc/micro-tab/sample.txt'
>> AS (year:chararray, temperature:int, quality:int);
For simplicity, the program assumes that the input is tab-delimited text, with each line
having just year, temperature, and quality fields. (Pig actually has more flexibility than
this with regard to the input formats it accepts, as you'll see later.) This line describes
the input data we want to process. The year:chararray notation describes the field's
name and type; a chararray is like a Java string, and an int is like a Java int. The LOAD
operator takes a URI argument; here we are just using a local file, but we could refer
to an HDFS URI. The AS clause (which is optional) gives the fields names to make it
convenient to refer to them in subsequent statements.
An Example | 369
The result of the LOAD operator, indeed any operator in Pig Latin, is a relation, which
is just a set of tuples. A tuple is just like a row of data in a database table, with multiple
fields in a particular order. In this example, the LOAD function produces a set of (year,
temperature, quality) tuples that are present in the input file. We write a relation with
one tuple per line, where tuples are represented as comma-separated items in
parentheses.
Relations are given names, or aliases, so they can be referred to. This relation is given
the records alias. We can examine the contents of an alias using the DUMP operator:
grunt> DUMP records;
We can also see the structure of a relation (the relation's schema) using the
DESCRIBE operator on the relation's alias:
grunt> DESCRIBE records;
records: {year: chararray,temperature: int,quality: int}
This tells us that records has three fields, with aliases year, temperature, and quality,
which are the names we gave them in the AS clause. The fields have the types given to
them in the AS clause, too. We shall examine types in Pig in more detail later.
The second statement removes records that have a missing temperature (indicated by
a value of 9999) or an unsatisfactory quality reading. For this small dataset, no records
are filtered out:
grunt> filtered_records = FILTER records BY temperature != 9999 AND
>> (quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grunt> DUMP filtered_records;
The third statement uses the GROUP function to group the records relation by the
year field. Let's use DUMP to see what it produces:
grunt> grouped_records = GROUP filtered_records BY year;
grunt> DUMP grouped_records;
We now have two rows, or tuples, one for each year in the input data. The first field in
each tuple is the field being grouped by (the year), and the second field is a bag of tuples
for that year. A bag is just an unordered collection of tuples, which in Pig Latin is
represented using curly braces.
By grouping the data in this way, we have created a row per year, so now all that remains
is to find the maximum temperature for the tuples in each bag. Before we do this, let's
understand the structure of the grouped_records relation:
grunt> DESCRIBE grouped_records;
grouped_records: {group: chararray,filtered_records: {year: chararray,
temperature: int,quality: int}}
This tells us that the grouping field is given the alias group by Pig, and the second field
has the same structure as the filtered_records relation that was being grouped. With
this information, we can try the fourth transformation:
grunt> max_temp = FOREACH grouped_records GENERATE group,
>> MAX(filtered_records.temperature);
FOREACH processes every row to generate a derived set of rows, using a GENERATE
clause to define the fields in each derived row. In this example, the first field is
group, which is just the year. The second field is a little more complex.
The filtered_records.temperature reference is to the temperature field of the
filtered_records bag in the grouped_records relation. MAX is a built-in function for
calculating the maximum value of fields in a bag. In this case, it calculates the maximum
temperature for the fields in each filtered_records bag. Let's check the result:
grunt> DUMP max_temp;
So we've successfully calculated the maximum temperature for each year.
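The whole pipeline is short enough to mirror in plain Python, which can help when reasoning about what each Pig statement does. In this sketch the sample rows are made up, and Python lists and dicts stand in for Pig's relations and bags:

```python
# A sketch, in plain Python, of the data flow in the Pig script above:
# filter out bad records, group by year, then take the maximum temperature
# per group. The sample tuples are made up for illustration.

records = [
    ("1949", 111, 1),
    ("1949", 78, 1),
    ("1950", 22, 1),
    ("1950", 9999, 1),   # missing temperature; should be filtered out
]

# FILTER records BY temperature != 9999 AND quality IN (0, 1, 4, 5, 9)
filtered_records = [
    (year, temp, q) for (year, temp, q) in records
    if temp != 9999 and q in (0, 1, 4, 5, 9)
]

# GROUP filtered_records BY year: a dict of year -> bag of tuples
grouped_records = {}
for year, temp, q in filtered_records:
    grouped_records.setdefault(year, []).append((year, temp, q))

# FOREACH grouped_records GENERATE group, MAX(temperature)
max_temp = {year: max(t for (_, t, _) in bag)
            for year, bag in grouped_records.items()}

print(max_temp)   # {'1949': 111, '1950': 22}
```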
Generating Examples
In this example, we've used a small sample dataset with just a handful of rows to make
it easier to follow the data flow and aid debugging. Creating a cut-down dataset is an
art, as ideally it should be rich enough to cover all the cases to exercise your queries
(the completeness property), yet be small enough to reason about by the programmer
(the conciseness property). Using a random sample doesn't work well in general, since
join and filter operations tend to remove all random data, leaving an empty result,
which is not illustrative of the general data flow.
With the ILLUSTRATE operator, Pig provides a tool for generating a reasonably com-
plete and concise dataset. Here is the output from running ILLUSTRATE (slightly re-
formatted to fit the page):
grunt> ILLUSTRATE max_temp;
---------------------------------------------------------------------------
| records          | year:chararray  | temperature:int  | quality:int    |
---------------------------------------------------------------------------
|                  | 1949            | 78               | 1              |
|                  | 1949            | 111              | 1              |
|                  | 1949            | 9999             | 1              |
---------------------------------------------------------------------------
---------------------------------------------------------------------------
| filtered_records | year:chararray  | temperature:int  | quality:int    |
---------------------------------------------------------------------------
|                  | 1949            | 78               | 1              |
|                  | 1949            | 111              | 1              |
---------------------------------------------------------------------------
------------------------------------------------------------------------------------
| grouped_records | group:chararray | filtered_records:bag{:tuple(year:chararray,  |
|                 |                 | temperature:int,quality:int)}                |
------------------------------------------------------------------------------------
|                 | 1949            | {(1949, 78, 1), (1949, 111, 1)}              |
------------------------------------------------------------------------------------
---------------------------------------------
| max_temp | group:chararray | :int         |
---------------------------------------------
|          | 1949            | 111          |
---------------------------------------------
Notice that Pig used some of the original data (this is important to keep the generated
dataset realistic), as well as creating some new data. It noticed the special value 9999
in the query and created a tuple containing this value to exercise the FILTER statement.
In summary, the output of ILLUSTRATE is easy to follow and can help you un-
derstand what your query is doing.
Comparison with Databases
Having seen Pig in action, it might seem that Pig Latin is similar to SQL. The presence
of such operators as GROUP BY and DESCRIBE reinforces this impression. However,
there are several differences between the two languages, and between Pig and RDBMSs
in general.
The most significant difference is that Pig Latin is a data flow programming language,
whereas SQL is a declarative programming language. In other words, a Pig Latin pro-
gram is a step-by-step set of operations on an input relation, in which each step is a
single transformation. By contrast, SQL statements are a set of constraints that, taken
together, define the output. In many ways, programming in Pig Latin is like working
at the level of an RDBMS query planner, which figures out how to turn a declarative
statement into a system of steps.
RDBMSs store data in tables, with tightly predefined schemas. Pig is more relaxed about
the data that it processes: you can define a schema at runtime, but it's optional. Es-
sentially, it will operate on any source of tuples (although the source should support
being read in parallel, by being in multiple files, for example), where a UDF is used to
read the tuples from their raw representation. The most common representation is a
text file with tab-separated fields, and Pig provides a built-in load function for this
format. Unlike with a traditional database, there is no data import process to load the
data into the RDBMS. The data is loaded from the filesystem (usually HDFS) as the
first step in the processing.
Pig's support for complex, nested data structures differentiates it from SQL, which
operates on flatter data structures. Also, Pig's ability to use UDFs and streaming oper-
ators that are tightly integrated with the language and Pig's nested data structures
makes Pig Latin more customizable than most SQL dialects.
There are several features to support online, low-latency queries that RDBMSs have
that are absent in Pig, such as transactions and indexes. As mentioned earlier, Pig does
not support random reads or queries in the order of tens of milliseconds. Nor does it
support random writes to update small portions of data; all writes are bulk, streaming
writes, just like MapReduce.
Hive (covered in Chapter 12) sits between Pig and conventional RDBMSs. Like Pig,
Hive is designed to use HDFS for storage, but otherwise there are some significant
differences. Its query language, HiveQL, is based on SQL, and anyone who is familiar
with SQL would have little trouble writing queries in HiveQL. Like RDBMSs, Hive
mandates that all data be stored in tables, with a schema under its management; how-
ever, it can associate a schema with preexisting data in HDFS, so the load step is
optional. Hive does not support low-latency queries, a characteristic it shares with Pig.
Pig Latin
This section gives an informal description of the syntax and semantics of the Pig Latin
programming language. It is not meant to offer a complete reference to the language,
but there should be enough here for you to get a good understanding of Pig
Latin's constructs.
A Pig Latin program consists of a collection of statements. A statement can be thought
of as an operation or a command. For example, a GROUP operation is a type of
statement:
2. Or, as the Pig Philosophy has it, "Pigs eat anything."
3. Not to be confused with Pig Latin, the language game. English words are translated into Pig Latin by
moving the initial consonant sound to the end of the word and adding an "ay" sound. For example, "pig"
becomes "ig-pay," and "Hadoop" becomes "Adoop-hay."
4. Pig Latin does not have a formal language definition as such, but there is a comprehensive guide to the
language that can be found linked to from the Pig website at http://pig.apache.org/.
Pig Latin | 373
grouped_records = GROUP records BY year;
The command to list the files in a Hadoop filesystem is another example of a statement:
ls /
Statements are usually terminated with a semicolon, as in the example of the GROUP
statement. In fact, this is an example of a statement that must be terminated with a
semicolon: it is a syntax error to omit it. The ls command, on the other hand, does not
have to be terminated with a semicolon. As a general guideline, statements or com-
mands for interactive use in Grunt do not need the terminating semicolon. This group
includes the interactive Hadoop commands, as well as the diagnostic operators like
DESCRIBE. It's never an error to add a terminating semicolon, so if in doubt, it's sim-
plest to add one.
Statements that have to be terminated with a semicolon can be split across multiple
lines for readability:
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
Pig Latin has two forms of comments. Double hyphens are single-line comments.
Everything from the first hyphen to the end of the line is ignored by the Pig Latin
interpreter:
-- My program
DUMP A; -- What's in A?
C-style comments are more flexible since they delimit the beginning and end of the
comment block with /* and */ markers. They can span lines or be embedded in a single
line:
/*
 * Description of my program spanning
 * multiple lines.
 */
A = LOAD 'input/pig/join/A';
B = LOAD 'input/pig/join/B';
C = JOIN A BY $0, /* ignored */ B BY $1;
Pig Latin has a list of keywords that have a special meaning in the language and cannot
be used as identifiers. These include the operators (LOAD, ILLUSTRATE), commands
(cat, ls), expressions (matches, FLATTEN), and functions (DIFF, MAX), all of which
are covered in the following sections.
Pig Latin has mixed rules on case sensitivity. Operators and commands are not case-sensitive
(to make interactive use more forgiving); however, aliases and function names
are case-sensitive.
5. You sometimes see these terms being used interchangeably in documentation on Pig Latin. For example,
"GROUP command," "GROUP operation," "GROUP statement."
As a Pig Latin program is executed, each statement is parsed in turn. If there are syntax
errors, or other (semantic) problems such as undefined aliases, the interpreter will halt
and display an error message. The interpreter builds a logical plan for every relational
operation, which forms the core of a Pig Latin program. The logical plan for the statement
is added to the logical plan for the program so far, then the interpreter moves on
to the next statement.
It's important to note that no data processing takes place while the logical plan of the
program is being constructed. For example, consider again the Pig Latin program from
the first example:
-- max_temp.pig: Finds the maximum temperature by year
records = LOAD 'input/ncdc/micro-tab/sample.txt'
AS (year:chararray, temperature:int, quality:int);
filtered_records = FILTER records BY temperature != 9999 AND
(quality == 0 OR quality == 1 OR quality == 4 OR quality == 5 OR quality == 9);
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group,
  MAX(filtered_records.temperature);
DUMP max_temp;
When the Pig Latin interpreter sees the first line containing the LOAD statement, it
confirms that it is syntactically and semantically correct, and adds it to the logical plan,
but it does not load the data from the file (or even check whether the file exists). Indeed,
where would it load it? Into memory? Even if it did fit into memory, what would it do
with the data? Perhaps not all the input data is needed (since later statements filter it,
for example), so it would be pointless to load it. The point is that it makes no sense to
start any processing until the whole flow is defined. Similarly, Pig validates the GROUP
and FOREACH...GENERATE statements, and adds them to the logical plan without
executing them. The trigger for Pig to start execution is the DUMP statement. At that
point, the logical plan is compiled into a physical plan and executed.
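This deferred-execution model can be illustrated with a short Python sketch. This is an analogy only, not Pig's actual implementation: each "statement" merely appends an operation to a plan, and no data is touched until a dump() call runs the whole pipeline, just as Pig does nothing until DUMP or STORE.

```python
# Conceptual sketch of Pig's deferred execution (an analogy, not Pig internals).
# Statements only append operations to a logical plan; no data is processed
# until dump() is called, at which point the whole plan runs in one pass.

class LogicalPlan:
    def __init__(self):
        self.ops = []  # the logical plan: a list of (name, function) pairs

    def add(self, name, fn):
        self.ops.append((name, fn))  # "parse time": just record the step
        return self

    def dump(self, data):
        # Execution trigger: only now is the plan applied to the data.
        for name, fn in self.ops:
            data = fn(data)
        return data

# Build the plan, analogous to LOAD ... FILTER ... GROUP in max_temp.pig.
plan = LogicalPlan()
plan.add("filter", lambda rows: [r for r in rows if r[1] != 9999])
plan.add("group",  lambda rows: {y: [t for yy, t in rows if yy == y]
                                 for y, _ in rows})
plan.add("max",    lambda groups: {y: max(ts) for y, ts in groups.items()})

records = [("1949", 111), ("1949", 78), ("1950", 9999), ("1950", 22)]
print(plan.dump(records))  # only here does any processing happen
```

Building the plan and running it are cleanly separated, which is what lets Pig inspect and optimize the whole flow before touching any data.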
Multiquery execution
Since DUMP is a diagnostic tool, it will always trigger execution. However, the STORE
command is different. In interactive mode, STORE acts like DUMP and will always
trigger execution (this includes the run command), but in batch mode it will not (this
includes the exec command). The reason for this is efficiency. In batch mode, Pig will
parse the whole script to see if there are any optimizations that could be made to limit
the amount of data to be written to or read from disk. Consider the following simple
example:
A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
C = FILTER A BY $1 != 'banana';
STORE B INTO 'output/b';
STORE C INTO 'output/c';
Relations B and C are both derived from A, so to save reading A twice, Pig can run this
script as a single MapReduce job by reading A once and writing two output files from
the job, one for each of B and C. This feature is called multiquery execution.
In previous versions of Pig that did not have multiquery execution, each STORE statement
in a script run in batch mode triggered execution, resulting in a job for each
STORE statement. It is possible to restore the old behavior by disabling multiquery
execution with the -M or -no_multiquery option to pig.
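The saving can be sketched in a few lines of Python (a conceptual analogy, not Pig internals): instead of scanning the input once per STORE, a single pass routes each record to every output that wants it.

```python
# Conceptual sketch of multiquery execution (an analogy, not Pig internals).

def naive(records):
    # One pass per STORE: the input is scanned twice.
    b = [r for r in records if r[1] == "banana"]
    c = [r for r in records if r[1] != "banana"]
    return b, c

def multiquery(records):
    # Single pass: each record is routed to B or C as it is read,
    # which is what Pig does by writing two outputs from one job.
    b, c = [], []
    for r in records:
        (b if r[1] == "banana" else c).append(r)
    return b, c

rows = [(1, "banana"), (2, "apple"), (3, "banana")]
assert naive(rows) == multiquery(rows)  # same results, half the reads
```

On a real cluster the input scan dominates, so halving the number of reads is a substantial win.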
The physical plan that Pig prepares is a series of MapReduce jobs, which in local mode
Pig runs in the local JVM, and in MapReduce mode Pig runs on a Hadoop cluster.
You can see the logical and physical plans created by Pig using the
EXPLAIN command on a relation (EXPLAIN max_temp; for example).
EXPLAIN will also show the MapReduce plan, which shows how the
physical operators are grouped into MapReduce jobs. This is a good
way to find out how many MapReduce jobs Pig will run for your query.
The relational operators that can be a part of a logical plan in Pig are summarized in
Table 11-1. We shall go through the operators in more detail in "Data Processing
Operators" on page 397.
Table 11-1. Pig Latin relational operators
Category Operator Description
Loading and storing LOAD Loads data from the filesystem or other storage into a relation
STORE Saves a relation to the filesystem or other storage
DUMP Prints a relation to the console
Filtering FILTER Removes unwanted rows from a relation
DISTINCT Removes duplicate rows from a relation
FOREACH...GENERATE Adds or removes fields from a relation
MAPREDUCE Runs a MapReduce job using a relation as input
STREAM Transforms a relation using an external program
SAMPLE Selects a random sample of a relation
Grouping and joining JOIN Joins two or more relations
COGROUP Groups the data in two or more relations
GROUP Groups the data in a single relation
CROSS Creates the cross-product of two or more relations
Sorting ORDER Sorts a relation by one or more fields
LIMIT Limits the size of a relation to a maximum number of tuples
Combining and splitting UNION Combines two or more relations into one
SPLIT Splits a relation into two or more relations
There are other types of statements that are not added to the logical plan. For example,
the diagnostic operators, DESCRIBE, EXPLAIN, and ILLUSTRATE are provided to
allow the user to interact with the logical plan, for debugging purposes (see
Table 11-2). DUMP is a sort of diagnostic operator, too, since it is used only to allow
interactive debugging of small result sets or in combination with LIMIT to retrieve a
few rows from a larger relation. The STORE statement should be used when the size
of the output is more than a few lines, as it writes to a file, rather than to the console.
Table 11-2. Pig Latin diagnostic operators
Operator Description
DESCRIBE Prints a relation’s schema
EXPLAIN Prints the logical and physical plans
ILLUSTRATE Shows a sample execution of the logical plan, using a generated subset of the input
Pig Latin provides three statements, REGISTER, DEFINE and IMPORT, to make it
possible to incorporate macros and user-defined functions into Pig scripts (see
Table 11-3).
Table 11-3. Pig Latin macro and UDF statements
Statement Description
REGISTER Registers a JAR file with the Pig runtime
DEFINE Creates an alias for a macro, a UDF, streaming script, or a command specification
IMPORT Imports macros defined in a separate file into a script
Since they do not process relations, commands are not added to the logical plan; instead,
they are executed immediately. Pig provides commands to interact with Hadoop
filesystems (which are very handy for moving data around before or after processing
with Pig) and MapReduce, as well as a few utility commands (described in Table 11-4).
Table 11-4. Pig Latin commands
Category Command Description
Hadoop Filesystem cat Prints the contents of one or more files
cd Changes the current directory
copyFromLocal Copies a local file or directory to a Hadoop filesystem
copyToLocal Copies a file or directory on a Hadoop filesystem to the local filesystem
cp Copies a file or directory to another directory
fs Ac