26/08/2015

24 Hadoop Interview Questions & Answers for MapReduce Developers | FromDev
http://www.fromdev.com/2010/12/interviewquestionshadoopmapreduce.html
A good understanding of Hadoop architecture is required to leverage the power of Hadoop. Below are a few important practical questions that can be asked of a senior, experienced Hadoop developer in an interview. I learned the answers to them during my CCHD (Cloudera Certified Hadoop Developer) certification. I hope you will find them useful. This list primarily includes questions related to Hadoop architecture, MapReduce, the Hadoop API, and the Hadoop Distributed File System (HDFS).

Hadoop is the most popular platform for big data analysis. The Hadoop ecosystem is huge and involves many supporting frameworks and tools to effectively run and manage it. This article focuses on the core Hadoop concepts and its technique for handling enormous data.

Below is a list of Hadoop interview questions and answers that may prove useful for beginners and experts alike. These are a common set of questions that you may face at a big data job interview or a Hadoop certification exam (such as CCHD).
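Before diving into the questions, the map, shuffle, and reduce phases themselves can be sketched as a tiny in-memory simulation. This is plain Python with no Hadoop involved, purely to illustrate the data flow; a word count is the classic example:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every input record."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["hadoop stores data", "hadoop processes data"]
result = reduce_phase(shuffle(map_phase(docs)))
print(result)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

In real Hadoop these three phases run distributed across many machines, but the logical contract is exactly this: mappers emit key-value pairs, the framework groups them by key, and reducers fold each group into a result.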

What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster?
JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There is only one JobTracker process running on any Hadoop cluster. JobTracker runs in its own JVM process. In a typical production cluster it runs on a separate machine. Each slave node is configured with the JobTracker node location. The JobTracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted. The JobTracker in Hadoop performs the following actions (from the Hadoop wiki):

- Client applications submit jobs to the JobTracker.
- The JobTracker talks to the NameNode to determine the location of the data.
- The JobTracker locates TaskTracker nodes with available slots at or near the data.
- The JobTracker submits the work to the chosen TaskTracker nodes.
- The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.
- A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
- When the work is completed, the JobTracker updates its status.
- Client applications can poll the JobTracker for information.

How does the JobTracker schedule a task?

The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if that fails, it looks for an empty slot on a machine in the same rack.
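The node-local-then-rack-local preference can be sketched as follows. This is a simplified, framework-free Python illustration; the function and variable names are invented for the sketch and are not Hadoop APIs:

```python
def pick_task_tracker(block_hosts, rack_of, free_slots):
    """Pick a TaskTracker for a task whose input block lives on block_hosts.

    Preference order mirrors the JobTracker's behavior described above:
    node-local first, then rack-local, then any node with a free slot.
    """
    # 1. A free slot on a node that actually holds a replica of the block.
    for host in block_hosts:
        if free_slots.get(host, 0) > 0:
            return host
    # 2. A free slot on another node in the same rack as a replica.
    block_racks = {rack_of[h] for h in block_hosts}
    for host, slots in free_slots.items():
        if slots > 0 and rack_of[host] in block_racks:
            return host
    # 3. Fall back to any node in the cluster with a free slot.
    for host, slots in free_slots.items():
        if slots > 0:
            return host
    return None  # no capacity anywhere; the task waits

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
# n1 holds the data but has no free slot, so its rack-mate n2 is chosen:
print(pick_task_tracker(["n1"], rack_of, {"n1": 0, "n2": 2, "n3": 2}))  # n2
```

The real scheduler also weighs job priorities and per-job limits, but the locality cascade is the core idea: moving computation to the data is cheaper than moving data to the computation.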

What is a TaskTracker in Hadoop? How many instances of TaskTracker run on a Hadoop cluster?

A TaskTracker is a slave-node daemon in the cluster that accepts tasks (Map, Reduce, and Shuffle operations) from a JobTracker. There is only one TaskTracker process running on any Hadoop slave node. The TaskTracker runs in its own JVM process. Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this is to ensure that a process failure does not take down the TaskTracker. The TaskTracker monitors these task instances, capturing the output and exit codes. When the task instances finish, successfully or not, the TaskTracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.

What is a Task Instance in Hadoop? Where does it run?

Task instances are the actual MapReduce tasks that are run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this is to ensure that a process failure does not take down the TaskTracker. Each Task Instance runs in its own JVM process. There can be multiple task instance processes running on a slave node, based on the number of slots configured on the TaskTracker. By default, a new task instance JVM process is spawned for each task.

How many daemon processes run on a Hadoop system?

Hadoop is comprised of five separate daemons. Each of these daemons runs in its own JVM. The following three daemons run on master nodes:

- NameNode: stores and maintains the metadata for HDFS.
- Secondary NameNode: performs housekeeping functions for the NameNode.
- JobTracker: manages MapReduce jobs, distributing individual tasks to machines running the TaskTracker.

The following two daemons run on each slave node:

- DataNode: stores actual HDFS data blocks.
- TaskTracker: responsible for instantiating and monitoring individual Map and Reduce tasks.

What is the configuration of a typical slave node on a Hadoop cluster? How many JVMs run on a slave node?

- A single instance of a TaskTracker runs on each slave node. The TaskTracker runs as a separate JVM process.
- A single instance of a DataNode daemon runs on each slave node. The DataNode daemon runs as a separate JVM process.
- One or more instances of a Task Instance run on each slave node. Each task instance runs as a separate JVM process. The number of task instances can be controlled by configuration. Typically a high-end machine is configured to run more task instances.

What is the difference between HDFS and NAS?

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. The following are differences between HDFS and NAS:

- In HDFS, data blocks are distributed across the local drives of all machines in a cluster, whereas in NAS data is stored on dedicated hardware.
- HDFS is designed to work with the MapReduce system, since computation is moved to the data. NAS is not suitable for MapReduce, since data is stored separately from the computations.
- HDFS runs on a cluster of machines and provides redundancy using a replication protocol, whereas NAS is provided by a single machine and therefore does not provide data redundancy.

How does the NameNode handle DataNode failures?

The NameNode periodically receives a heartbeat and a block report from each of the DataNodes in the cluster. Receipt of a heartbeat implies that the DataNode is functioning properly. A block report contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a DataNode after a certain amount of time, the DataNode is marked as dead. Since its blocks will then be under-replicated, the system begins replicating the blocks that were stored on the dead DataNode. The NameNode orchestrates the replication of data blocks from one DataNode to another. The replication data transfer happens directly between DataNodes, and the data never passes through the NameNode.
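The re-replication step can be sketched as follows; a simplified Python illustration where the block and node names are invented for the sketch:

```python
def under_replicated_blocks(block_locations, live_nodes, replication=3):
    """After a DataNode is marked dead, find every block whose live
    replica count fell below the replication factor, and by how much."""
    todo = {}
    for block, nodes in block_locations.items():
        live = [n for n in nodes if n in live_nodes]
        if len(live) < replication:
            todo[block] = replication - len(live)  # replicas still needed
    return todo

blocks = {"blk_1": ["n1", "n2", "n3"], "blk_2": ["n1", "n4", "n5"]}
# Node n1 missed its heartbeats and is declared dead:
print(under_replicated_blocks(blocks, live_nodes={"n2", "n3", "n4", "n5"}))
# {'blk_1': 1, 'blk_2': 1}
```

Each entry in the result is a block the NameNode would schedule for copying from a surviving replica to another live DataNode.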

Does the MapReduce programming model provide a way for reducers to communicate with each other? In a MapReduce job, can a reducer communicate with another reducer?

No, the MapReduce programming model does not allow reducers to communicate with each other. Reducers run in isolation.

Can I set the number of reducers to zero?

Yes, setting the number of reducers to zero is a valid configuration in Hadoop. When you set the reducers to zero, no reducers will be executed, and the output of each mapper will be stored in a separate file on HDFS. (This is different from the case when the reducers are set to a number greater than zero, where the mappers' output (intermediate data) is written to the local file system (not HDFS) of each mapper's slave node.)

Where is the mapper output (intermediate key-value data) stored?

The mapper output (intermediate data) is stored on the local file system (not HDFS) of each individual mapper node. This is typically a temporary directory location, which can be set in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.

What are combiners? When should I use a combiner in my MapReduce job?

Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally, on individual mapper outputs. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of a combiner is not guaranteed; Hadoop may or may not execute a combiner, and if required it may execute it more than once. Therefore, your MapReduce jobs should not depend on the combiner's execution.
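To see why a combiner helps, here is a small framework-free sketch (plain Python, illustrative only) of a word-count mapper whose output shrinks after local combining:

```python
from collections import Counter

def mapper(line):
    """Emit a (word, 1) pair for each word in one input line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: sum counts within a single mapper's output before the
    shuffle. Safe here because summation is commutative and associative."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

line = "to be or not to be"
raw = mapper(line)          # 6 pairs would cross the network
combined = combine(raw)     # only 4 pairs after local aggregation
print(len(raw), len(combined))  # 6 4
```

The combiner changed only the volume, not the answer: the reducer still receives the same totals. That equivalence is exactly what the commutative-and-associative requirement guarantees; an operation like computing a mean would break it.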

What are the Writable and WritableComparable interfaces?

- org.apache.hadoop.io.Writable is a Java interface. Any key or value type in the Hadoop MapReduce framework implements this interface. Implementations typically provide a static read(DataInput) method which constructs a new instance, calls readFields(DataInput), and returns the instance.
- org.apache.hadoop.io.WritableComparable is a Java interface. Any type which is to be used as a key in the Hadoop MapReduce framework should implement this interface. WritableComparable objects can be compared to each other using Comparators.
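The write/readFields contract can be mimicked outside Java to make it concrete. The sketch below is plain Python standing in for the Java interface; the class and method names are illustrative, not a real Hadoop API:

```python
import io
import struct

class IntPairWritable:
    """Pure-Python sketch of the Writable contract: a type serializes
    itself with write(out) and repopulates itself from readFields(in_)."""

    def __init__(self, first=0, second=0):
        self.first, self.second = first, second

    def write(self, out):
        # Big-endian ints, roughly what Java's DataOutput would emit.
        out.write(struct.pack(">ii", self.first, self.second))

    def read_fields(self, in_):
        self.first, self.second = struct.unpack(">ii", in_.read(8))

    @staticmethod
    def read(in_):
        """Mirrors the conventional static read(DataInput) factory."""
        instance = IntPairWritable()
        instance.read_fields(in_)
        return instance

buf = io.BytesIO()
IntPairWritable(3, 7).write(buf)
buf.seek(0)
pair = IntPairWritable.read(buf)
print(pair.first, pair.second)  # 3 7
```

The point of the contract is the round trip: whatever write() emits, read_fields() must reconstruct, so the framework can ship keys and values between mappers and reducers as raw bytes.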

What is the Hadoop MapReduce API contract for a key and value class?

- The key must implement the org.apache.hadoop.io.WritableComparable interface.
- The value must implement the org.apache.hadoop.io.Writable interface.

What are IdentityMapper and IdentityReducer in MapReduce?

- org.apache.hadoop.mapred.lib.IdentityMapper implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default.
- org.apache.hadoop.mapred.lib.IdentityReducer performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the reducer class using JobConf.setReducerClass, then IdentityReducer.class is used as the default.

What is the meaning of speculative execution in Hadoop? Why is it important?

Speculative execution is a way of coping with individual machine performance. In large clusters, where hundreds or thousands of machines are involved, there may be machines which are not performing as fast as others. This may delay a full job because of a single slow machine. To avoid this, speculative execution in Hadoop can run multiple copies of the same map or reduce task on different slave nodes. The results from the first node to finish are used.

When are the reducers started in a MapReduce job?

In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The programmer-defined reduce method is called only after all the mappers have finished.

If reducers do not start before all mappers finish, then why does the progress on a MapReduce job show something like Map (50%) Reduce (10%)? Why is the reducers' progress percentage displayed when the mappers are not finished yet?

Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes into account this data transfer, which is done by the reduce process, so the reduce progress starts showing up as soon as any intermediate key-value pair from a mapper is available to be transferred to a reducer. Though the reducer progress is updated, the programmer-defined reduce method is still called only after all the mappers have finished.

What is HDFS? How is it different from traditional file systems?

HDFS, the Hadoop Distributed File System, is responsible for storing huge data on the cluster. It is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.

- HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
- HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
- HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but read it one or more times, and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files.

What is the HDFS block size? How is it different from the traditional file system block size?

In HDFS, data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size. Each block is replicated multiple times; the default is to replicate each block three times. Replicas are stored on different nodes. HDFS utilizes the local file system to store each HDFS block as a separate file. The HDFS block size cannot be compared with the traditional file system block size.
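The block arithmetic is straightforward. As a quick sketch (plain Python; the helper name is hypothetical, not a Hadoop API):

```python
import math

def hdfs_blocks(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """Number of HDFS blocks a file occupies. The last block may be
    smaller than the block size; HDFS does not pad it to full size."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

# A 200 MB file with a 64 MB block size: three full blocks + one 8 MB block.
print(hdfs_blocks(200 * 1024 * 1024))  # 4
```

With the default replication factor of three, that 200 MB file would occupy 12 block files spread across the cluster's DataNodes.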

What is a NameNode? How many instances of NameNode run on a Hadoop cluster?

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself. There is only one NameNode process running on any Hadoop cluster. The NameNode runs in its own JVM process. In a typical production cluster it runs on a separate machine. The NameNode is a single point of failure for the HDFS cluster. When the NameNode goes down, the file system goes offline. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.

What is a DataNode? How many instances of DataNode run on a Hadoop cluster?

A DataNode stores data in the Hadoop file system, HDFS. There is only one DataNode process running on any Hadoop slave node. The DataNode runs in its own JVM process. On startup, a DataNode connects to the NameNode. DataNode instances can talk to each other, mostly when replicating data.

How does the client communicate with HDFS?

Client communication with HDFS happens using the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives. Client applications can talk directly to a DataNode once the NameNode has provided the location of the data.

How are the HDFS blocks replicated?

HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding the replication of blocks. HDFS uses a rack-aware replica placement policy. In the default configuration there are a total of three copies of a data block on HDFS: two copies are stored on DataNodes on the same rack, and the third copy is on a different rack.
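The default three-replica layout described above can be sketched as follows. This is an illustrative Python simulation with made-up node names; the real placement policy also weighs node load and free disk space:

```python
import random

def place_replicas(nodes_by_rack, local_rack):
    """Sketch of rack-aware placement for replication factor 3:
    two replicas on one rack, the third on a different rack, so a
    whole-rack failure can lose at most two of the three copies."""
    local = random.sample(nodes_by_rack[local_rack], 2)
    remote_rack = random.choice(
        [rack for rack in nodes_by_rack if rack != local_rack])
    remote = random.choice(nodes_by_rack[remote_rack])
    return local + [remote]

nodes = {"rack1": ["n1", "n2", "n3"], "rack2": ["n4", "n5"]}
replicas = place_replicas(nodes, "rack1")
print(replicas)  # e.g. ['n2', 'n1', 'n5']: two rack1 nodes plus one rack2 node
```

The trade-off in the layout is deliberate: keeping two copies on one rack limits cross-rack write traffic, while the off-rack copy survives the failure of an entire rack switch.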

Can you think of a question which is not part of this post? Please don't forget to share it with me in the comments section, and I will try to include it in the list.

Posted by Sachin FromDev
