Lesson 4 Notes
My name is Andy, and I'm going to be presenting this lesson, where we are going to learn about design patterns in MapReduce. The patterns that we will cover are templates for solving common problems with MapReduce. These patterns achieve a good balance between flexibility and rigidity: they are general enough that they can be adapted to solve lots of problems, but specific enough that they don't require too much effort to use. The goal here is to give you familiarity (but not necessarily mastery) with some of these patterns. We won't discuss where these patterns came from or why they were chosen. We are just going to introduce them. Note: these patterns come from the book MapReduce Design Patterns by Donald Miner & Adam Shook.
Not every problem can be solved with MapReduce. In fact, problems have to be manipulated to fit into this framework of mapping and reducing. But luckily programmers have developed a sophisticated arsenal of patterns to make some potentially unexpected problems "mapreducable". In this lesson, we'll introduce the following patterns.
Copyright 2014 Udacity, Inc. All Rights Reserved.
Filtering Patterns: don't change records. Only get a part of the data!
    Examples: sampling, random sampling, top 10
Summarization Patterns: give a top-level view of data.
    Examples: counting, minimum, maximum, mean, median, standard deviation, inverted index
Take a break to discuss combiners.
Structured to Hierarchical Patterns:
    Examples: combining 2 datasets.
This lesson will go fast. You probably won't remember everything about these patterns, but that's okay. You'll know that they exist and you'll be able to consult them when you find yourself facing a problem that you think could be molded to fit one of these patterns.
Filtering Patterns
These patterns have one thing in common: they don't change the actual records. Records that pass the filter are output exactly the way they came in. These patterns all find a subset of data, whether it be small, like a top-ten listing, or large, like all results from the last year from a dataset that contains records for 5 years. There are several possible applications of filtering patterns:
A simple filter boils down to having a function that, given an input, returns "keep" or "throw away". The mapper applies this function to every input element, and accordingly the output will be a filtered subset of the input containing only data that is interesting to you.
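As a rough sketch of that idea in Python, the mapper below applies a keep-or-throw-away function to every input line and passes the survivors through unchanged. The predicate `keep`, its year-based rule, and the tab-separated layout with a date in the first field are all assumptions made up for illustration; they are not the format of any dataset used in this course.

```python
def keep(record):
    """Hypothetical predicate: keep records whose first tab-separated
    field (assumed to be a date such as '2013-05-01') is from 2013."""
    fields = record.split("\t")
    return fields[0].startswith("2013")


def filter_mapper(lines, predicate=keep):
    """Yield every input record that passes the predicate, unchanged."""
    for line in lines:
        record = line.rstrip("\n")
        if predicate(record):
            yield record


# Tiny in-memory stand-in for the mapper's input.
sample = [
    "2012-11-30\tstore A\t12.50\n",
    "2013-01-02\tstore B\t7.25\n",
    "2013-06-17\tstore A\t3.10\n",
]
kept = list(filter_mapper(sample))
```

In a streaming job you would most likely pass `sys.stdin` as `lines` and print each kept record. A pure filter like this needs no reducer logic at all.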
Sampling is about pulling out a subset of the data for future processing. Typically, you'd use sampling to extract a smaller but representative dataset on which you could then perform further analysis. However, taking just the first 10,000 records from a massive dataset might not be the best idea because it might not be representative. Instead, you could use Python's random number generator to return just, say, 1% of the data by only passing it through the mapper if a random value between 1 and 100 equalled 1.
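Here is one way that 1% sampling rule might look in Python. The function name `sample_mapper` and the optional `rng` parameter (added so that a run can be made repeatable) are our own choices for this sketch.

```python
import random


def sample_mapper(lines, rng=None):
    """Yield roughly 1% of the input records, chosen at random."""
    if rng is None:
        rng = random.Random()
    for line in lines:
        # randint(1, 100) returns an integer from 1 to 100 inclusive,
        # so it equals 1 about one time in a hundred.
        if rng.randint(1, 100) == 1:
            yield line.rstrip("\n")


# Illustrative input: 100,000 fake records, sampled with a fixed seed.
records = ["record %d\n" % i for i in range(100000)]
picked = list(sample_mapper(records, random.Random(42)))
```

The sample size is only approximately 1% of the input, since every record is kept or dropped independently of the others.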
Top 10: a very interesting case; we all love to see top lists!
Sampling
Let's have some practice with filtering! This time we will work with a dataset that is generated by our very own Udacity students: our forums! We are interested in getting a subset of data from the forum that contains only posts that are 1 sentence or less (so, the post contains only one period/!/? character, and the period is the last significant character in the post).
Top10
An interesting application of MapReduce is making top-N record lists. In an RDBMS you would normally first sort the data, then take the top N records. In MapReduce this kind of approach will not work, because the data is not sorted and is processed on several machines. Thus, the mappers will first have to find their own top-N lists, without sorting the data, and then send the local lists to the reducers, who can then find the global top-N list.
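The two stages can be sketched in Python as follows. This is only one possible approach: it assumes a single reducer receives every local list, and the function names `local_top_n` and `global_top_n`, the heap-based bookkeeping, and the sample posts are all illustrative.

```python
import heapq


def local_top_n(records, n, key):
    """Mapper side: keep only the n highest-scoring records seen so far
    in a min-heap, so the full input is never sorted."""
    heap = []  # entries are (score, sequence number, record)
    for seq, record in enumerate(records):
        entry = (key(record), seq, record)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            # The new record beats the smallest of the current top n.
            heapq.heapreplace(heap, entry)
    return [record for _, _, record in heap]


def global_top_n(local_lists, n, key):
    """Reducer side: merge the mappers' local lists into the global
    top n, highest score first."""
    candidates = [record for lst in local_lists for record in lst]
    return heapq.nlargest(n, candidates, key=key)


# Two pretend mappers, each holding part of the forum posts.
mapper_1_posts = ["hi", "a longer post", "ok then"]
mapper_2_posts = ["the longest post of all", "short", "medium post"]
local_lists = [local_top_n(mapper_1_posts, 2, len),
               local_top_n(mapper_2_posts, 2, len)]
top_two = global_top_n(local_lists, 2, len)
```

Each mapper sends at most N records over the network, no matter how much input it reads.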
Let's find the top 5 longest posts in our forum!
Summarization Patterns
The next set of design patterns we will discuss will be patterns that produce a top-level, summarized view of your data. This is a very powerful approach to get a bird's-eye view of your data, something that you cannot get by just manually looking at parts of the data. You can group similar data together, for example by day, or by time of day, or by username, and then calculate a statistic of some value, like min/max/average/mean etc. You can also build an index, or just simply count occurrences of something.
There are a couple of problems that can be solved by using a similar approach:
    numerical summarizations: finding the count of a value, as well as min, max, avg, median and other numerical values for your dataset
    inverted index: important when building a search engine or full-text search functionality for your own website or application.
You can use built-in Hadoop functionality to perform some of these operations more efficiently, and we talk about that as well.
Inverted Index
It is often necessary to build a reverse index from a dataset, to enable faster searching. The obvious example would be a web search engine. You need to create a mapping from keywords to web links, to enable faster finding of relevant information. Think of it as an index for a book: you have a word, or term, and all the pages where you can find this term.
While the pattern itself is simple (the mapper outputs each word as the key, with the link being the value), you have to be conscious of the fact that this type of MapReduce job is susceptible to uneven key distribution. For example, you can assume that the word "the" will be found a lot more in a text than other words. You have to think about whether you really need to include such words in the index.
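A small Python sketch of the pattern might look like this. The stop-word list, the punctuation stripping, and the page identifiers are assumptions for the example; a real index would need more careful tokenizing.

```python
from collections import defaultdict

# Illustrative stop words that would otherwise become very heavy keys.
STOP_WORDS = {"the", "a", "an", "of", "and"}


def index_mapper(doc_id, text):
    """Emit one (word, doc_id) pair per non-stop word in the text."""
    for word in text.lower().split():
        word = word.strip(".,!?;:\"'()")
        if word and word not in STOP_WORDS:
            yield word, doc_id


def index_reducer(pairs):
    """Collect, for every word, the sorted list of documents containing it."""
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Two made-up documents keyed by a hypothetical page id.
docs = {"page1": "The cat and the hat.", "page2": "A hat of straw"}
pairs = [pair for doc_id, text in docs.items()
         for pair in index_mapper(doc_id, text)]
index = index_reducer(pairs)
```

Dropping the stop words in the mapper means the heaviest keys never reach the shuffle, which is one way to ease the uneven key distribution described above.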
Numerical summarizations
Common uses for this type of analysis are:
    word or record count (that you already used when analyzing web server log files). The mapper just outputs the thing that you are interested in as a key, and 1 as a value. The reducer can then just sum up the values. Another very popular example is word count, which might be the canonical "Hello, World" for MapReduce: count how many times a word appears in a dataset.
    min/max/count for a particular event, for example the first/last time a user posted on a forum, or the first/last time and how many times a particular item was bought in a shop.
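Both uses in the list above have the same shape, which the following Python sketch tries to show: the mapper emits a key and a value, and the reducer keeps a running count, minimum and maximum per key. The names `word_mapper` and `summarize`, and the sample forum data, are invented for this example.

```python
def word_mapper(lines):
    """Emit (word, 1) for every word, as in the classic word count."""
    for line in lines:
        for word in line.split():
            yield word, 1


def summarize(pairs):
    """Reducer side: track count, minimum and maximum value per key."""
    summary = {}
    for key, value in pairs:
        if key not in summary:
            summary[key] = {"count": 1, "min": value, "max": value}
        else:
            entry = summary[key]
            entry["count"] += 1
            entry["min"] = min(entry["min"], value)
            entry["max"] = max(entry["max"], value)
    return summary


# Word count: only the "count" field is interesting here.
word_counts = summarize(word_mapper(["to be or not to be"]))

# First/last post per user: ISO dates compare correctly as strings.
posts = [("alice", "2014-01-03"), ("bob", "2014-01-05"),
         ("alice", "2014-02-01"), ("alice", "2013-12-25")]
post_stats = summarize(posts)
```

Here `post_stats["alice"]` holds the number of posts together with the first and last posting dates for that user.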
Let's say that you want to know if there is any correlation between day of week and how much money people spend on items. Your task is to find out the mean and standard deviation for sales per day of week.
The mapper only needs to output the day of week as the key and the sale amount as the value. The reducer has to do all the maths.
In general, if you need to find out the average/median/standard deviation, these have to be calculated in the reducer.
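The split of work between the mapper and the reducer could be sketched in Python like this. The input layout (a weekday name, a tab, then an amount) is an assumption, and the sketch computes the population standard deviation; if you want the sample standard deviation, divide by n - 1 instead of n.

```python
import math
from collections import defaultdict


def sales_mapper(lines):
    """Emit (weekday, amount) from lines assumed to look like 'Mon<TAB>12.50'."""
    for line in lines:
        weekday, amount = line.rstrip("\n").split("\t")
        yield weekday, float(amount)


def stats_reducer(pairs):
    """Return {weekday: (mean, population standard deviation)}."""
    values = defaultdict(list)
    for weekday, amount in pairs:
        values[weekday].append(amount)
    result = {}
    for weekday, amounts in values.items():
        mean = sum(amounts) / len(amounts)
        variance = sum((x - mean) ** 2 for x in amounts) / len(amounts)
        result[weekday] = (mean, math.sqrt(variance))
    return result


# Made-up sales records for illustration.
lines = ["Mon\t10.0\n", "Mon\t20.0\n", "Tue\t5.0\n"]
stats = stats_reducer(sales_mapper(lines))
```

Note that this reducer holds every amount for a weekday in memory, which is fine for a sketch but may not suit a very large dataset.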
Combiners
To calculate the mean in that last problem, what did you do? You may have done something like this:
1. Your mapper may have gone through the records and output a key-value pair that looked like: day of week, value.
2. For each day of the week, your reducer kept a running total of the value as well as a count of the number of records.
3. You divided the total value by the number of records to get the mean.
Now, let's think about WHERE each of these steps took place.
1. On various machines throughout the network.
2. On ONE machine in the network.
3. Again, ONE machine.
But there's a problem here. That second step involves moving a lot of data around your network. What if we could do some of the reduction locally before sending the data to the reducers? We can! We can use combiners! Combiners will, in essence, do as much reduction as possible locally before sending that data to the reducers.
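To make the idea concrete, here is a toy simulation in Python of a sum computed with a combiner. It runs in a single process, so the "mappers" are just lists; the function names and the sample data are invented for illustration and say nothing about how Hadoop schedules combiners internally.

```python
from collections import defaultdict


def combine(pairs):
    """Combiner: sum the values for each key seen by ONE mapper."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def reduce_totals(combined_outputs):
    """Reducer: add up the partial sums arriving from every combiner."""
    totals = defaultdict(float)
    for output in combined_outputs:
        for key, partial in output:
            totals[key] += partial
    return dict(totals)


# Output of two pretend mappers: six records in total.
mapper_1_output = [("Mon", 10.0), ("Mon", 5.0), ("Tue", 1.0)]
mapper_2_output = [("Mon", 2.5), ("Tue", 4.0), ("Tue", 4.0)]

combined = [combine(mapper_1_output), combine(mapper_2_output)]
records_sent = sum(len(output) for output in combined)
totals = reduce_totals(combined)
```

Only four records reach the reducer instead of six, and the totals are the same as they would be without the combiner.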
Using a combiner might save significant network traffic if you have a lot of records but far fewer unique keys. You will need to add an additional command to the command-line script to use this functionality, or use the full java command. Please see the instructor comments for detailed information.
When you run a job, you will see some output on the screen, which includes a tracking URL:
Open it in a browser.
You will see a job page containing information about the job as it is being run.
Here are the comparison screenshots from 2 jobs, one run without a combiner and one with a combiner:
As you can see, when using a combiner (second screenshot), the reducers get significantly fewer records and have to shuffle fewer bytes than without a combiner. Without a combiner: 4,138,476 records; with a combiner: 412 records. While it does not lead to time savings on a single-node pseudo-distributed cluster run in a VM, like you have, in the real world it could save a lot of network traffic.
Generally, we want our combiners to do the same thing as our reducers. Also, keep in mind that this intermediate step may introduce some complications when we try to do our computation. You'll encounter that in the next exercise.
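One example of such a complication, using the mean from the earlier sales problem: if each combiner emitted a local mean, the reducer would end up averaging averages, which is generally not the overall mean. A common workaround, sketched below with invented function names and data, is to have the combiner pass along a partial sum and a record count, and to divide only once in the reducer.

```python
def mean_combiner(pairs):
    """Combiner: emit (key, (partial sum, record count)) for one mapper."""
    partials = {}
    for key, value in pairs:
        total, count = partials.get(key, (0.0, 0))
        partials[key] = (total + value, count + 1)
    return sorted(partials.items())


def mean_reducer(combined_outputs):
    """Reducer: add up the partial sums and counts, then divide once."""
    merged = {}
    for output in combined_outputs:
        for key, (total, count) in output:
            running_total, running_count = merged.get(key, (0.0, 0))
            merged[key] = (running_total + total, running_count + count)
    return {key: total / count for key, (total, count) in merged.items()}


# Two pretend mappers holding Monday sales.
mapper_1_output = [("Mon", 10.0), ("Mon", 20.0)]
mapper_2_output = [("Mon", 60.0)]

means = mean_reducer([mean_combiner(mapper_1_output),
                      mean_combiner(mapper_2_output)])

# The naive "mean of local means" gives a different, wrong answer.
naive_mean = ((10.0 + 20.0) / 2 + 60.0 / 1) / 2
```

In this sketch the combiner does not do exactly the same thing as the reducer: it stops short of the final division.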
Calculating the Sum with Combiners
Try calculating the sum of values, like you did in lesson 3, but this time use a combiner.
Structured to Hierarchical Patterns
When migrating data from an RDBMS to a Hadoop system, one of the first things you should consider doing is reformatting your data. Since Hadoop doesn't care what format your data is in, you can take advantage of hierarchical data to avoid doing live joins of several datasets at the time of analysis. If you know what kind of information you will want to get later, you might save significant time by reformatting the data.
You can use this pattern if you have data sources that are linked by some set of foreign keys and your data is structured and row-based.
Combine 2 datasets
You might want to know what the reputation of the author of a post is. Combine the post and user tables to produce a simple dataset that has all the relevant information together, or, in RDBMS terms, is denormalized.
The post table will contain information like this:
The user table will have this:
We want to combine the tables, so that each post contains information about the reputation and badges for the author, so that we can run some analysis on the content of a post in relation to the reputation of a user. For example, is there a correlation between post length and user reputation?
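One common way to combine two datasets like these is a reduce-side join, sketched here in Python. The field layouts are hypothetical, since the actual columns of the post and user tables are not reproduced in these notes: posts are assumed to be (post id, author id, body) and users (user id, reputation), with badges left out to keep the sketch short.

```python
from collections import defaultdict


def tag_records(posts, users):
    """Mapper side: key every record by user id and tag its source table."""
    for post_id, author_id, body in posts:
        yield author_id, ("post", (post_id, body))
    for user_id, reputation in users:
        yield user_id, ("user", reputation)


def join_reducer(tagged):
    """Reducer side: attach each author's reputation to their posts."""
    grouped = defaultdict(lambda: {"user": None, "posts": []})
    for key, (source, payload) in tagged:
        if source == "user":
            grouped[key]["user"] = payload
        else:
            grouped[key]["posts"].append(payload)
    joined = []
    for user_id, group in grouped.items():
        for post_id, body in group["posts"]:
            joined.append((post_id, user_id, group["user"], body))
    return sorted(joined)


# Made-up rows for illustration.
posts = [(1, "u1", "Hello"), (2, "u2", "A much longer post"), (3, "u1", "Hi again")]
users = [("u1", 120), ("u2", 15)]
joined = join_reducer(tag_records(posts, users))
```

Each joined row now carries the post body next to the author's reputation, so a question such as post length versus reputation can be answered from a single dataset.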
Conclusion
We've covered several patterns. What good are they?
Well, you'll have to practice with them to have them become second nature, but for now you can ask yourself when solving a problem: Does this problem fit into one of the patterns I learned? Is it a filtering problem? A summarization problem? A hierarchical data problem?
These are not all the patterns you may want to use. And getting yourself to ask this question is not simple. In fact, just identifying when MapReduce is the appropriate tool for a problem is a large part of what it means to be a MapReduce expert.
If you want to gain this expertise, you'll have to practice. This may happen naturally in a job environment, or you could make it happen deliberately by seeking out problems and trying to solve them with this framework.
If you enjoy learning about these patterns, you can find more in the book.
Nice work and congratulations. Now, on to the final project! Good luck.