
Intro to Hadoop and MapReduce

Lesson 4 Notes

Designing With Patterns


My name is Andy, and I'm going to be presenting this lesson, where we are going to
learn about design patterns in MapReduce. The patterns that we will cover are
templates for solving common problems with MapReduce. These patterns achieve
a good balance between flexibility and rigidity: they are general enough that they
can be adapted to solve lots of problems, but specific enough that they don't require
too much effort to use. The goal here is to give you familiarity (but not
necessarily mastery) with some of these patterns. We won't discuss where
these patterns came from or why they were chosen. We are just going to
introduce them. Note: these patterns come from the book MapReduce Design
Patterns by Donald Miner & Adam Shook.

What we will Cover

Not every problem can be solved with MapReduce. In fact, problems have to be manipulated to
fit into this framework of mapping and reducing. But luckily programmers have developed a
sophisticated arsenal of patterns to make some potentially unexpected problems
map-reducible. In this lesson, we'll introduce the following patterns.

Copyright 2014 Udacity, Inc. All Rights Reserved.

- Filtering Patterns: don't change records, only get a part of the data.
  Examples: sampling, random sampling, top 10.
- Summarization Patterns: give a top-level view of the data.
  Examples: counting, minimum, maximum, mean, median, standard deviation,
  inverted index.
- Take a break to discuss combiners.
- Structured to Hierarchical Patterns.
  Examples: combining 2 datasets.

This lesson will go fast. You probably won't remember everything about these patterns, but that's
okay. You'll know that they exist and you'll be able to consult them when you find yourself facing
a problem that you think could be molded to fit one of these patterns.

Filtering Patterns

These patterns have one thing in common: they don't change the actual records. Records that
pass the filter are output exactly the way they came in. These patterns all find a subset of the data,
whether it be small, like a top-ten listing, or large, like all results from the last year from a dataset
that contains records for 5 years. There are several possible applications of filtering patterns:


Simple filtering boils down to having a function that, given an input, returns
"keep" or "throw away". The mapper applies this function to every input element,
and accordingly the output will be a filtered subset of the input containing only
the data that is interesting to you.
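The keep-or-throw-away idea can be sketched in a few lines of Python. This is only an illustration: the `keep` predicate and the sample records are made up for the example, and a real Hadoop Streaming mapper would read its lines from standard input instead of a list.

```python
def keep(record):
    """Hypothetical predicate: keep only records that mention 'hadoop'."""
    return "hadoop" in record.lower()


def filter_mapper(lines, predicate=keep):
    """Yield every record that passes the predicate, otherwise unchanged."""
    for line in lines:
        record = line.rstrip("\n")
        if predicate(record):
            yield record


# Made-up input standing in for the lines a mapper would receive.
sample = ["Intro to Hadoop\n", "Unrelated post\n", "hadoop streaming tips\n"]
kept = list(filter_mapper(sample))
```

Note that no reducer is needed here: the records that pass are emitted exactly as they came in.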

Sampling is about pulling out a subset of the data for future processing. Typically, you'd use
sampling to extract a smaller but representative dataset on which you could then perform further
analysis. However, taking just the first 10,000 records from a massive dataset might not be the
best idea, because those records might not be representative. Instead, you could use Python's
random number generator to return just, say, 1% of the data by only passing a record through
the mapper if a random value between 1 and 100 equals 1.
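A minimal sketch of that 1% sampling rule, using Python's standard `random` module. The function name, the `keep_one_in` parameter and the generated records are assumptions made for this example.

```python
import random


def sample_mapper(lines, keep_one_in=100, rng=None):
    """Pass a record through only when a random integer in [1, keep_one_in] is 1."""
    rng = rng or random.Random()
    for line in lines:
        if rng.randint(1, keep_one_in) == 1:
            yield line


# Made-up records; a fixed seed makes the run repeatable.
lines = [f"record {i}" for i in range(100_000)]
sampled = list(sample_mapper(lines, rng=random.Random(0)))
```

The sample size is only approximately 1% of the input, because each record is kept or dropped independently of the others.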

Top 10: a very interesting case, since we all love to see top lists!

Sampling

Let's have some practice with filtering! This time we will work with a
dataset that is generated by our very own Udacity students: our
forums! We are interested in getting a subset of data from the
forum that contains only posts that are 1 sentence or less (so the post
contains only one period, "!" or "?" character, and that character is the last
significant character in the post).
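One possible reading of that rule, sketched as a Python predicate. Two assumptions are made here that the notes do not settle: a post with no terminator at all counts as "less than one sentence" and is kept, and trailing whitespace is not a significant character.

```python
TERMINATORS = ".!?"


def is_single_sentence(body):
    """Return True if the post body holds at most one sentence."""
    text = body.strip()  # assumption: surrounding whitespace is insignificant
    count = sum(text.count(ch) for ch in TERMINATORS)
    if count == 0:
        return True  # assumption: no terminator means "less than one sentence"
    return count == 1 and text[-1] in TERMINATORS
```

A mapper would apply this predicate to the body field of each forum record and emit the record unchanged when it returns True.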


Top 10

An interesting application of MapReduce is making top-N record lists. In an RDBMS you would
normally first sort the data, then take the top N records. In MapReduce this kind of approach will not
work, because the data is not sorted and is processed on several machines. Thus, the mappers
first have to find their own top-N lists, without sorting the data, and then send the local lists to
the reducers, which can then find the global top-N list.
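The two stages can be sketched with Python's `heapq.nlargest`, which picks the largest items without fully sorting its input. Ranking records by their length, the function names and the two input splits are assumptions made for this illustration.

```python
import heapq


def top_n_mapper(records, n):
    """Return this mapper's local top-n records (longest first)."""
    return heapq.nlargest(n, records, key=len)


def top_n_reducer(local_lists, n):
    """Merge the mappers' local top-n lists into the global top-n list."""
    candidates = [record for local in local_lists for record in local]
    return heapq.nlargest(n, candidates, key=len)


# Two made-up input splits, as if processed by two different mappers.
split_a = ["a", "bbbb", "cc"]
split_b = ["ddddd", "eee", "f"]

local_lists = [top_n_mapper(split_a, 2), top_n_mapper(split_b, 2)]
global_top = top_n_reducer(local_lists, 2)
```

Each mapper sends at most N records over the network, so the reducer only has to look at N times the number of mappers instead of the whole dataset.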


Let's find the top 5 longest posts in our forum!

Summarization Patterns

The next set of design patterns we will discuss are patterns that produce a
top-level, summarized view of your data. This is a very powerful approach to get a
bird's-eye view of your data, something that you cannot get by just manually
looking at parts of the data. You can group similar data together, for example by
day, by time of day, or by username, and then calculate a statistic of some value,
like the minimum, maximum or mean. You can also build an index, or just simply
count occurrences of something.

There are a couple of problems that can be solved using a similar approach:

- numerical summarizations: finding the count of a value, as well as the min,
  max, average, median and other numerical values for your dataset
- inverted index: important when building a search engine or full-text search
  functionality for your own website or application.

You can use built-in Hadoop functionality to perform some of these operations more
efficiently, and we will talk about that as well.

Inverted Index

It is often necessary to build a reverse index from a dataset to enable faster
searching. The obvious example would be a web search engine. You need to create a
mapping from keywords to web links, to enable faster finding of relevant
information. Think of it as the index of a book: you have a word, or term, and all
the pages where you can find this term.

While the pattern itself is simple (the mapper outputs each word as the key, with the
link as the value), you have to be conscious of the fact that this type of MapReduce
job is susceptible to uneven key distribution. For example, you can assume that the word
"the" will be found a lot more often in a text than other words. You have to think about
whether you really need to include such words in the index.
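A small sketch of the pattern, with a stop list as one way to keep very frequent words out of the index. The stop list, the function names and the two sample pages are invented for this example, and the reducer is simulated in memory rather than run on a cluster.

```python
from collections import defaultdict

# Hypothetical stop list; a real one would be chosen for the dataset at hand.
STOP_WORDS = {"the", "a", "an", "of", "and"}


def index_mapper(documents):
    """Emit a (word, link) pair for every distinct word of each (link, text) pair."""
    for link, text in documents:
        for word in set(text.lower().split()):
            if word not in STOP_WORDS:
                yield word, link


def index_reducer(pairs):
    """Collect, for every word, the sorted list of links that contain it."""
    index = defaultdict(set)
    for word, link in pairs:
        index[word].add(link)
    return {word: sorted(links) for word, links in index.items()}


docs = [("page1", "the cat sat"), ("page2", "the cat ran")]
index = index_reducer(index_mapper(docs))
```

Dropping "the" in the mapper means that key never reaches a reducer, which avoids one reducer receiving a disproportionate share of the data.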

Numerical summarizations

Common uses for this type of analysis are:

- word or record count (which you already used when analyzing web server log
  files). The mapper just outputs the thing that you are interested in as the key,
  and 1 as the value. The reducer can then just sum up the values. Another very
  popular example is word count, which might be the canonical "Hello, World" for
  MapReduce: count how many times a word appears in a dataset.
- min/max/count for a particular event, for example the first/last time a user
  posted on a forum, or the first/last time and how many times a particular item
  was bought in a shop.
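The reducer side of a min/max/count summarization might look like the sketch below. The (user, timestamp) pairs are made up, and the timestamps are ISO-formatted strings so that plain string comparison orders them correctly.

```python
def summarize(pairs):
    """For each key, track the count, smallest and largest value in one pass."""
    stats = {}
    for key, value in pairs:
        if key not in stats:
            stats[key] = {"count": 1, "min": value, "max": value}
        else:
            entry = stats[key]
            entry["count"] += 1
            entry["min"] = min(entry["min"], value)
            entry["max"] = max(entry["max"], value)
    return stats


# Hypothetical mapper output: (user, time of post).
posts = [
    ("alice", "2014-02-01T10:00"),
    ("bob", "2014-01-15T09:30"),
    ("alice", "2014-01-03T18:45"),
    ("alice", "2014-03-20T08:10"),
]
summary = summarize(posts)
```

Only three numbers are kept per key, so the reducer never has to hold all values for a key in memory.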

Mean and Standard Deviation


Let's say that you want to know if there is any correlation between the day of the week and how much
money people spend on items. Your task is to find the mean and standard deviation of sales per
day of the week.

The mapper only needs to output the day of the week as the key and the sale amount as the
value. The reducer has to do all the maths.

In general, if you need to find an average, median or standard deviation, it has to be
calculated in the reducer.
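A sketch of the maths the reducer has to do. It assumes the population standard deviation (dividing by the number of sales, not by one less) and, for simplicity, collects all amounts for a day in memory; the sample sales are invented.

```python
import math
from collections import defaultdict


def mean_and_std(pairs):
    """Return {day: (mean, population standard deviation)} for (day, amount) pairs."""
    groups = defaultdict(list)
    for day, amount in pairs:
        groups[day].append(amount)

    result = {}
    for day, amounts in groups.items():
        n = len(amounts)
        mean = sum(amounts) / n
        variance = sum((x - mean) ** 2 for x in amounts) / n
        result[day] = (mean, math.sqrt(variance))
    return result


# Hypothetical mapper output: (day of week, sale amount).
sales = [("Mon", a) for a in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0)]
sales += [("Tue", 10.0), ("Tue", 20.0)]
stats = mean_and_std(sales)
```

For the Monday values the mean is 5 and the squared deviations add up to 32, so the variance is 32 / 8 = 4 and the standard deviation is 2.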

Combiners

To calculate the mean in that last problem, what did you do? You may have done something like
this:
1. Your mapper may have gone through the records and output a key-value pair that looked
like: day of week, value.
2. For each day of the week, your reducer kept a running total of the value as well as a
count of the number of records.
3. You divided the total value by the number of records to get the mean.

Now, let's think about WHERE each of these steps took place.
1. On various machines throughout the network.
2. On ONE machine in the network.
3. Again, ONE machine.


But there's a problem here. That second step involves moving a lot of data around your network.
What if we could do some of the reduction locally before sending the data to the reducers?

We can! We can use combiners! Combiners will, in essence, do as much reduction as possible
locally before sending that data to the reducers.

This might save significant network traffic if you have a lot of records but far fewer
unique keys. You will need to add an additional option to the command-line script to use
this functionality, or use the full Java command. Please see the instructor comments for
detailed information.
When you run a job, you will see some output on the screen, which includes a tracking
URL:

Open it in a browser.
You will see a job page containing information about the job as it is being run.

Here are the comparison screenshots from 2 jobs, one run without a combiner and one
with a combiner:


As you can see, when using a combiner (second screenshot), the reducers get significantly fewer
records and have to shuffle fewer bytes than without a combiner: 4,138,476 records without a
combiner, 412 records with one. While this does not lead to time savings on a single-node
pseudo-distributed cluster run in a VM, like the one you have, in the real world it could save a
lot of network traffic.
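The same effect can be reproduced on a toy scale. The sketch below simulates two mappers in memory; the record counts and function names are made up and have nothing to do with the job in the screenshots.

```python
from collections import Counter


def combine(mapper_output):
    """Locally sum the values per key for one mapper's output."""
    totals = Counter()
    for key, value in mapper_output:
        totals[key] += value
    return list(totals.items())


def reduce_sum(records):
    """Sum the values per key across everything the reducer receives."""
    return dict(combine(records))


# Made-up output of two mappers: many records, very few unique keys.
mapper_a = [("Mon", 1)] * 1000 + [("Tue", 1)] * 500
mapper_b = [("Mon", 1)] * 200

without_combiner = mapper_a + mapper_b
with_combiner = combine(mapper_a) + combine(mapper_b)
```

Here 1,700 records shrink to 3 before they cross the network, and `reduce_sum` returns the same totals for both inputs.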

Generally, we want our combiners to do the same thing as our reducers. Also, keep in mind that
this intermediate step may introduce some complications when we try to do our computation.
You'll encounter that in the next exercise.
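One complication of this kind, offered here as an illustration and not necessarily the one the exercise has in mind, shows up with the mean: a mean of local means is generally not the overall mean. A common way around it is for the combiner to emit partial (sum, count) pairs. The function names and values below are made up.

```python
def combine_mean(values):
    """Combiner: reduce one mapper's values to a partial (sum, count) pair."""
    return (sum(values), len(values))


def reduce_mean(partials):
    """Reducer: merge the partial pairs and divide only once, at the end."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Made-up sale amounts for one key, split unevenly across two mappers.
mapper_a = [10, 20, 30]
mapper_b = [100]

true_mean = sum(mapper_a + mapper_b) / len(mapper_a + mapper_b)
mean_of_means = (sum(mapper_a) / len(mapper_a) + sum(mapper_b) / len(mapper_b)) / 2
combined_mean = reduce_mean([combine_mean(mapper_a), combine_mean(mapper_b)])
```

The true mean is 160 / 4 = 40, while averaging the two local means (20 and 100) gives 60. In this case the combiner cannot simply be a copy of the reducer, because its output has to stay mergeable.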

Calculating the Sum with Combiners

Try calculating the sum of values, like you did in lesson 3, but this time use a combiner.

Structured to Hierarchical Patterns

When migrating data from an RDBMS to a Hadoop system, one of the first things you
should consider doing is reformatting your data. Since Hadoop doesn't care what format
your data is in, you can take advantage of hierarchical data to avoid doing live joins of
several datasets at the time of analysis. If you know what kind of information you will
want to get later, you might save significant time by reformatting the data.

You can use this pattern if you have data sources that are linked by some set of foreign keys and
your data is structured and row-based.

Combine 2 datasets

You might want to know what the reputation of the author of a post is. Combine the post and user
tables to produce a simple dataset that has all the relevant information together or, in RDBMS
terms, is denormalized.

The post table will contain information like this:

The user table will have this:

We want to combine the tables so that each post contains information about the
reputation and badges of the author, so that we can run some analysis on the
content of a post in relation to the reputation of a user. For example, is there a
correlation between post length and user reputation?
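One way to combine the two tables is a reduce-side join: the mapper tags each record with its source and keys it by user id, and the reducer attaches the user's data to each of that user's posts. The sketch below simulates this in memory. The field layouts are hypothetical, since the table contents are not reproduced in these notes, and badges are left out to keep it short.

```python
from collections import defaultdict


def join_mapper(users, posts):
    """Emit (user_id, tagged record) pairs from both datasets."""
    for user_id, reputation in users:
        yield user_id, ("U", reputation)
    for post_id, author_id, body in posts:
        yield author_id, ("P", (post_id, body))


def join_reducer(pairs):
    """For each user id, attach the user's reputation to each of their posts."""
    grouped = defaultdict(list)
    for key, tagged in pairs:
        grouped[key].append(tagged)

    joined = []
    for user_id, records in grouped.items():
        reputation = next((value for tag, value in records if tag == "U"), None)
        for tag, value in records:
            if tag == "P":
                post_id, body = value
                joined.append((post_id, user_id, reputation, body))
    return sorted(joined)


# Made-up rows: users are (user_id, reputation), posts are (post_id, author_id, body).
users = [("u1", 120), ("u2", 5)]
posts = [("p1", "u1", "Hello"), ("p2", "u2", "Hi there"), ("p3", "u1", "Thanks")]
denormalized = join_reducer(join_mapper(users, posts))
```

With the reputation stored next to each post, a question such as the one about post length and reputation can be answered from a single dataset, without a join at analysis time.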

Conclusion

We've covered several patterns. What good are they?

Well, you'll have to practice with them to have them become second nature, but for now you can
ask yourself when solving a problem: Does this problem fit into one of the patterns I learned? Is
it a filtering problem? A summarization problem? A hierarchical data problem?


These are not all the patterns you may want to use. And getting yourself to ask this question is
not simple. In fact, just identifying when MapReduce is the appropriate tool for a problem is a
large part of what it means to be a MapReduce expert.


If you want to gain this expertise, you'll have to practice. This may happen naturally in a job
environment, or you could make it happen deliberately by seeking out problems and trying to
solve them with this framework.

If you enjoy learning about these patterns, you can find more in the book.

Nice work and congratulations. Now, on to the final project! Good luck.
