Lesson 4 Notes PDF

Intro to Hadoop and MapReduce
Lesson 4 Notes
Designing With Patterns

My name is Andy, and I'm going to be presenting this lesson, where we are going to
learn about design patterns in MapReduce. The patterns that we will cover are
templates for solving common problems with MapReduce. These patterns achieve
a good balance between flexibility and rigidity: they are general enough that they
can be adapted to solve lots of problems, but specific enough that they don't require
too much effort to use. The goal here is to give you familiarity (but not
necessarily mastery) with some of these patterns. We won't discuss where
these patterns came from or why they were chosen. We are just going to
introduce them. Note: these patterns come from the book MapReduce Design Patterns
by Donald Miner & Adam Shook.
What we will Cover
Not every problem can be solved with MapReduce. In fact, problems have to be manipulated to
fit into this framework of mapping and reducing. But luckily, programmers have developed a
sophisticated arsenal of patterns to make some potentially unexpected problems
MapReducable. In this lesson, we'll introduce the following patterns.
Copyright 2014 Udacity, Inc. All Rights Reserved.

Filtering Patterns: don't change records. Only get a part of the data!
  Examples: sampling, random sampling, top 10
Summarization Patterns: give a top-level view of the data.
  Examples: counting, minimum, maximum, mean, median, standard deviation,
  inverted index
Take a break to discuss combiners.
Structured to Hierarchical Patterns:
  Examples: combining 2 data sets.
This lesson will go fast. You probably won't remember everything about these patterns, but that's
okay. You'll know that they exist and you'll be able to consult them when you find yourself facing
a problem that you think could be molded to fit one of these patterns.
Filtering Patterns
These patterns have one thing in common: they don't change the actual records. Records that
pass the filter are output exactly the way they came in. These patterns all find a subset of data,
whether it be small, like a top ten listing, or large, like all results from the last year from a data set
that contains records for 5 years. There are several possible applications of filtering patterns:


Simple filtering boils down to having a function that,
given an input, returns "keep" or "throw away". The mapper
applies this function to every input element, so
the output is a filtered subset of the
input containing only the data that is interesting to you.
Sampling is about pulling out a subset of the data for future processing. Typically, you'd use
sampling to extract a smaller but representative data set on which you could then perform further
analysis. However, taking just the first 10,000 records from a massive data set might not be the
best idea because it might not be representative.
Instead, you could use Python's random number
generator to return just, say, 1% of the data by only
passing a record through the mapper if a random value
between 1 and 100 equalled 1.
Top 10: a very interesting case; we all love to see top
lists!
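The first two applications above can be sketched in a few lines of Python. This is only a minimal illustration, not the course's reference solution: the names `keep_record`, `filter_mapper` and `sample_mapper` are made up for this example, and the "keep" rule (lines mentioning "hadoop") is an arbitrary assumption. The sampling function generalizes the "random value between 1 and 100" idea to any percentage.

```python
import random


def keep_record(line):
    """Hypothetical keep/throw-away rule: keep lines that mention 'hadoop'."""
    return "hadoop" in line.lower()


def filter_mapper(lines, predicate):
    """Simple filtering: emit each record that passes the predicate, unchanged."""
    for line in lines:
        if predicate(line):
            yield line


def sample_mapper(lines, percent=1, rng=random):
    """Random sampling: emit roughly `percent`% of the records.

    A record passes when a random integer between 1 and 100 (inclusive)
    is less than or equal to `percent`, so percent=1 keeps about 1%.
    """
    for line in lines:
        if rng.randint(1, 100) <= percent:
            yield line
```

In a Hadoop Streaming job, `lines` would typically be `sys.stdin` and each yielded record would be printed; no reducer logic is needed because the records are not changed.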
Sampling

Let's have some practice with filtering! This time we will work with a
data set that is generated by our very own Udacity students: our
forums! We are interested in getting a subset of data from the
forum that contains only posts that are 1 sentence or less (so, the post
contains only one period/!/? character, and that character is the last
significant character in the post).
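One possible reading of this rule, sketched as a Python predicate. The function name `is_short_post` is invented for illustration, and two points are assumptions rather than part of the exercise statement: a post with no terminator at all is treated as "less than one sentence" and kept, and "last significant character" is taken to mean the last non-whitespace character. The layout of the forum data is not described here, so the function works on the post body alone.

```python
import re

# Characters that can end a sentence according to the exercise text.
TERMINATORS = re.compile(r"[.!?]")


def is_short_post(body):
    """Return True if the post body is one sentence or less."""
    text = body.strip()
    marks = TERMINATORS.findall(text)
    if not marks:
        # Assumption: no terminator at all counts as "less than one sentence".
        return True
    # Exactly one terminator, and it is the last non-whitespace character.
    return len(marks) == 1 and text[-1] in ".!?"
```

A mapper would apply this predicate to the body field of each record and output the whole record unchanged when it returns True.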

Top 10
An interesting application of MapReduce is making top-N record lists. In an RDBMS you would
normally first sort the data, then take the top N records. In MapReduce this kind of approach will not
work, because the data is not sorted and is processed on several machines. Thus, the mappers
first have to find their own top-N lists, without sorting the data, and then send the local lists to
the reducers, which can then find the global top-N list.
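A minimal sketch of the two stages, assuming records are plain Python values and that a `key` function gives the score to rank by. The function names are hypothetical, and using a heap is one choice among several for keeping a local top N without sorting everything.

```python
import heapq


def local_top_n(records, n, key):
    """Mapper side: keep only the n highest-scoring records seen so far."""
    heap = []  # min-heap of (score, arrival index, record)
    for index, record in enumerate(records):
        item = (key(record), index, record)
        if len(heap) < n:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # The new record beats the current smallest, so swap it in.
            heapq.heapreplace(heap, item)
    return [record for _, _, record in heap]


def global_top_n(local_lists, n, key):
    """Reducer side: merge the local lists and keep the overall n largest."""
    merged = [record for local in local_lists for record in local]
    return heapq.nlargest(n, merged, key=key)
```

Each mapper sends at most N records over the network, so the reducer only has to rank N records per mapper instead of the whole data set.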

Let's find the top 5 longest posts in our forum!
Summarization Patterns
The next set of design patterns we will discuss are
patterns that produce a top-level, summarized view of
your data. This is a very powerful approach for getting a bird's-eye
view of your data, something that you cannot get by
just manually looking at parts of the data. You can group
similar data together, for example by day, by time of
day, or by user name, and then calculate a statistic of
some value, like the min, max, or mean. You can also
build an index, or just simply count occurrences of
something.
There are a couple of problems that can be solved using a
similar approach:
- numerical summarizations: finding the count of a value,
  as well as the min, max, average, median, and other
  numerical values for your data set
- inverted index: important when building a search
  engine or full-text search functionality for your own
  website or application.
You can use built-in Hadoop functionality to perform some
of these operations more efficiently, and we talk about that as well.
Inverted Index
It is often necessary to build a reverse index from a
data set to enable faster searching. The
obvious example is a web search
engine: you need to create a mapping from
keywords to web links so that relevant information
can be found faster. Think of it as the index
of a book: you have a word, or term, and all the
pages on which you can find this term.
While the pattern itself is simple (the mapper
outputs each word as the key, with the link as the
value), you have to be conscious of the fact that
this type of MapReduce job is susceptible to uneven key distribution. For example, you can assume
that the word "the" will be found far more often in a text than other words. You have to consider whether you
really need to include such words in the index.
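A small sketch of the pattern in plain Python. Everything named here is illustrative: `index_mapper`, `index_reducer`, the tokenizing regular expression, and the tiny `STOP_WORDS` set are assumptions for this example, and a real job would choose its own list of words to skip.

```python
import re
from collections import defaultdict

# Hypothetical list of very common words left out of the index.
STOP_WORDS = {"the", "a", "an", "and", "of"}


def index_mapper(doc_id, text):
    """Emit (word, doc_id) for every indexed word in the document."""
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        if word not in STOP_WORDS:
            yield word, doc_id


def index_reducer(pairs):
    """Collect, for each word, the sorted list of documents containing it."""
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}
```

Dropping stop words in the mapper is one way to avoid sending a huge number of values for a single key such as "the" to one reducer.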
Numerical summarizations
Common uses for this type of analysis are:
- word or record count (which you already used when
  analyzing web server log files). The mapper just outputs the
  thing that you are interested in as the key, and 1 as the value.
  The reducer can then just sum up the values. Word count is a very
  popular example, and might be the
  canonical "Hello, World" for MapReduce: count how
  many times a word appears in a data set.
- min/max/count for a particular event, for
  example the first/last time a user posted on a
  forum, or the first/last time, and how many times, a
  particular item was bought in a shop.
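Both uses can be sketched as follows. The function names and the sample field layout (a key paired with a comparable value such as an ISO date string) are assumptions made for this illustration only.

```python
from collections import defaultdict


def count_mapper(items):
    """Emit (item, 1) for every item of interest."""
    for item in items:
        yield item, 1


def count_reducer(pairs):
    """Sum the 1s for each key to get its count."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def min_max_count_reducer(pairs):
    """For each key, track the smallest value, largest value and count."""
    summary = {}
    for key, value in pairs:
        if key in summary:
            lowest, highest, count = summary[key]
            summary[key] = (min(lowest, value), max(highest, value), count + 1)
        else:
            summary[key] = (value, value, 1)
    return summary
```

For the forum example, the key could be a user name and the value a post timestamp, which gives the first post, the last post and the number of posts per user.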

Mean and Standard Deviation

Let's say that you want to know if there is any correlation between the day of the week and how much
money people spend on items. Your task is to find the mean and standard deviation of sales per
day of the week.
The mapper only needs to output the day of the week as the key and the sale amount as the
value. The reducer has to do all the maths.
In general, if you need to find an average, median, or standard deviation, these have to be
calculated in the reducer.
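A minimal sketch of this division of labour, assuming each input record has already been parsed into a (weekday, amount) pair. The function names are hypothetical. The notes do not say whether the population or the sample standard deviation is wanted; this sketch assumes the population version (`statistics.pstdev`).

```python
import statistics
from collections import defaultdict


def sales_mapper(records):
    """Emit the weekday as the key and the sale amount as the value."""
    for weekday, amount in records:
        yield weekday, float(amount)


def mean_std_reducer(pairs):
    """Group amounts by weekday, then compute mean and standard deviation."""
    grouped = defaultdict(list)
    for weekday, amount in pairs:
        grouped[weekday].append(amount)
    return {
        day: (statistics.mean(amounts), statistics.pstdev(amounts))
        for day, amounts in grouped.items()
    }
```

Note that this reducer keeps every amount for a weekday in memory, which is the simplest approach but not the only one.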
Combiners
To calculate the mean in that last problem, what did you do? You may have done something like
this:
1. Your mapper may have gone through the records and output a key-value pair that looked
like: day of week, value.
2. For each day of the week, your reducer kept a running total of the value as well as a
count of the number of records.
3. You divided the total value by the number of records to get the mean.
Now, let's think about WHERE each of these steps took place.
1. On various machines throughout the network.
2. On ONE machine in the network.
3. Again, ONE machine.


But there's a problem here. That second step involves moving a lot of data around your network.
What if we could do some of the reduction locally before sending the data to the reducers?
We can! We can use combiners! Combiners will, in essence, do as much reduction as possible
locally before sending that data to the reducers.
This might save significant network traffic if you have a lot of records but far fewer
unique keys. You will need to add an additional command to the command-line script to use
this functionality, or use the full Java command. Please see the instructor comments for
detailed information.
When you run a job, you will see some output on the screen, which includes a tracking
URL:
Open it in a browser.
You will see a job page containing information about the job as it is being run.
Here are the comparison screenshots from 2 jobs, one run without a combiner and one
with a combiner:

As you can see, when using a combiner (second screenshot), the reducers get significantly fewer
records and have to shuffle fewer bytes than without a combiner: 4,138,476
records without a combiner versus 412 records with one. While this does not lead to time savings on a single-node
pseudo-distributed cluster run in a VM, like the one you have, in the real world it could save a lot of network
traffic.
Generally, we want our combiners to do the same thing as our reducers. Also, keep in mind that
this intermediate step may introduce some complications when we try to do our computation.
You'll encounter that in the next exercise.
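One such complication shows up with the mean. For a sum, the combiner can do exactly what the reducer does, because a sum of partial sums is still the right total. A mean of partial means is generally not the right mean, so a combiner for that problem has to pass along something else. The sketch below, with hypothetical function names, has each combiner emit a (sum, count) pair per key and lets the reducer divide only once at the end.

```python
def combine(pairs):
    """Combiner: collapse one mapper's (key, amount) pairs into (sum, count)."""
    partial = {}
    for key, amount in pairs:
        total, count = partial.get(key, (0.0, 0))
        partial[key] = (total + amount, count + 1)
    return partial


def reduce_means(partials):
    """Reducer: merge the (sum, count) partials, then divide once per key."""
    merged = {}
    for partial in partials:
        for key, (total, count) in partial.items():
            merged_total, merged_count = merged.get(key, (0.0, 0))
            merged[key] = (merged_total + total, merged_count + count)
    return {key: total / count for key, (total, count) in merged.items()}
```

For example, if one mapper sees Monday sales of 10, 20 and 30 and another sees a single Monday sale of 100, averaging the two local means (20 and 100) gives 60, while merging the partial sums and counts gives 160 / 4 = 40, which is the true mean.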
Calculating the Sum with Combiners
Try calculating the sum of values, like you did in lesson 3, but this time use a combiner.
Structured to Hierarchical Patterns
When migrating data from an RDBMS to
a Hadoop system, one of the first things
you should consider doing is
reformatting your data. Since Hadoop
doesn't care what format your data is in,
you can take advantage of hierarchical
data to avoid doing live joins of several
data sets at the time of analysis. If you
know what kind of information you will
want to get later, you might save significant time by reformatting the data.
You can use this pattern if you have data sources that are linked by some set of foreign keys and
your data is structured and row-based.
Combine 2 datasets
You might want to know what the reputation of the author of a post is. Combine the post and user
tables to produce a simple data set that has all the relevant information together, or in RDBMS
terms, is denormalized.
The post table will contain information like this:
The user table will have this:
We want to combine the tables so that
each post contains information about the
reputation and badges of its author, so
that we can run some analysis on the
content of a post in relation to the reputation
of a user. For example, is there a
correlation between post length and user
reputation?
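One common way to combine two tables in MapReduce is to have the mapper tag each record with the table it came from and key it by the shared user id, so that a reducer sees a user and all of that user's posts together. The sketch below assumes records are dictionaries; the field names `user_id`, `author_id` and `reputation` are hypothetical, since the actual table layouts are not reproduced in these notes.

```python
from collections import defaultdict


def join_mapper(users, posts):
    """Tag each record with its source table and key it by user id."""
    for user in users:
        yield user["user_id"], ("user", user)
    for post in posts:
        yield post["author_id"], ("post", post)


def join_reducer(pairs):
    """Attach the author's reputation to each of that author's posts."""
    grouped = defaultdict(lambda: {"user": None, "posts": []})
    for user_id, (source, record) in pairs:
        if source == "user":
            grouped[user_id]["user"] = record
        else:
            grouped[user_id]["posts"].append(record)
    for group in grouped.values():
        user = group["user"]
        for post in group["posts"]:
            # Posts whose author is missing from the user table get None.
            reputation = user["reputation"] if user else None
            yield {**post, "reputation": reputation}
```

The output is one denormalized record per post, ready for questions such as whether post length correlates with reputation, with no join needed at analysis time.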
Conclusion
We've covered several patterns. What good are they?
Well, you'll have to practice with them to have them become second nature, but for now you can
ask yourself when solving a problem: Does this problem fit into one of the patterns I learned? Is
it a filtering problem? A summarization problem? A hierarchical data problem?

These are not all the patterns you may want to use. And getting yourself to ask this question is
not simple. In fact, just identifying when MapReduce is the appropriate tool for a problem is a
large part of what it means to be a MapReduce expert.

If you want to gain this expertise, you'll have to practice. This may happen naturally in a job
environment, or you could make it happen deliberately by seeking out problems and trying to
solve them with this framework.
If you enjoy learning about these patterns, you can find more in the book.
Nice work and congratulations. Now, on to the final project! Good luck.
