
Pipeline Pattern

Problem
Suppose that the overall computation involves performing a calculation on many sets of data, where the calculation can be viewed in terms of data flowing through a sequence of stages. How can the potential concurrency be exploited?

Context
An assembly line is a good analogy for this pattern. Suppose we want to manufacture a number of cars. The manufacturing process can be broken down into a sequence of operations, each of which adds some component, say the engine or the windshield, to the car. An assembly line (pipeline) assigns a component to each worker. As each car moves down the assembly line, each worker installs the same component over and over on a succession of cars. After the pipeline is full (and until it starts to empty) the workers can all be busy simultaneously, all performing their operations on the cars that are currently at their stations.

Examples of pipelines are found at many levels of granularity in computer systems, including the CPU hardware itself.

- Instruction pipeline in modern CPUs. The stages (fetch instruction, decode, execute, memory stage, write back, etc.) are done in a pipelined fashion; while one instruction is being decoded, its predecessor is being executed and its successor is being fetched.
- Vector processing (loop-level pipelining). Specialized hardware in some supercomputers allows operations on vectors to be performed in a pipelined fashion. Typically, a compiler is expected to recognize that a loop such as

    for (i=0; i<N; i++) { a[i] = b[i] + c[i]; }

  in assembly:

    li     vlr, 64      # vector length is 64
    lv     v1, r1       # load vector b from memory
    lv     v2, r2       # load vector c from memory
    addv.d v3, v1, v2   # add vector a = b + c
    sv     v3, r3       # store vector a to memory

  can be vectorized in a way that the special hardware can exploit. After a short startup, one a[i] value will be generated each clock cycle.

- Algorithm-level pipelining. Many algorithms can be formulated as recurrence relations and implemented using a pipeline or its higher-dimensional generalization, a systolic array. Such implementations often exploit specialized hardware for performance reasons.
- Signal processing. Passing a stream of real-time sensor data through a sequence of filters can be modeled as a pipeline, with each filter corresponding to a stage in the pipeline.
- Graphics. Processing a sequence of images by applying the same sequence of operations to each image can be modeled as a pipeline, with each operation corresponding to a pipeline stage. Some stages may be implemented by specialized hardware.
- Shell programs in UNIX. For example, the shell command

    cat file | grep word | wc -l

  creates a three-stage pipeline, with one process for each command (cat, grep, and wc).

These examples and the assembly-line analogy have several aspects in common. All involve applying a sequence of operations (in the assembly-line case it is installing the engine, installing the windshield, etc.) to each element in a sequence of data elements (in the assembly line, the cars). Although there may be ordering constraints on the operations on a single data element (for example, it might be necessary to install the engine before installing the hood), it is possible to perform different operations on different data elements simultaneously (for example, one can install the engine on one car while installing the hood on another).

The possibility of simultaneously performing different operations on different data elements is the potential concurrency this pattern exploits. Each task consists of repeatedly applying an operation to a data element (analogous to an assembly-line worker installing a component), and the dependencies among tasks are ordering constraints enforcing the order in which operations must be performed on each data element (analogous to installing the engine before the hood).
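For comparison, the same three-stage shell pipeline can be driven from Python's subprocess module, connecting each stage's stdout to the next stage's stdin exactly as the shell does. This is a minimal sketch; the sample file contents and search word are made up for illustration.

```python
# Equivalent of: cat <path> | grep <word> | wc -l, built by chaining
# three processes through OS pipes, one process per pipeline stage.
import subprocess
import tempfile

def count_matching_lines(path, word):
    cat = subprocess.Popen(["cat", path], stdout=subprocess.PIPE)
    grep = subprocess.Popen(["grep", word], stdin=cat.stdout,
                            stdout=subprocess.PIPE)
    wc = subprocess.Popen(["wc", "-l"], stdin=grep.stdout,
                          stdout=subprocess.PIPE)
    cat.stdout.close()    # let cat receive SIGPIPE if grep exits early
    grep.stdout.close()
    out, _ = wc.communicate()
    return int(out.strip())

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("pipeline stage\nno match here\nanother pipeline line\n")
    demo_path = f.name

print(count_matching_lines(demo_path, "pipeline"))   # -> 2
```

As in the shell, the three processes run concurrently: cat can read ahead while grep filters and wc counts.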

Forces
Universal forces
o Deep pipeline or short pipeline (throughput or latency). There is a design decision when using the Pipeline pattern. If throughput is important, the pipeline should be deep; if latency is important, the pipeline should be short. However, as the pipeline gets shorter, the amount of available parallelism decreases.

  In many situations where the Pipeline pattern is used, the performance measure of interest is the throughput: the number of data items per time unit that can be processed after the pipeline is already full. For example, if the output of the pipeline is a sequence of rendered images to be viewed as an animation, then the pipeline must have sufficient throughput (number of items processed per time unit) to generate the images at the required frame rate.

  In another situation, the input might be generated from real-time sampling of sensor data. In this case, there might be constraints on both the throughput (the pipeline should be able to handle all the data as it comes in without backing up the input queue and possibly losing data) and the latency (the amount of time between the generation of an input and the completion of processing of that input). In this case, it might be desirable to minimize latency subject to a constraint that the throughput is sufficient to handle the incoming data.

Implementation forces
o Multiprocessors on one node or several nodes on a cluster. You might end up using one node that has several cores, or several nodes on a cluster. If you use several nodes, communication is normally more expensive than between cores on one node, so the amount of work each compute node performs between communications should be larger.
o Special-purpose hardware or general-purpose hardware. If you use special-purpose hardware to exploit parallelism, there might be restrictions later when modifying the pipeline. However, if you use general-purpose hardware, the peak performance you achieve may not be as competitive as with special-purpose hardware such as vector machines.
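The deep-versus-short force can be made concrete with a toy cost model (not from the original pattern text): total per-item work T is split evenly over S stages, and handing an item between stages costs overhead c. Deeper pipelines raise throughput, but once overhead is included they also raise latency.

```python
# Toy model of the throughput/latency tradeoff: per-item work T split
# over S stages, per-stage hand-off overhead c. Illustrative numbers.

def stage_time(T, S, c):
    return T / S + c                   # time an item spends in one stage

def latency(T, S, c):
    return S * stage_time(T, S, c)     # time for one item to traverse

def throughput(T, S, c):
    # once the pipeline is full, one item completes per stage-time
    return 1.0 / stage_time(T, S, c)

T, c = 8.0, 0.5
for S in (1, 2, 4, 8):
    print(S, latency(T, S, c), round(throughput(T, S, c), 3))
```

With these numbers, going from S=1 to S=8 raises throughput from about 0.12 to about 0.67 items per time unit, but latency grows from 8.5 to 12.0 because each extra stage adds a hand-off.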

Solution
The key idea of this pattern is captured by the assembly-line analogy, namely that the potential concurrency can be exploited by assigning each operation (stage of the pipeline) to a different worker and having them work simultaneously, with the data elements passing from one worker to the next as operations are completed. In parallel-programming terms, the idea is to assign each task (stage of the pipeline) to a UE and provide a mechanism whereby each stage of the pipeline can send data elements to the next stage. This strategy is probably the most straightforward way to deal with this type of ordering constraint. It allows the application to take advantage of special-purpose hardware by appropriate mapping of pipeline stages to PEs and provides a reasonable mechanism for handling errors, described later. It also is likely to yield a modular design that can later be extended or modified.

[Figure: timing diagram of a four-stage pipeline. Time runs along one axis and the stages (Stage 1 through Stage 4) along the other; data elements C1 through C4 move down one stage per time step.]

Before going further, it may help to illustrate how the pipeline is supposed to operate. Let Ci represent a multistep computation on data element i. Ci(j) is the jth step of the computation. The idea is to map computation steps to pipeline stages so that each stage of the pipeline computes one step. Initially, the first stage of the pipeline performs C1(1). After that completes, the second stage of the pipeline receives the first data item and computes C1(2) while the first stage computes the first step of the second item, C2(1). Next, the third stage computes C1(3), while the second stage computes C2(2) and the first stage C3(1). The figure above illustrates how this works for a pipeline consisting of four stages. Notice that concurrency is initially limited and some resources remain idle until all the stages are occupied with useful work. This is referred to as filling the pipeline. At the end of the computation (draining the pipeline), again there is limited concurrency and idle resources as the final item works its way through the pipeline. We want the time spent filling or draining the pipeline to be small compared to the total time of the computation. This will be the case if the number of stages is small compared to the number of items to be processed. Notice also that overall throughput/efficiency is maximized if the time taken to process a data element is roughly the same for each stage.

This idea can be extended to include situations more general than a completely linear pipeline. For example, the figure below illustrates two pipelines, each with four stages. In the second pipeline, the third stage consists of two operations that can be performed concurrently.
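The fill/drain schedule just described is easy to generate programmatically. The sketch below uses the Ci(j) notation from the text: at time step t, stage s works on step s of item i = t - s + 1, whenever that item exists.

```python
# Generate the pipeline schedule: which (stage, item) pairs are active
# at each time step. Shows the fill phase, full pipeline, and drain.

def schedule(n_items, n_stages):
    """For each time step, list the (stage, item) pairs that are busy."""
    steps = []
    for t in range(1, n_items + n_stages):       # N + S - 1 steps in total
        active = []
        for s in range(1, n_stages + 1):
            i = t - s + 1                        # item at stage s at time t
            if 1 <= i <= n_items:
                active.append((s, i))
        steps.append(active)
    return steps

for t, active in enumerate(schedule(4, 3), start=1):
    print(t, active)
```

For 4 items and 3 stages this prints 6 time steps: only one pair is active at t=1 (filling), all three stages are busy at t=3 and t=4, and only stage 3 is active at t=6 (draining), matching the figure's behavior.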

[Figure: a linear pipeline (Stage 1 -> Stage 2 -> Stage 3 -> Stage 4) and a nonlinear pipeline in which the third stage consists of two concurrent Stage 3 operations between Stage 2 and Stage 4.]

Define the stages of the pipeline. Normally each pipeline stage will correspond to one task. The figure below shows the basic structure of each stage.

    initialize
    while (more data) {
        receive data element from previous stage
        perform operation on data element
        send data element to next stage
    }
    finalize

If the number of data elements to be processed is known in advance, then each stage can count the number of elements and terminate when these have been processed. Alternatively, a sentinel indicating termination may be sent through the pipeline.

It is worthwhile to consider at this point some factors that affect performance.

- The amount of concurrency in a full pipeline is limited by the number of stages. Thus, a larger number of stages allows more concurrency. However, the data sequence must be transferred between the stages, introducing overhead to the calculation. Thus, we need to organize the computation into stages such that the work done by a stage is large compared to the communication overhead. What is large enough is highly dependent on the particular architecture. Specialized hardware (such as vector processors) allows very fine-grained parallelism.
- The pattern works better if the operations performed by the various stages of the pipeline are all about equally computationally intensive. If the stages in the pipeline vary widely in computational effort, the slowest stage creates a bottleneck for the aggregate throughput.

- The pattern works better if the time required to fill and drain the pipeline is small compared to the overall running time. This time is influenced by the number of stages (more stages means more fill/drain time).
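The per-stage loop and sentinel termination described above map naturally onto threads connected by queues. The following is a minimal sketch, one thread per stage; names such as run_pipeline are illustrative, not from the pattern text.

```python
# One thread per pipeline stage; queue.Queue is the channel between
# stages. A sentinel object flushed through the pipeline tells each
# stage, in order, to terminate.
import queue
import threading

SENTINEL = object()

def stage(operation, inbox, outbox):
    while True:
        item = inbox.get()              # receive from previous stage
        if item is SENTINEL:
            outbox.put(SENTINEL)        # pass termination downstream
            return
        outbox.put(operation(item))     # operate, send to next stage

def run_pipeline(operations, items):
    channels = [queue.Queue() for _ in range(len(operations) + 1)]
    threads = [threading.Thread(target=stage,
                                args=(op, channels[i], channels[i + 1]))
               for i, op in enumerate(operations)]
    for t in threads:
        t.start()
    for item in items:
        channels[0].put(item)
    channels[0].put(SENTINEL)
    results = []
    while True:
        out = channels[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results

print(run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]))  # [4, 6, 8]
```

Because each channel is a FIFO queue and each stage is a single thread, output order matches input order, and the queues provide exactly the buffering between unsynchronized stages that the Solution section calls for.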

Therefore, it is worthwhile to consider whether the original decomposition into tasks should be revisited at this point, possibly combining lightly loaded adjacent pipeline stages into a single stage, or decomposing a heavily loaded stage into multiple stages.

It may also be worthwhile to parallelize a heavily loaded stage using one of the other Parallel Algorithm Strategy patterns. For example, if the pipeline is processing a sequence of images, it is often the case that each stage can be parallelized using the Task Parallelism pattern.

Structure the computation. We also need a way to structure the overall computation. One possibility is to use the SPMD pattern and use each UE's ID to select an option in a case or switch statement, with each case corresponding to a stage of the pipeline.

To increase modularity, object-oriented frameworks can be developed that allow stages to be represented by objects or procedures that can easily be plugged into the pipeline. Such frameworks are not difficult to construct using standard OOP techniques, and several are available as commercial or freely available products.

Represent the dataflow among pipeline elements. How data flow between pipeline elements is represented depends on the target platform.

In a message-passing environment, the most natural approach is to assign one process to each operation (stage of the pipeline) and implement each connection between successive stages of the pipeline as a sequence of messages between the corresponding processes. Because the stages are hardly ever perfectly synchronized, and the amount of work carried out at different stages almost always varies, this flow of data between pipeline stages must usually be both buffered and ordered. Most message-passing environments (e.g., MPI) make this easy to do. If the cost of sending individual messages is high, it may be worthwhile to consider sending multiple data elements in each message; this reduces total communication cost at the expense of increasing the time needed to fill the pipeline.
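The SPMD structure mentioned above, where every UE runs the same program body and its ID selects the stage in a switch, can be sketched with threads standing in for UEs. All names here are illustrative.

```python
# SPMD-style structure: every UE executes the same function; its ID
# indexes into stage_funcs, playing the role of the case/switch
# statement that selects which pipeline stage this UE runs.
import queue
import threading

DONE = object()

def spmd_body(ue_id, channels, stage_funcs):
    """Identical code on every UE; ue_id picks the stage (the 'switch')."""
    op = stage_funcs[ue_id]                   # case selected by UE ID
    inbox, outbox = channels[ue_id], channels[ue_id + 1]
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(op(item))

stage_funcs = [lambda x: x - 3, lambda x: x * x]   # stage 0, stage 1
channels = [queue.Queue() for _ in range(len(stage_funcs) + 1)]
ues = [threading.Thread(target=spmd_body, args=(i, channels, stage_funcs))
       for i in range(len(stage_funcs))]
for ue in ues:
    ue.start()
for x in [5, 6]:
    channels[0].put(x)
channels[0].put(DONE)

results = []
while (out := channels[-1].get()) is not DONE:
    results.append(out)
for ue in ues:
    ue.join()
print(results)   # [(5-3)^2, (6-3)^2] -> [4, 9]
```

The point of the SPMD framing is that a single program is launched on every UE; only the ID-indexed dispatch differs, which keeps the launch and deployment logic uniform.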
If a message-passing programming environment is not a good fit with the target platform, the stages of the pipeline can be connected explicitly with buffered channels. Such a buffered channel can be implemented as a queue shared between the sending and receiving tasks, using the Shared Queue pattern.

If the individual stages are themselves implemented as parallel programs, then more sophisticated approaches may be called for, especially if some sort of data redistribution needs to be performed between the stages. This might be the case if, for example, the data needs to be partitioned along a different dimension or partitioned into a different number of subsets in the same dimension. For example,

an application might include one stage in which each data element is partitioned into three subsets and another stage in which it is partitioned into four subsets. Simple ways to handle such situations are to aggregate and disaggregate data elements between stages. One approach would be to have only one task in each stage communicate with tasks in other stages; this task would then be responsible for interacting with the other tasks in its stage to distribute input data elements and collect output data elements. Another approach would be to introduce additional pipeline stages to perform aggregation/disaggregation operations. Either of these approaches, however, involves a fair amount of communication. It may be preferable to have the earlier stage know about the needs of its successor and communicate with each task receiving part of its data directly, rather than aggregating the data at one stage and then disaggregating it at the next. This approach improves performance at the cost of reduced simplicity, modularity, and flexibility.

Less traditionally, networked file systems have been used for communication between stages in a pipeline running in a workstation cluster. The data is written to a file by one stage and read from the file by its successor. Network file systems are usually mature and fairly well optimized, and they provide for the visibility of the file at all PEs as well as mechanisms for concurrency control. Higher-level abstractions such as tuple spaces and blackboards implemented over networked file systems can also be used. File-system-based solutions are appropriate in large-grained applications in which the time needed to process the data at each stage is large compared with the time to access the file system.

Handling errors. For some applications, it might be necessary to gracefully handle error conditions. One solution is to create a separate task to handle errors. Each stage of the regular pipeline sends to this task any data elements it cannot process, along with error information, and then continues with the next item in the pipeline. The error task deals with the faulty data elements appropriately.
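The separate error-handling task just described can be sketched as follows: a dedicated error queue, served by its own task, collects faulty elements while the regular stage simply moves on to the next item. All names are illustrative.

```python
# Each regular stage routes items it cannot process to an error queue
# (served by a separate error task) and then continues with the next
# item, so one bad element never stalls the pipeline.
import queue
import threading

END = object()

def checked_stage(operation, inbox, outbox, error_queue):
    while True:
        item = inbox.get()
        if item is END:
            outbox.put(END)
            error_queue.put(END)        # tell the error task to stop too
            return
        try:
            outbox.put(operation(item))
        except Exception as exc:
            error_queue.put((item, str(exc)))   # hand off, keep going

def error_task(error_queue, log):
    while True:
        entry = error_queue.get()
        if entry is END:
            return
        log.append(entry)               # "deal with" the faulty element

inbox, outbox, errors = queue.Queue(), queue.Queue(), queue.Queue()
error_log = []
workers = [
    threading.Thread(target=checked_stage,
                     args=(lambda x: 10 // x, inbox, outbox, errors)),
    threading.Thread(target=error_task, args=(errors, error_log)),
]
for w in workers:
    w.start()
for x in [5, 0, 2]:                     # 0 cannot be processed (div by zero)
    inbox.put(x)
inbox.put(END)

results = []
while (out := outbox.get()) is not END:
    results.append(out)
for w in workers:
    w.join()
print(results)     # [2, 5] -- the bad item was skipped, not fatal
print(error_log)   # one logged failure for item 0
```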
Processor allocation and task scheduling. The simplest approach is to allocate one PE to each stage of the pipeline. This gives good load balance if the PEs are similar and the amount of work needed to process a data element is roughly the same for each stage. If the stages have different requirements (for example, one is meant to be run on special-purpose hardware), this should be taken into consideration in assigning stages to PEs.

If there are fewer PEs than pipeline stages, then multiple stages must be assigned to the same PE, preferably in a way that improves, or at least does not much reduce, overall performance. Stages that do not share many resources can be allocated to the same PE; for example, a stage that writes to a disk and a stage that involves primarily CPU computation might be good candidates to share a PE. If the amount of work to process a data element varies among stages, stages involving less work may be allocated to the same PE, thereby possibly improving load balance. Assigning adjacent stages to the same PE can reduce communication costs. It might also be worthwhile to consider combining adjacent stages of the pipeline into a single stage.
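When there are fewer PEs than stages, the advice above (group adjacent stages to cut communication, balance load across PEs) amounts to partitioning the stage sequence into contiguous groups that minimize the most heavily loaded PE. A brute-force sketch with made-up stage costs:

```python
# Assign adjacent stages to the same PE while minimizing the
# bottleneck (most heavily loaded) PE: try every contiguous split.
from itertools import combinations

def best_contiguous_assignment(stage_costs, n_pes):
    """Split stages into n_pes contiguous groups, minimizing max group cost."""
    n = len(stage_costs)
    best, best_groups = float("inf"), None
    for cuts in combinations(range(1, n), n_pes - 1):
        bounds = [0, *cuts, n]
        groups = [stage_costs[bounds[i]:bounds[i + 1]] for i in range(n_pes)]
        bottleneck = max(sum(g) for g in groups)
        if bottleneck < best:
            best, best_groups = bottleneck, groups
    return best, best_groups

# Five stages with uneven costs, three PEs available:
print(best_contiguous_assignment([4, 1, 1, 6, 3], 3))
# -> (6, [[4, 1, 1], [6], [3]]): the two light stages share a PE
#    with stage 1, and the expensive stage gets a PE to itself.
```

This is the classic linear-partition problem; brute force is fine for a handful of stages, and a dynamic-programming formulation handles larger pipelines.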

If there are more PEs than pipeline stages, it is worthwhile to consider parallelizing one or more of the pipeline stages using an appropriate Algorithm Structure pattern, as discussed previously, and allocating more than one PE to the parallelized stage(s). This is particularly effective if the parallelized stage was previously a bottleneck (taking more time than the other stages and thereby dragging down overall performance).

Another way to make use of more PEs than pipeline stages, if there are no temporal constraints among the data items themselves (that is, it doesn't matter if, say, data item 3 is computed before data item 2), is to run multiple independent pipelines in parallel. This improves throughput but not latency, since it still takes the same amount of time for a data element to traverse the pipeline.

Examples
Fourier-transform computations. A type of calculation widely used in signal processing involves performing the following computations repeatedly on different sets of data.

1. Perform a discrete Fourier transform (DFT) on a set of data.
2. Manipulate the result of the transform elementwise.
3. Perform an inverse DFT on the result of the manipulation.

Examples of such calculations include convolution, correlation, and filter operations. A calculation of this form can easily be performed by a three-stage pipeline.

- The first stage of the pipeline performs the initial Fourier transform; it repeatedly obtains one set of input data, performs the transform, and passes the result to the second stage of the pipeline.
- The second stage of the pipeline performs the desired elementwise manipulation; it repeatedly obtains a partial result (of applying the initial Fourier transform to an input set of data) from the first stage of the pipeline, performs its manipulation, and passes the result to the third stage of the pipeline. This stage can often itself be parallelized using one of the other Algorithm Structure patterns.
- The third stage of the pipeline performs the final inverse Fourier transform; it repeatedly obtains a partial result (of applying the initial Fourier transform and then the elementwise manipulation to an input set of data) from the second stage of the pipeline, performs the inverse Fourier transform, and outputs the result.

Each stage of the pipeline processes one set of data at a time. However, except during the initial filling of the pipeline, all stages of the pipeline can operate concurrently: while the first stage is processing the Nth set of data, the second stage is processing the (N-1)th set of data, and the third stage is processing the (N-2)th set of data.
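The three stages can be sketched end to end in a few lines. Here a naive O(n^2) DFT built on cmath stands in for a real FFT library, and the elementwise manipulation is a crude low-pass filter; all names and numbers are illustrative.

```python
# Three-stage DFT pipeline sketch: forward transform, elementwise
# manipulation, inverse transform. Each "stage" is a plain function
# so the dataflow between them is easy to see.
import cmath

def dft(xs):                       # stage 1: forward transform, O(n^2)
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, x in enumerate(xs)) for k in range(n)]

def manipulate(Xs):                # stage 2: elementwise manipulation
    # crude low-pass: zero out the upper half of the spectrum
    half = len(Xs) // 2
    return [X if k < half else 0 for k, X in enumerate(Xs)]

def inverse_dft(Xs):               # stage 3: inverse transform
    n = len(Xs)
    return [sum(X * cmath.exp(2j * cmath.pi * k * i / n)
                for k, X in enumerate(Xs)) / n for i in range(n)]

data = [1.0, 2.0, 3.0, 4.0]
out = inverse_dft(manipulate(dft(data)))
print([round(abs(v), 3) for v in out])   # smoothed version of the input
```

Feeding each function's output into the next stage's input queue (as in the threaded sketch in the Solution section) turns this sequential composition into the concurrent three-stage pipeline described above.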

Let's see an example of a low-pass filter. In order to avoid implementation specifics of a DFT and an inverse DFT, this example uses MATLAB. The DFT function in MATLAB is fft2, and the inverse DFT function is ifft2. The three stages explained are the following.

(1) Read the image file and do a DFT.

    >> lena = imread('lenagray.bmp');
    >> imshow(lena);
    >> lenadft = fft2(double(lena));        % do a dft
    >> imshow(real(fftshift(lenadft)));     % display the dft

(2) Crop a circle out.

    >> [M,N] = size(lenadft);
    >> u = 0:(M-1);
    >> v = 0:(N-1);
    >> idx = find(u>M/2); u(idx) = u(idx) - M;
    >> idy = find(v>N/2); v(idy) = v(idy) - N;
    >> [V,U] = meshgrid(v,u);
    >> D = sqrt(U.^2+V.^2);
    >> P = 20;
    >> H = double(D<=P);                    % make mask of size P
    >> lenalowpass = lenadft .* H;
    >> imshow(H);
    >> imshow(real(fftshift(lenalowpass)));

(3) Do an inverse DFT.

    >> lenanew = real(ifft2(double(lenalowpass)));
    >> imshow(lenanew, []);

Let's see how the actual pipeline works. Due to limited resources, let's assume that the machine can only process an image smaller than 512 pixels x 512 pixels. If an image larger than this size comes in, the pipeline should report an error, though it should not affect the other images in the pipeline. In this case we need another stage for error handling. In this diagram, time flows from top to bottom.

We have four images: Dokdo, Berkeley, Nehalem, and iPhone. The Berkeley image's height is 540 pixels, so it should trigger an error.

[Figure: timing diagram of the four-stage pipeline (Stage 1: Read & DFT, Stage 2: Low-pass filter, Stage 3: Inverse DFT, Stage 4: Error reporting), with time flowing from top to bottom. The Dokdo, Nehalem, and iPhone images pass through stages 1-3 successfully. The Berkeley image is reported as "Error: image too big"; it is forwarded to the error-reporting stage with no processing by stages 2 and 3, while the remaining images continue through the pipeline unaffected.]

Known uses
Many applications in signal and image processing are implemented as pipelines.

The OPUS [SR98] system is a pipeline framework developed by the Space Telescope Science Institute, originally to process telemetry data from the Hubble Space Telescope and later employed in other applications. OPUS uses a blackboard architecture built on top of a network file system for interstage communication, and includes monitoring tools and support for error handling.

Airborne surveillance radars use space-time adaptive processing (STAP) algorithms, which have been implemented as a parallel pipeline [CLW+00]. Each stage is itself a parallel algorithm, and the pipeline requires data redistribution between some of the stages.

Fx [GOS94], a parallelizing Fortran compiler based on HPF [HPF97], has been used to develop several example applications [DGO+94, SSOG93] that combine data parallelism (similar to the form of parallelism captured in the Geometric Decomposition pattern) and pipelining. For example, one application performs 2D Fourier transforms on a sequence of images via a two-stage pipeline (one stage for the row transforms and one stage for the column transforms), with each stage being itself parallelized using data parallelism. The SIGPLAN paper [SSOG93] is especially interesting in that it presents performance figures comparing this approach with a straight data-parallelism approach.

[J92] presents some finer-grained applications of pipelining, including inserting a sequence of elements into a 2-3 tree and pipelined mergesort.

Related patterns
Pipe and Filter pattern. This pattern is very similar to the Pipe and Filter pattern; the key difference is that this pattern explicitly discusses concurrency.

Task Parallelism pattern. For applications in which there are no temporal dependencies between the data inputs, an alternative to this pattern is a design based on multiple sequential pipelines executing in parallel and using the Task Parallelism pattern.

Discrete Event pattern. The Pipeline pattern is similar to the Discrete Event pattern in that both patterns apply to problems where it is natural to decompose the

computation into a collection of semi-independent tasks. The difference is that the Discrete Event pattern is irregular and asynchronous where the Pipeline pattern is regular and synchronous: in the Pipeline pattern, the semi-independent tasks represent the stages of the pipeline, the structure of the pipeline is static, and the interaction between successive stages is regular and loosely synchronous. In the Discrete Event pattern, however, the tasks can interact in very irregular and asynchronous ways, and there is no requirement for a static structure.

References
[SR98] Daryl A. Swade and James F. Rose. OPUS: A flexible pipeline data processing environment. In Proceedings of the AIAA/USU Conference on Small Satellites, September 1998.

[CLW+00] A. Choudhary, W. Liao, D. Weiner, P. Varshney, R. Linderman, and R. Brown. Design, implementation, and evaluation of parallel pipelined STAP on parallel computers. IEEE Transactions on Aerospace and Electronic Systems, 36(2):528-548, April 2000.

[GOS94] Thomas Gross, David R. O'Hallaron, and Jaspal Subhlok. Task parallelism in a High Performance Fortran framework. IEEE Parallel & Distributed Technology, 2(3):16-26, 1994.

[HPF97] High Performance Fortran Forum. High Performance Fortran language specification, version 2.0, 1997.

[DGO+94] P. Dinda, T. Gross, D. O'Hallaron, E. Segall, J. Stichnoth, J. Subhlok, J. Webb, and B. Yang. The CMU task parallel program suite. Technical Report CMU-CS-94-131, School of Computer Science, Carnegie Mellon University, March 1994.

[SSOG93] Jaspal Subhlok, James M. Stichnoth, David R. O'Hallaron, and Thomas Gross. Exploiting task and data parallelism on a multicomputer. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM Press, May 1993.

[J92] J. JaJa. An Introduction to Parallel Algorithms. Addison-Wesley, 1992.

Author
Modified by Yunsup Lee, Ver 1.0 (March 11, 2009), based on the Pipeline Pattern described in Section 4.8 of PPP, by Tim Mattson et al.
