4/28/2015

Window functions and grouped mutate/filter

2015-01-13

A window function is a variation on an aggregation function. Where an aggregation function, like sum() and mean(), takes n inputs and returns a single value, a window function returns n values. The output of a window function depends on all its input values, so window functions don't include functions that work element-wise, like + or round(). Window functions include variations on aggregate functions, like cumsum() and cummean(), functions for ranking and ordering, like rank(), and functions for taking offsets, like lead() and lag().
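
The distinction can be seen on a small vector. The following is a minimal sketch in base R; the vector x is made up for illustration. sum() collapses n inputs to one value, cumsum() returns n values that depend on several inputs at once, and round() also returns n values, but each depends only on its own element, so it does not count as a window function.

```r
# Illustrative vector (not from the Lahman data)
x <- c(1.4, 2.6, 3.1, 4.9)

agg  <- sum(x)     # aggregation: n inputs, one output
win  <- cumsum(x)  # window function: n inputs, n outputs
elem <- round(x)   # element-wise: each output depends on one input only

length(agg)
#> [1] 1
length(win)
#> [1] 4

# Changing the first input changes every cumulative sum,
# but only the first rounded value.
x2 <- replace(x, 1, 10.4)
cumsum(x2) != win
#> [1] TRUE TRUE TRUE TRUE
round(x2) != elem
#> [1]  TRUE FALSE FALSE FALSE
```
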
Window functions are used in conjunction with mutate and filter to solve a wide range of problems, some of which are shown below:
library(Lahman)
library(dplyr)

batting <- select(tbl_df(Batting), playerID, yearID, teamID, G, AB:H)
batting <- arrange(batting, playerID, yearID, teamID)
players <- group_by(batting, playerID)

# For each player, find the two years with most hits
filter(players, min_rank(desc(H)) <= 2 & H > 0)
# Within each player, rank each year by the number of games played
mutate(players, G_rank = min_rank(G))

# For each player, find every year that was better than the previous year
filter(players, G > lag(G))
# For each player, compute avg change in games played per year
mutate(players, G_change = (G - lag(G)) / (yearID - lag(yearID)))

# For each player, find all years where they played more games than average
filter(players, G > mean(G))
# For each player, compute a z score based on number of games played
mutate(players, G_z = (G - mean(G)) / sd(G))

This vignette is broken down into two sections. First you'll learn about the five families of window functions in R, and what you can use them for. If you're only working with local data sources, you can stop there. Otherwise, continue on to learn about window functions in SQL. They are relatively new, but are supported by Postgres, Amazon's Redshift and Google's BigQuery. The window functions themselves are basically the same (modulo a few name conflicts), but their specification is a little different. I'll briefly review how they work, and then show how dplyr translates their R equivalents to SQL.
Before reading this vignette, you should be familiar with mutate() and filter(). If you want to use window functions with SQL databases, you should also be familiar with the basics of dplyr's SQL translation.

Types of window functions
There are five main families of window functions. Two families are unrelated to aggregation functions:
- Ranking and ordering functions: row_number(), min_rank() (RANK in SQL), dense_rank(), cume_dist(), percent_rank(), and ntile(). These functions all take a vector to order by, and return various types of ranks.
- Offsets: lead() and lag() allow you to access the next and previous values in a vector, making it easy to compute differences and trends.
The other three families are variations on familiar aggregate functions:
- Cumulative aggregates: cumsum(), cummin(), cummax() (from base R), and cumall(), cumany(), and cummean() (from dplyr).
- Rolling aggregates operate in a fixed width window. You won't find them in base R or in dplyr, but there are many implementations in other packages, such as RcppRoll.
- Recycled aggregates, where an aggregate is repeated to match the length of the input. These are not needed in R because vector recycling automatically recycles aggregates where needed. They are important in SQL, because the presence of an aggregation function usually tells the database to return only one row per group.
Each family is described in more detail below, focussing on the general goals and how to use them with dplyr. For more details, refer to the individual function documentation.
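
Rolling aggregates are the one family not covered further in this vignette, so here is a minimal sketch of a centred three-value rolling mean using stats::filter() from base R. The vector g and the window width are assumptions chosen for illustration; packages such as RcppRoll offer dedicated functions for this.

```r
# Illustrative data: games played in six consecutive seasons (made up)
g <- c(10, 20, 30, 40, 50, 60)

# Centred moving average of width 3: equal weights of 1/3, and
# sides = 2 centres the window on the current element.
# stats:: is spelled out because dplyr also exports a filter().
roll_mean3 <- as.numeric(stats::filter(g, rep(1 / 3, 3), sides = 2))
roll_mean3
# The first and last values are NA because the window is incomplete there.
```
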

Ranking functions
The ranking functions are variations on a theme, differing in how they handle ties:
x <- c(1, 1, 2, 2, 2)

row_number(x)
#> [1] 1 2 3 4 5
min_rank(x)
#> [1] 1 1 3 3 3
dense_rank(x)
#> [1] 1 1 2 2 2

If you're familiar with R, you may recognise that row_number() and min_rank() can be computed with the base rank() function and various values of the ties.method argument. These functions are provided to save a little typing, and to make it easier to convert between R and SQL.
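
To make the correspondence with base R concrete, the sketch below reproduces the three outputs above without dplyr. It assumes a vector with no missing values; the match() idiom for dense ranks is one possible base R equivalent, not something the vignette itself prescribes.

```r
x <- c(1, 1, 2, 2, 2)

# Ties broken by position: same result as row_number(x)
rank(x, ties.method = "first")
#> [1] 1 2 3 4 5

# Ties share the smallest rank: same result as min_rank(x)
rank(x, ties.method = "min")
#> [1] 1 1 3 3 3

# rank() has no ties.method for dense ranks; matching each value against
# the sorted unique values gives the same result as dense_rank(x)
match(x, sort(unique(x)))
#> [1] 1 1 2 2 2
```
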
Two other ranking functions return numbers between 0 and 1. percent_rank() gives the percentage of the rank, computed by rescaling min_rank() to the range 0 to 1; cume_dist() gives the proportion of values less than or equal to the current value.
cume_dist(x)
#> [1] 0.4 0.4 1.0 1.0 1.0
percent_rank(x)
#> [1] 0.0 0.0 0.5 0.5 0.5
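
The relationship between these two functions and ordinary ranks can be sketched in base R. The formulas below reproduce the output above for a vector without missing values; they are meant as an illustration of the definitions, not as a description of how dplyr implements them.

```r
x <- c(1, 1, 2, 2, 2)
n <- length(x)

# Proportion of values <= each value: the largest tied rank divided by n
rank(x, ties.method = "max") / n
#> [1] 0.4 0.4 1.0 1.0 1.0

# min_rank rescaled to the range 0 to 1: (min_rank - 1) / (n - 1)
(rank(x, ties.method = "min") - 1) / (n - 1)
#> [1] 0.0 0.0 0.5 0.5 0.5
```
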

These are useful if you want to select (for example) the top 10% of records within each group. For example:
# Selects best two years
filter(players, min_rank(desc(G)) <= 2)
# Selects best 10% of years
filter(players, cume_dist(desc(G)) < 0.1)

Finally, ntile() divides the data up into n evenly sized buckets. It's a coarse ranking, and it can be used with mutate() to divide the data into buckets for further summary. For example, we could use ntile() to divide the players within a team into four ranked groups, and calculate the average number of games within each group.
by_team_player <- group_by(batting, teamID, playerID)
by_team <- summarise(by_team_player, G = sum(G))
by_team_quartile <- group_by(by_team, quartile = ntile(G, 4))
summarise(by_team_quartile, mean(G))

quartile     mean(G)
       1    5.355776
       2   24.912267
       3   77.288933
       4  373.693195

All ranking functions rank from lowest to highest so that small input values get small ranks. Use desc() to rank from highest to lowest.
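
A short illustration of the two directions, using made-up games-played values (the vector g is an assumption for the example):

```r
library(dplyr)

g <- c(162, 20, 95, 162)   # illustrative games-played values

min_rank(g)         # smallest value gets rank 1
#> [1] 3 1 2 3
min_rank(desc(g))   # largest value gets rank 1
#> [1] 1 4 3 1
```
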

Lead and lag
lead() and lag() produce offset versions of an input vector that are either ahead of or behind the original vector.
x <- 1:5
lead(x)
#> [1]  2  3  4  5 NA
lag(x)
#> [1] NA  1  2  3  4

You can use them to:
- Compute differences or percent changes.
  # Compute the change in games played from the previous season
  mutate(players, G_delta = G - lag(G))
  Using lag() is more convenient than diff() because for n inputs diff() returns n - 1 outputs.
- Find out when a value changes.
  # Find when a player changed teams
  filter(players, teamID != lag(teamID))
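
The length difference between diff() and lag() is easy to check on a small vector. The values below are made up for illustration:

```r
library(dplyr)

g <- c(11, 45, 25, 47)   # illustrative games played per season

diff(g)             # n - 1 values, so it cannot be used as a new column as-is
#> [1]  34 -20  22
g - dplyr::lag(g)   # n values, padded with NA at the start
#> [1]  NA  34 -20  22
```

Because the lag() version has one value per input row, it can be used directly inside mutate().
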
lead() and lag() have an optional argument order_by. If set, instead of using the row order to determine which value comes before another, they will use another variable. This is important if you have not already sorted the data, or you want to sort one way and lag another.
Here's a simple example of what happens if you don't specify order_by when you need it:
df <- data.frame(year = 2000:2005, value = (0:5) ^ 2)
# sample() shuffles the rows at random, so the "wrong" result varies between runs
scrambled <- df[sample(nrow(df)), ]

wrong <- mutate(scrambled, running = cumsum(value))
arrange(wrong, year)
#>   year value running
#> 1 2000     0       5
#> 2 2001     1       5
#> 3 2002     4       4
#> 4 2003     9      55
#> 5 2004    16      46
#> 6 2005    25      30

right <- mutate(scrambled, running = order_by(year, cumsum(value)))
arrange(right, year)
#>   year value running
#> 1 2000     0       0
#> 2 2001     1       1
#> 3 2002     4       5
#> 4 2003     9      14
#> 5 2004    16      30
#> 6 2005    25      55
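
The same idea applies to lag() itself through its order_by argument. The sketch below uses the same year and value pairs in a fixed scrambled order (chosen by hand here so the result is reproducible) and compares lagging by row position with lagging by year; the column names prev_by_row and prev_by_year are made up for the example.

```r
library(dplyr)

# Same values as above, but in a fixed (non-random) scrambled order
scrambled <- data.frame(
  year  = c(2002, 2000, 2005, 2001, 2004, 2003),
  value = c(4, 0, 25, 1, 16, 9)
)

out <- mutate(
  scrambled,
  prev_by_row  = lag(value),                  # previous row, whatever its year
  prev_by_year = lag(value, order_by = year)  # value from the previous year
)

sorted <- arrange(out, year)
sorted$prev_by_row
#> [1]  4 25 NA 16  1  0
sorted$prev_by_year
#> [1] NA  0  1  4  9 16
```
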

Cumulative aggregates
Base R provides cumulative sum (cumsum()), cumulative min (cummin()) and cumulative max (cummax()). (It also provides cumprod() but that is rarely useful). Other common accumulating functions are cumany() and cumall(), cumulative versions of || and &&, and cummean(), a cumulative mean. These are not included in base R, but efficient versions are provided by dplyr.
cumany() and cumall() are useful for selecting all rows up to, or all rows after, a condition is true for the first (or last) time. For example, we can use cumany() to find all records for a player from the first year in which they played more than 150 games onwards:
filter(players, cumany(G > 150))
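
To see both functions at work, here is a small sketch on made-up seasons for one hypothetical player. cumany() keeps everything from the first TRUE onwards, while cumall() keeps everything before the first FALSE.

```r
library(dplyr)

# Illustrative seasons for one hypothetical player
seasons <- data.frame(
  yearID = 2001:2006,
  G      = c(40, 120, 155, 90, 160, 30)
)

# Every season from the first one with more than 150 games onwards
after_150 <- dplyr::filter(seasons, cumany(G > 150))
after_150$yearID
#> [1] 2003 2004 2005 2006

# Every season before the first one with more than 150 games
before_150 <- dplyr::filter(seasons, cumall(G <= 150))
before_150$yearID
#> [1] 2001 2002
```
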

Like lead and lag, you may want to control the order in which the accumulation occurs. None of the built-in functions have an order_by argument so dplyr provides a helper: order_by(). You give it the variable you want to order by, and then the call to the window function:
x <- 1:10
y <- 10:1
order_by(y, cumsum(x))
#>  [1] 55 54 52 49 45 40 34 27 19 10

This function uses a bit of non-standard evaluation, so I wouldn't recommend using it inside another function; use the simpler but less concise with_order() instead.
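
As a rough sketch of what that looks like, the wrapper below (cumsum_by is a hypothetical name, not part of dplyr) passes the ordering vector, the function, and its argument to with_order() as ordinary values:

```r
library(dplyr)

# Hypothetical helper: cumulative sum of `x`, accumulated in the order
# given by `by`. with_order() takes the ordering vector, the function and
# then that function's arguments, so no expression has to be captured.
cumsum_by <- function(x, by) {
  with_order(by, cumsum, x)
}

x <- 1:10
y <- 10:1
cumsum_by(x, y)
#>  [1] 55 54 52 49 45 40 34 27 19 10
```
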

Recycled aggregates
R's vector recycling makes it easy to select values that are higher or lower than a summary. I call this a recycled aggregate because the value of the aggregate is recycled to be the same length as the original vector. Recycled aggregates are useful if you want to find all records greater than the mean or less than the median:
filter(players, G > mean(G))
filter(players, G < median(G))

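While most SQL databases don't have an equivalent of median() or quantile(), when filtering you can achieve much the same effect with ntile(). For example, x > median(x) is roughly equivalent to ntile(x, 2) == 2; x > quantile(x, 0.75) is roughly equivalent to ntile(x, 100) > 75 or ntile(x, 4) > 3. The two forms can disagree when there are ties, because ntile() always splits the rows into buckets of near-equal size.
filter(players, ntile(G, 2) == 2)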
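
A quick check of the median case on made-up vectors: with distinct values and an even length the two conditions select the same elements, while a vector of tied values shows where they part ways.

```r
library(dplyr)

x <- c(5, 1, 9, 3, 7, 11)   # illustrative values, no ties, even length

x > median(x)
#> [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE
ntile(x, 2) == 2
#> [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE

# With ties the two can disagree: nothing exceeds the median here,
# but ntile() still has to put half of the rows in bucket 2.
ties <- c(1, 1, 1, 1)
sum(ties > median(ties))
#> [1] 0
sum(ntile(ties, 2) == 2)
#> [1] 2
```
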

You can also use this idea to select the records with the highest (x == max(x)) or lowest value (x == min(x)) for a field, but the ranking functions give you more control over ties, and allow you to select any number of records.
Recycled aggregates are also useful in conjunction with mutate(). For example, with the batting data, we could compute the "career year", the number of years a player has played since they entered the league:
mutate(players, career_year = yearID - min(yearID) + 1)

playerID   yearID  teamID   G  AB  R  H  career_year
aardsda01    2004  SFN     11   0  0  0            1
aardsda01    2006  CHN     45   2  0  0            3
aardsda01    2007  CHA     25   0  0  0            4
aardsda01    2008  BOS     47   1  0  0            5
..           ..

Or, as in the introductory example, we could compute a z-score:
mutate(players, G_z = (G - mean(G)) / sd(G))

playerID   yearID  teamID   G  AB  R  H         G_z
aardsda01    2004  SFN     11   0  0  0  -1.1167685
aardsda01    2006  CHN     45   2  0  0   0.3297126
aardsda01    2007  CHA     25   0  0  0  -0.5211586
aardsda01    2008  BOS     47   1  0  0   0.4147997
..           ..
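
If you don't have the Lahman package at hand, the same two computations can be tried on a toy table. Everything in the sketch below (the players "a" and "b" and their numbers) is made up for illustration; the point is that min(), mean() and sd() are evaluated once per group and then recycled across that group's rows.

```r
library(dplyr)

# Made-up seasons for two hypothetical players
toy <- data.frame(
  playerID = c("a", "a", "a", "b", "b"),
  yearID   = c(2004, 2006, 2007, 2010, 2011),
  G        = c(10, 20, 30, 100, 120)
)

out <- toy %>%
  group_by(playerID) %>%
  mutate(
    career_year = yearID - min(yearID) + 1,  # min() is recycled within each player
    G_z         = (G - mean(G)) / sd(G)      # mean() and sd() likewise
  ) %>%
  ungroup()

out$career_year
#> [1] 1 3 4 1 2
out$G_z[1:3]
#> [1] -1  0  1
```
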

Window functions in SQL
Window functions have a slightly different flavour in SQL. The syntax is a little different, and the cumulative, rolling and recycled aggregate functions are all based on the simple aggregate function. The goal in this section is not to tell you everything you need to know about window functions in SQL, but to remind you of the basics and show you how dplyr translates your R expressions into SQL.

Structure of a window function in SQL
In SQL, window functions have the form [expression] OVER ([partition clause] [order clause] [frame_clause]):

- The expression is a combination of variable names and window functions. Support for window functions varies from database to database, but most support the ranking functions, lead, lag, nth, first, last, count, min, max, sum, avg and stddev. dplyr generates this from the R expression in your mutate or filter call.
- The partition clause specifies how the window function is broken down over groups. It plays an analogous role to GROUP BY for aggregate functions, and group_by() in dplyr. It is possible for different window functions to be partitioned into different groups, but not all databases support it, and neither does dplyr.
- The order clause controls the ordering (when it makes a difference). This is important for the ranking functions since it specifies which variables to rank by, but it's also needed for cumulative functions and lead. Whenever you're thinking about before and after in SQL, you must always tell it which variable defines the order. In dplyr you do this with arrange(). If the order clause is missing when needed, some databases fail with an error message while others return non-deterministic results.
- The frame clause defines which rows, or frame, are passed to the window function, describing which rows (relative to the current row) should be included. The frame clause provides two offsets which determine the start and end of the frame. There are three special values: -Inf means to include all preceding rows (in SQL, "unbounded preceding"), 0 means the current row ("current row"), and Inf means all following rows ("unbounded following"). The complete set of options is comprehensive, but fairly confusing, and is summarised visually below.

[Figure: visual summary of the frame clause options]

Of the many possible specifications, there are only three that are commonly used. They select between aggregation variants:
- Recycled: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
- Cumulative: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
- Rolling: ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
dplyr generates the frame clause based on whether you're using a recycled aggregate or a cumulative aggregate.
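
One way to build intuition for the frame offsets is to emulate them in plain R. The helper below, frame_apply(), is hypothetical and written only for this illustration; it is not part of dplyr or of any database. It applies a function to the rows from `before` rows behind to `after` rows ahead of each position, with Inf standing in for "unbounded", and assumes a single partition that is already sorted.

```r
# Hypothetical helper: apply `fun` to the rows from `before` rows behind
# to `after` rows ahead of each position. Inf stands for UNBOUNDED.
frame_apply <- function(x, before, after, fun) {
  n <- length(x)
  vapply(seq_len(n), function(i) {
    lo <- max(1, i - before)   # i - Inf is -Inf, clamped to the first row
    hi <- min(n, i + after)    # i + Inf is Inf, clamped to the last row
    fun(x[lo:hi])
  }, numeric(1))
}

g <- c(10, 20, 30, 40, 50)   # one partition, already in order

frame_apply(g, Inf, Inf, sum)   # recycled: unbounded preceding to unbounded following
#> [1] 150 150 150 150 150
frame_apply(g, Inf, 0, sum)     # cumulative: unbounded preceding to current row
#> [1]  10  30  60 100 150
frame_apply(g, 2, 2, mean)      # rolling: 2 preceding to 2 following
#> [1] 20 25 30 35 40
```

Note that the frame is simply cut off at the edges of the partition, which is why the first rolling value averages only three rows.
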
It's easiest to understand these specifications by looking at a few examples. Simple examples just need the partition and order clauses:
- Rank each year within a player by number of hits:
  RANK() OVER (PARTITION BY playerID ORDER BY H DESC)
- Compute change in number of games from one year to the next:
  G - LAG(G) OVER (PARTITION BY playerID ORDER BY yearID)

Aggregate variants are more verbose because we also need to supply the frame clause:
- Running sum of G for each player:
  SUM(G) OVER (PARTITION BY playerID ORDER BY yearID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
- Compute the career year:
  YearID - min(YearID) OVER (PARTITION BY playerID ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) + 1
- Compute a rolling average of games played:
  AVG(G) OVER (PARTITION BY playerID ORDER BY yearID ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING)

You'll notice that window functions in SQL are more verbose than in R. This is because different window functions can have different partitions, and the frame specification is more general than the two aggregate variants (recycled and cumulative) provided by dplyr. dplyr makes a tradeoff: you can't access rarely used window function capabilities (unless you write raw SQL), but in return, common operations are much more succinct.

Translating dplyr to SQL
To see how individual window functions are translated to SQL, we can use translate_sql() with the argument window = TRUE.
if (has_lahman("postgres")) {
  players_db <- group_by(tbl(lahman_postgres(), "Batting"), playerID)

  print(translate_sql(mean(G), tbl = players_db, window = TRUE))
  print(translate_sql(cummean(G), tbl = players_db, window = TRUE))
  print(translate_sql(rank(G), tbl = players_db, window = TRUE))
  print(translate_sql(ntile(G, 2), tbl = players_db, window = TRUE))
  print(translate_sql(lag(G), tbl = players_db, window = TRUE))
}
#> <SQL> avg("G") OVER (PARTITION BY "playerID" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
#> <SQL> mean("G") OVER (PARTITION BY "playerID" ROWS UNBOUNDED PRECEDING)
#> <SQL> rank() OVER (PARTITION BY "playerID" ORDER BY "G")
#> <SQL> NTILE(2.0) OVER (PARTITION BY "playerID" ORDER BY "G")
#> <SQL> LAG("G", 1, NULL) OVER (PARTITION BY "playerID")

If the tbl has been arranged previously, then that ordering will be used for the order clause:
if (has_lahman("postgres")) {
  players_by_year <- arrange(players_db, yearID)
  print(translate_sql(cummean(G), tbl = players_by_year, window = TRUE))
  print(translate_sql(rank(), tbl = players_by_year, window = TRUE))
  print(translate_sql(lag(G), tbl = players_by_year, window = TRUE))
}
#> <SQL> mean("G") OVER (PARTITION BY "playerID" ORDER BY "yearID" ROWS UNBOUNDED PRECEDING)
#> <SQL> rank() OVER (PARTITION BY "playerID" ORDER BY "yearID")
#> <SQL> LAG("G", 1, NULL) OVER (PARTITION BY "playerID" ORDER BY "yearID")

There are some challenges when translating window functions between R and SQL, because dplyr tries to keep the window functions as similar as possible to both the existing R analogues and to the SQL functions. This means that there are three ways to control the order clause depending on which window function you're using:
- For ranking functions, the ordering variable is the first argument: rank(x), ntile(y, 2). If omitted or NULL, dplyr will use the default ordering associated with the tbl (as set by arrange()).
- Accumulating aggregates only take a single argument (the vector to aggregate). To control ordering, use order_by().
- Aggregates implemented in dplyr (lead, lag, nth_value, first_value, last_value) have an order_by argument. Supply it to override the default ordering.
The three options are illustrated in the snippet below:
mutate(players,
  min_rank(yearID),
  order_by(yearID, cumsum(G)),
  lead(order_by = yearID, G)
)
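
The same three forms can be tried on a local data frame. The sketch below uses made-up seasons for one hypothetical player, deliberately stored out of order, and names the results (year_rank, G_running and G_next are names invented for this example) so they are easier to inspect.

```r
library(dplyr)

# Made-up seasons for one hypothetical player, deliberately out of order
toy <- data.frame(yearID = c(2006, 2004, 2005), G = c(30, 10, 20))

out <- mutate(
  toy,
  year_rank = min_rank(yearID),             # ranking: order by the first argument
  G_running = order_by(yearID, cumsum(G)),  # accumulating: wrap the call in order_by()
  G_next    = lead(G, order_by = yearID)    # offset: use the order_by argument
)

out$year_rank
#> [1] 3 1 2
out$G_running
#> [1] 60 10 30
out$G_next
#> [1] NA 20 30
```
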

Currently there is no way to order by multiple variables, except by setting the default ordering with arrange(). This will be added in a future release.

Translating filters based on window functions
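There are some restrictions on window functions in SQL that make their use with WHERE somewhat challenging. Take this simple example, where we want to find the year each player played the fewest games (rank() sorts from lowest to highest, so rank 1 is the smallest value):
filter(players, rank(G) == 1)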

The following straightforward translation does not work because window functions are only allowed in SELECT and ORDER BY.
SELECT *
FROM Batting
WHERE rank() OVER (PARTITION BY "playerID" ORDER BY "G") = 1;

Computing the window function in SELECT and referring to it in WHERE or HAVING doesn't work either, because WHERE and HAVING are computed before windowing functions.
SELECT *, rank() OVER (PARTITION BY "playerID" ORDER BY "G") as rank
FROM Batting
WHERE rank = 1;

SELECT *, rank() OVER (PARTITION BY "playerID" ORDER BY "G") as rank
FROM Batting
HAVING rank = 1;

Instead, we must use a subquery:
SELECT *
FROM (
  SELECT *, rank() OVER (PARTITION BY "playerID" ORDER BY "G") as rank
  FROM Batting
) tmp
WHERE rank = 1;

And even that query is a slight simplification because it will also add a rank column to the original columns. dplyr takes care of generating the full, verbose, query, so you can focus on your data analysis challenges.
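
The shape of that subquery can be mimicked step by step on a local data frame, which may help when reading the SQL that dplyr generates. The sketch below uses made-up data and a helper column named rnk (a name chosen for this example): the window function is computed first, the filter is applied to its result, and the helper column is dropped again.

```r
library(dplyr)

# Made-up data for two hypothetical players
toy <- data.frame(
  playerID = c("a", "a", "a", "b", "b"),
  yearID   = c(2004, 2005, 2006, 2010, 2011),
  G        = c(30, 10, 20, 120, 100)
)

res <- toy %>%
  group_by(playerID) %>%
  mutate(rnk = rank(G)) %>%     # inner query: compute the window function
  ungroup() %>%
  dplyr::filter(rnk == 1) %>%   # outer query: filter on its result
  select(-rnk)                  # drop the helper column again

res$yearID
#> [1] 2005 2011
```

On a local grouped table, a single filter() call with rank(G) == 1 does all of this in one step.
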

http://cran.rstudio.com/web/packages/dplyr/vignettes/windowfunctions.html
