
Recommendations For Reddit Users

Avideh Taalimanesh and Mohammad Aleagha
Stanford University, December 2012

Abstract

In this paper we attempt to develop an algorithm to generate a set of post recommendations for users of the social news website Reddit given their prior voting history. We attempted three variations of K-means clustering. We first attempted to cluster users simply based on their voting record and then attempted to cluster users based on attributes of the posts they had voted positively on. Both of these approaches produced very large recommendation sets with poor to moderate recall. Finally we attempted to cluster posts based on keywords appearing in the title and observed much higher recall but lower precision as the recommendation sets that were produced were generally much larger. In all three cases we found that the input data was sparse and quite large and would require a significant amount of pruning if these algorithms were to be used in a practical setting. We also found that the sets of recommendations that were generated were often very large and that some heuristics would need to be applied to reduce their size while attempting to preserve the quality of the recommendations.

1 Introduction

1.1 Background and motivation

Reddit is a social news website where users can submit content and have other users comment and vote (up or down) on their submissions. Since 2005, Reddit has grown into a huge community of very active users; in the month of October (2012) alone, Reddit saw 46,839,289 unique users who viewed 3,832,477,975 pages¹. With so many pages, discovering new and interesting content can be very challenging. One way the website has been able to recommend content to its users is by letting them subscribe to subreddits. A subreddit is essentially a community focused on a specific topic such as science or music. Recommendations are then made based on the top voted posts within the subreddits a user is subscribed to. Despite this, users still often find it difficult to find content they are truly interested in. In 2010, Reddit gave its users the option to make their votes publicly available and later released some of that voting data for research purposes². We propose to use this data to generate recommendations for users based on their voting history.

1.2 Data preparation

The format of the publicly available data is simple; each entry consists of a user id, a post id and an up or down vote (+1 or -1). We were able to obtain a total of 7,405,561 votes consisting of 31,553 distinct users voting on 2,046,401 distinct posts. In addition to this voting data, Reddit has a public API³ which allows us to make a request for a particular post id and obtain certain metadata about the post as a JSON string. This metadata includes, among other things, the post's originating domain, the subreddit the post belongs to, as well as the title of the post. For the purposes of this research project and to make operating on the data feasible with the computational resources available to us, we limited our efforts to a set of 1,000 users⁴ voting on 174,886 distinct posts. We wrote a series of scripts in Java to parse the voting data, make requests to Reddit's servers for metadata and to build the input data (design matrices) for our learning algorithms.

¹ http://www.reddit.com/about
² http://www.reddit.com/r/announcements/comments/ddz0s/reddit_wants_your_permission_to_use_your_data_for/, http://www.reddit.com/r/redditdev/comments/bubhl/csv_dump_of_reddit_voting_data/
³ https://github.com/reddit/reddit/wiki/API
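
As a rough illustration of the vote format described above, the following Python sketch groups raw vote records by user. The comma-separated layout, the sample ids and the function name `parse_votes` are assumptions made for this example; the original scripts were written in Java and are not reproduced here.

```python
from collections import defaultdict

def parse_votes(lines):
    """Group (user id, post id, vote) records by user.

    Assumes one comma-separated record per line, with the vote
    written as 1 or -1. This layout is hypothetical.
    """
    votes = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user_id, post_id, vote = line.split(",")
        votes[user_id][post_id] = int(vote)
    return dict(votes)

# Made-up records for illustration only.
sample = ["u1,p1,1", "u1,p2,-1", "u2,p1,1"]
votes = parse_votes(sample)
```

A nested dictionary keeps only the votes that exist, which suits data where most user/post pairs have no vote at all.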

1.3 Overview of approaches

We will attempt to tackle this problem using a few variations of K-means clustering. Our first attempt will be to cluster users simply based on posts they've voted on in the past. The intuition behind this approach is that users who vote similarly on the same set of posts will likely share similar interests. We can leverage this fact to generate recommendations based on posts upvoted by similar users. Our second attempt will again be to cluster users, but this time based on certain attributes of the posts they've voted on, namely originating domain and the subreddit the post belongs to. This will give us a slightly coarser view of a user's interests compared to the first approach but will require a much smaller feature vector that will not grow every time a new post is submitted and will not be as sparse. As before, we can use the clustered users to generate a set of recommendations. The final approach will be to cluster posts rather than users based on keywords appearing in the title of the post. The content of a post can be anything from a news article to a video or even an image but all posts invariably have a title. What's more, Reddit actively encourages its users to give meaningful descriptive titles to their posts⁵. Once posts are clustered based on keywords, we can identify those clusters which contain posts upvoted by a user and use the set of posts from those clusters to generate recommendations.

⁴ We decided on limiting our data set to 1000 users after our IP was blocked by Reddit for making too many requests in a short time period. The Reddit team was kind enough to unblock us once we promised to slow down our requests.
⁵ http://www.reddit.com/help/faq
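
All three approaches rely on K-means, so a minimal sketch of the standard (Lloyd's) iteration may help fix ideas. The paper does not say which implementation was used; the function names, the fixed iteration count, the seeded random initialization and the toy points below are assumptions for illustration only.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm. Returns (assignments, centroids)."""
    rng = random.Random(seed)
    # Initialize centroids with k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Toy data: two well-separated groups of two points each.
points = [[0, 0], [0, 1], [10, 10], [10, 11]]
assignments, centroids = k_means(points, k=2)
```

In the user-clustering approaches each point would be one user's feature vector; in the post-clustering approach each point would be one post's keyword vector.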

2 Methodology

2.1 Approach 1: Clustering users based on votes

The feature vector in this approach consisted of all posts⁶ and the values each feature could take on were -1, 0 or +1 (downvote, no vote and upvote respectively). We ran K-means on 95% of the data (950 users) with k set to 10, 25, 50 and 100. Once clustering was achieved we then, for each of the remaining users u_i, did the following to generate recommendations:
i) We withheld 10% of upvotes from user u_i.
ii) With the remaining votes for u_i, we found the set U_i of users in the same cluster as u_i and constructed the set P_i of all posts upvoted by the users in U_i.
iii) We then filtered the set P_i to remove posts u_i had already voted on to obtain a set of recommended posts R_i (in practice, we could also then rank the posts in R_i by popularity (most upvotes) and then only show the user the top t posts).
iv) We then tested our recommendations using the 10% of withheld upvotes and assigned a score S_i which is (# of withheld upvotes for u_i that appear in R_i) / (# of withheld upvotes for u_i).
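
Steps (i) to (iv) can be sketched in Python as follows. This is a simplified illustration, not the authors' code: the cluster membership is taken as given, and the function names, the random seed and the sample post ids are all hypothetical.

```python
import random

def split_upvotes(upvoted, fraction=0.1, seed=0):
    """Step (i): split upvoted post ids into (kept, withheld) sets."""
    posts = sorted(upvoted)
    random.Random(seed).shuffle(posts)
    n_withheld = max(1, round(len(posts) * fraction))
    return set(posts[n_withheld:]), set(posts[:n_withheld])

def recommend(cluster_upvotes, already_voted):
    """Steps (ii)-(iii): union of cluster-mates' upvotes minus known posts."""
    candidates = set().union(*cluster_upvotes)
    return candidates - set(already_voted)

def recall_score(recommended, withheld):
    """Step (iv): fraction of withheld upvotes found in the recommendations."""
    if not withheld:
        return 0.0
    return len(recommended & withheld) / len(withheld)

# Upvotes of the other users in u_i's cluster (the set U_i), made-up ids.
cluster_upvotes = [{"a", "b", "c"}, {"c", "d"}]
kept_votes = {"a": 1, "x": -1}   # votes u_i still has after withholding
withheld = {"c", "z"}            # withheld upvotes of u_i
R_i = recommend(cluster_upvotes, kept_votes)
S_i = recall_score(R_i, withheld)
```

In this toy case the withheld post "z" was never upvoted by anyone else in the cluster, so it cannot be recommended and S_i cannot reach 1.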

2.2 Approach 2: Clustering users based on attributes of posts upvoted

The feature vector in this approach consisted of the originating domain of the posts⁷ as well as the subreddits they belonged to. The values each feature could take on were the sum of upvotes by a user for posts having those attributes. For example:

      domains             subreddits
      youtube   imgur     music   funnypics
u1    5         7         1       1
u2    2         0         4       3

As before, we ran K-means on 95% of the data (950 users) with k set to 10, 25, 50 and 100. Once clustering was achieved we then repeated the steps (i) to (iv) from 2.1 to obtain a set of recommendations R_i and a score S_i for each user u_i.

⁶ Here we left out posts having only one vote as they provided no valuable information, and were left with 31,833 posts.
⁷ 30,373 posts were considered (these were the posts upvoted by the considered users), with 27,488 different domains and 1,117 different subreddits.

2.3 Approach 3: Clustering posts based on keywords in the title

For this approach, rather than clustering users, we clustered the posts⁸ themselves based on keywords found in the title of the posts. To generate the dictionary of words, we ran Porter's stemming algorithm [1] on the set of words present in the titles of the posts. To further trim down the dictionary, we removed a set of standard stop words such as "the" and "of" [2]. We then generated the feature vectors for each post from this dictionary⁹ where the value of a feature was the presence (1 or 0) of the given word in the title of that post. We then ran K-means on all posts with different values for k. Once clustering was achieved we then, for each of a small set of users u_i (50), did the following to generate recommendations:
i) We withheld 10% of upvotes from user u_i.
ii) With the remaining votes, we found which clusters the remaining upvoted posts from u_i belonged to. From these clusters k_i,j we constructed the set P_i of all posts belonging to k_i,j.
iii) We then filtered the set P_i to remove posts u_i had already voted on to obtain a set of recommended posts R_i.
iv) We then tested our recommendations using the 10% of withheld upvotes and assigned a score S_i which was (# of withheld upvotes for u_i that appear in R_i) / (# of withheld upvotes for u_i).

⁸ 5,397 posts were considered (these were the posts upvoted by the considered users).
⁹ We ended up with a dictionary of 8,880 words.
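
For the post-clustering approach, steps (ii) and (iii) amount to collecting every post that shares a cluster with one of the user's remaining upvoted posts. The sketch below assumes the clustering has already produced a mapping from post id to cluster index; the function name and the sample ids are hypothetical.

```python
from collections import defaultdict

def recommend_from_post_clusters(post_cluster, kept_upvotes, already_voted):
    """Collect every post that shares a cluster with a kept upvote.

    post_cluster maps post id -> cluster index (the clustering output).
    """
    members = defaultdict(set)
    for post, cluster in post_cluster.items():
        members[cluster].add(post)
    # Clusters that contain at least one remaining upvoted post.
    selected = {post_cluster[p] for p in kept_upvotes if p in post_cluster}
    candidates = set()
    for cluster in selected:
        candidates |= members[cluster]
    # Step (iii): drop posts the user has already voted on.
    return candidates - set(already_voted)

post_cluster = {"p1": 0, "p2": 0, "p3": 1, "p4": 2, "p5": 2}
R_i = recommend_from_post_clusters(
    post_cluster,
    kept_upvotes={"p1", "p4"},
    already_voted={"p1", "p4", "p3"},
)
```

Because every post in a selected cluster is a candidate, a few large clusters are enough to make R_i cover most of the posts.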

3 Results and Analysis

3.1 Initial observations

Upon generating the design matrix for our first algorithm, it quickly became obvious that the data was extremely sparse. Of all the posts being considered, a given user had seen and voted on a fraction of 1% of them. This is not unexpected given the huge number of new posts that are submitted to Reddit on a daily basis. In addition, the dimensions of this design matrix (1000 x 31,833) were quite large (and would be expected to grow much larger as time goes on) since the feature vector was made up of the vote for every post under consideration.
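
To make the sparsity concrete, one option is to store each user's row as a mapping from column index to vote and measure the fraction of non-zero entries. The sketch below is only an illustration with made-up votes; the sparse-row representation and the names `vote_row` and `density` are assumptions, not the representation used in the original Java scripts.

```python
def vote_row(user_votes, post_index):
    """Sparse row: column index -> vote (-1 or +1); absent means 0."""
    return {post_index[p]: v for p, v in user_votes.items() if p in post_index}

def density(rows, n_columns):
    """Fraction of non-zero entries in the design matrix."""
    if not rows or n_columns == 0:
        return 0.0
    return sum(len(r) for r in rows) / (len(rows) * n_columns)

post_index = {"p%d" % i: i for i in range(1000)}  # 1,000 made-up posts
rows = [
    vote_row({"p3": 1, "p7": -1}, post_index),
    vote_row({"p3": 1}, post_index),
]
d = density(rows, len(post_index))
```

For comparison, a dense 1000 x 31,833 matrix holds 31,833,000 entries, nearly all of them zero when each user has voted on well under 1% of the posts.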

The design matrix for the second algorithm was slightly less sparse as there was substantial overlap of domains and subreddits between posts. The dimensions of this matrix (1000 x 28,605), while also quite large, were more manageable and would not be expected to grow indefinitely as the number of domains and subreddits will remain relatively constant over time.
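
A row of this second design matrix can be sketched as a pair of upvote counts, one per domain and one per subreddit. The code below is a minimal illustration; the tagged-key layout, the function names and the sample data are assumptions.

```python
from collections import Counter

def attribute_counts(upvoted_posts):
    """Count a user's upvotes per domain and per subreddit.

    upvoted_posts is a list of (domain, subreddit) pairs, one per upvote.
    Keys are tagged so that a domain and a subreddit sharing a name
    do not collide.
    """
    counts = Counter()
    for domain, subreddit in upvoted_posts:
        counts[("domain", domain)] += 1
        counts[("subreddit", subreddit)] += 1
    return counts

def to_vector(counts, columns):
    """Lay the counts out in a fixed column order (missing -> 0)."""
    return [counts.get(c, 0) for c in columns]

columns = [("domain", "youtube"), ("domain", "imgur"),
           ("subreddit", "music"), ("subreddit", "funnypics")]
u = attribute_counts([("youtube", "music"),
                      ("imgur", "funnypics"),
                      ("imgur", "funnypics")])
row = to_vector(u, columns)
```

The column list grows only when a new domain or subreddit appears, not with every new post, which is why this matrix stays narrower than the per-post one.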

The design matrix for the third algorithm would have grown to be extremely large had we continued to consider all posts voted on by 1000 users, not due to the size of the feature vector (the dictionary would have had 22,547 words) but simply due to the number of posts to be clustered (34,764 posts). We opted to perform this clustering for only 50 users (resulting in 5,397 posts and a dictionary of 8,880 words). This was still rather computationally expensive and anecdotally took very long to run.
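
The rows of this third design matrix are binary keyword vectors over the title dictionary. The sketch below shows one way to build them. It is a simplified illustration: stemming is left as a pluggable `stem` function that defaults to the identity (Porter's algorithm is not reimplemented here), and `STOP_WORDS` is a tiny stand-in list rather than the stop word list the authors used.

```python
import re

STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to"}  # tiny stand-in list

def tokenize(title, stem=lambda w: w):
    """Lowercase, split on non-letters, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", title.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]

def build_dictionary(titles, stem=lambda w: w):
    """Sorted list of the distinct (stemmed) words across all titles."""
    return sorted({w for t in titles for w in tokenize(t, stem)})

def binary_vector(title, dictionary, stem=lambda w: w):
    """1 if the dictionary word occurs in the title, else 0."""
    present = set(tokenize(title, stem))
    return [1 if w in present else 0 for w in dictionary]

# Made-up titles for illustration only.
titles = ["The history of jazz", "Jazz and the blues"]
dictionary = build_dictionary(titles)
vectors = [binary_vector(t, dictionary) for t in titles]
```

Each post contributes one row and each dictionary word one column, so the matrix height is driven by the number of posts rather than by the dictionary.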

3.2 Results

             k=10     k=25     k=50     k=100
Avg. S_i     0.7100   0.6636   0.6004   0.3866
Avg. |R_i|   21,884   21,369   19,155   11,819
R ratio*     0.6875   0.6713   0.6017   0.3713
Q score**    1.0328   0.9885   0.9977   1.0413
Table 1: Results for approach 1

             k=10     k=25     k=50     k=100
Avg. S_i     0.1490   0.1299   0.0717   0.0517
Avg. |R_i|   17,179   8,737    5,789    3,053
R ratio*     0.5656   0.2877   0.1906   0.1005
Q score**    0.2633   0.4516   0.3760   0.5148
Table 2: Results for approach 2

             k=10     k=25     k=50     k=100
Avg. S_i     0.9969   0.9815   0.9785   0.9508
Avg. |R_i|   4,334    3,734    3,678    3,084
R ratio*     0.8030   0.6919   0.8030   0.8030
Q score**    1.2414   1.4187   1.2184   1.1840
Table 3: Results for approach 3

* Avg. |R_i| / |All posts|
** Avg. S_i / R ratio

3.3 Analysis

One key fact that must be kept in mind is that the data available to us is in no way complete in the sense that a user's preference is only known for a very small number of posts. Therefore the scores we've assigned to the various recommendation sets we've generated will give us an intuition about the approach taken but do not entirely reflect the quality of the recommendation set (had a user happened to have seen more posts, they may have upvoted those present among the recommendations).

The two metrics of interest when evaluating the approaches we've taken are the score S_i and the size of the recommendation set relative to the number of posts considered which we'll call the R ratio. We want to maximize the average S_i while minimizing the size of the recommendation sets so we'll compute another score Q which we'll define as Avg. S_i / Avg. R ratio.
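
As a quick worked example of these two definitions, the sketch below recomputes the R ratio and Q score from the values reported for approach 1 at k=10 (Avg. S_i = 0.7100, Avg. |R_i| = 21,884, with 31,833 posts considered). The function names are chosen for this illustration only.

```python
def r_ratio(avg_recommendation_size, total_posts):
    """Average recommendation set size relative to all posts considered."""
    return avg_recommendation_size / total_posts

def q_score(avg_s, ratio):
    """Average S_i divided by the R ratio; higher is better."""
    return avg_s / ratio

# Table 1, k=10: Avg. S_i = 0.7100, Avg. |R_i| = 21,884, 31,833 posts.
ratio = r_ratio(21884, 31833)
q = q_score(0.7100, ratio)
```

A Q score near 1 means the recall is about what one would get by recommending a random subset of the same size, so values well above 1 are what distinguish a useful approach.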

Figure 1

We can see from the results that the approach which had the highest Q score was the 3rd approach which, although it generated fairly large recommendation sets, showed a much higher recall with the highest average S_i scores. The 2nd approach did the worst out of the three approaches with both large recommendation sets as well as low average S_i. The 1st approach simply did not have enough data to adequately cluster users and what we observed was usually the formation of one very large cluster containing most of the users with the rest of the clusters containing a very small number of users. This resulted in decent S_i scores for the users in the large cluster (if most other users are in the same cluster as you, chances are one of them will have upvoted an article you upvoted) but very large recommendation sets.

4 Conclusion

4.1 Input data

We found that it was very difficult to generate good recommendations with only a very limited amount of data about each user's preferences. In the 2 methods we used which clustered users based on voting history, we found that in some cases it was simply impossible to recommend all articles that a user had upvoted because no other user in the set had upvoted that article. The sparseness of the feature vectors aside, the sheer size of the sets we would have needed to operate on (number of users and number of posts) would have made it infeasible to cluster all Reddit users. It is obvious that using any of these algorithms in practice would require significant pruning of the data, such as segmenting users based on some attributes (subreddit subscription, geographic location, etc.) and then running the algorithms on each segment. Another factor to take into consideration is the age of a post; to further trim down the data, posts older than a certain threshold could be left out (stale posts are not valuable recommendations anyway).
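
The age-based pruning suggested above could look like the following sketch. It assumes each post's creation time is available as seconds since the epoch; the function name, the one-week threshold and the sample timestamps are hypothetical.

```python
def drop_stale_posts(post_created, now, max_age_seconds):
    """Keep the post ids whose age is at most max_age_seconds.

    post_created maps post id -> creation time in seconds since the epoch.
    """
    return {p for p, created in post_created.items()
            if now - created <= max_age_seconds}

DAY = 24 * 60 * 60
now = 100 * DAY  # made-up "current time" for the example
post_created = {"fresh": 99 * DAY, "week_old": 93 * DAY, "stale": 10 * DAY}
recent = drop_stale_posts(post_created, now, 7 * DAY)
```

Applying the filter before building the design matrices would shrink the number of columns (or rows, when clustering posts) rather than only the final recommendation sets.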

4.2 Recommendation sets

Another difficulty we encountered was producing reasonably sized recommendation sets. Even if we can produce all of the posts a user could ever be interested in, if they are hidden in a gigantic set of recommendations the user will never find them and we haven't done much to improve the experience. We could use some heuristics to trim down the size of the recommendation set at the risk of losing a few good recommendations. One heuristic could be, as mentioned in the previous section, to omit posts which are more than a few days/weeks old altogether as content goes stale over time. Another approach could be to not trim down the recommendation set at all but rather present the posts to the user in an order which we think would make the best recommendations be the easiest to find. One way to achieve this would be, for instance, to order the posts by popularity (most upvotes).
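
Ordering by popularity and showing only the first few posts can be sketched as below. The upvote counts, the tie-breaking rule (by post id) and the function name are assumptions for this example.

```python
def top_posts(recommended, upvote_counts, t):
    """Return the t most upvoted recommended posts, most popular first.

    Ties are broken by post id so the order is deterministic.
    """
    ranked = sorted(recommended,
                    key=lambda p: (-upvote_counts.get(p, 0), p))
    return ranked[:t]

# Made-up upvote totals for illustration only.
upvote_counts = {"a": 120, "b": 15, "c": 980, "d": 15}
top = top_posts({"a", "b", "c", "d"}, upvote_counts, 3)
```

Setting t to the size of the whole set gives the pure reordering variant, where nothing is discarded and only the presentation changes.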

4.3 Future Work

Aside from the improvements to the input data and the post-processing of the generated recommendations outlined in the previous sections, more work could be done to improve the clustering algorithms themselves. Given our best performing algorithm (clustering posts based on keywords), one easy improvement would be to include the subreddit and originating domain of the post in the feature vector along with the dictionary of words. Another possible improvement would be to assign a score to each selected cluster for a user based on the ratio of downvoted to upvoted posts that cluster contains and select the ones with the highest scores rather than select them all to generate recommendations.
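
One possible form of the cluster scoring idea is sketched below. The exact scoring rule is not specified in the text, so the choice of upvotes / (upvotes + downvotes) per cluster, the function names and the sample data are assumptions made purely for illustration.

```python
from collections import defaultdict

def cluster_scores(user_votes, post_cluster):
    """Score each cluster as upvotes / (upvotes + downvotes) for one user."""
    up = defaultdict(int)
    total = defaultdict(int)
    for post, vote in user_votes.items():
        cluster = post_cluster.get(post)
        if cluster is None or vote == 0:
            continue  # post was not clustered, or carries no vote
        total[cluster] += 1
        if vote > 0:
            up[cluster] += 1
    return {c: up[c] / total[c] for c in total}

def best_clusters(scores, n):
    """The n highest scoring clusters, ties broken by cluster index."""
    return sorted(scores, key=lambda c: (-scores[c], c))[:n]

post_cluster = {"p1": 0, "p2": 0, "p3": 1, "p4": 1, "p5": 2}
user_votes = {"p1": 1, "p2": 1, "p3": 1, "p4": -1, "p5": -1}
scores = cluster_scores(user_votes, post_cluster)
chosen = best_clusters(scores, 2)
```

Recommendations would then be drawn only from the chosen clusters, which trades some recall for smaller recommendation sets.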

5 References

[1] M. F. Porter, "An algorithm for suffix stripping", originally published in Program, 14 no. 3, pp 130-137 (July 1980)

[2] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li, "RCV1: A New Benchmark Collection for Text Categorization Research", Journal of Machine Learning Research 5 (2004) 361-397
