
Improving regularized singular value decomposition for

collaborative filtering

Arkadiusz Paterek
Institute of Informatics, Warsaw University
ul. Banacha 2, 02-097 Warsaw, Poland

ABSTRACT
A key part of a recommender system is a collaborative filtering algorithm predicting users' preferences for items. In this paper we describe different efficient collaborative filtering techniques and a framework for combining them to obtain a good prediction.
The methods described in this paper are the most important parts of a solution predicting users' preferences for movies with an error rate 7.04% better on the Netflix Prize dataset than the reference algorithm Netflix Cinematch. The set of predictors used includes algorithms suggested by Netflix Prize contestants: regularized singular value decomposition of data with missing values, K-means, and postprocessing SVD with KNN. We propose extending the set of predictors with the following methods: addition of biases to the regularized SVD, postprocessing SVD with kernel ridge regression, using a separate linear model for each movie, and using methods similar to the regularized SVD, but with fewer parameters.
All predictors and selected 2-way interactions between them are combined using linear regression on a holdout set.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning;
H.3.3 [Information storage and retrieval]: Information search and retrieval—Information filtering

General Terms
Algorithms, Experimentation, Performance

Keywords
prediction, collaborative filtering, recommender systems, Netflix Prize

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDDCup.07, August 12, 2007, San Jose, California, USA.
Copyright 2007 ACM 978-1-59593-834-3/07/0008 ...$5.00.

1. INTRODUCTION
Recommender systems are very important for e-commerce. If a company offers many products to many clients, it can benefit substantially from presenting personalized recommendations. For example, Greg Linden, developer of Amazon's recommendation engine [6], reported that in 2002 over 20% of Amazon's sales resulted from personalized recommendations. There exist many commercial applications of recommender systems for products like books, movies, music and others. Many applications that are not directly commercial have also emerged: personalized recommendations for websites, jokes [4], Wikipedia articles, etc.
A difficult part of building a recommender system is, knowing the preferences of users for some items, to accurately predict which other items they will like. This task is called collaborative filtering. Most approaches to this task described so far in the literature are variations of K-nearest neighbors (like TiVo [1]) or singular value decomposition (like EigenTaste [4]). Another approach is using graphical models [7, 8]. Articles [3, 7] are examples of comparisons of different collaborative filtering techniques.
In October 2006, the contest Netflix Prize was announced. The goal of the contest is to produce a good prediction of users' preferences for movies. Netflix released a database of over 100 million movie ratings made by 480,189 users. The contest ends when someone submits a solution with a prediction error RMSE (root mean squared error) 10% better than that of the Netflix Cinematch algorithm. For an introduction to the Netflix Prize competition and a description of Netflix Cinematch we direct the reader to the article [2].
This paper describes various collaborative filtering algorithms that work well on the Netflix Prize dataset. Using the approach of combining the results of many methods with linear regression, we obtained a 7.04% better RMSE than Netflix Cinematch on the Netflix Prize competition evaluation set. In section 2 we describe our framework for combining predictions with linear regression. The most effective predictors from our ensemble are described in section 3, including approaches proposed by Netflix Prize contestants: regularized SVD of data with missing values, K-means, and postprocessing the results of regularized SVD with K-NN. In that section we also describe new (to our knowledge) approaches: regularized SVD with biases, postprocessing the results of SVD with kernel ridge regression, building a separate linear model for each movie, and two methods inspired by regularized SVD, but with a lower number of parameters. In section 4 experimental results are presented, which show that combining the proposed predictors leads to a significantly better prediction than using pure regularized SVD. In section 5 we summarize our experiments and discuss possible further improvements.
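The RMSE criterion mentioned above can be made concrete with a short sketch (the function and variable names are illustrative, not contest code):

```python
import math

def rmse(predictions, ratings):
    """Root mean squared error between predicted and true ratings."""
    assert len(predictions) == len(ratings) > 0
    total = sum((p - r) ** 2 for p, r in zip(predictions, ratings))
    return math.sqrt(total / len(ratings))

# Example: a constant prediction near the global mean against three true ratings.
error = rmse([3.6, 3.6, 3.6], [4, 3, 5])
```

A 10% improvement over Cinematch means submitting predictions whose RMSE on qualifying.txt is at most 90% of Cinematch's RMSE.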
2. COMBINING PREDICTORS
In this section we describe how, in the proposed solution, the training and the test set are chosen and how different prediction methods are combined with linear regression.
The Netflix Prize data consists of three files:

• training.txt contains R = 100,480,507 ratings on a scale 1 to 5, for M = 17,770 movies, made by N = 480,189 customers,

• probe.txt contains 1,408,395 user-movie pairs, for which the ratings are provided in training.txt,

• qualifying.txt contains 2,817,131 user-movie pairs, for which we do not know the ratings, but the RMSE of a prediction is computed by the Netflix Prize evaluation system. We can assume probe.txt and qualifying.txt come from the same population of newest ratings.

Summarizing, the user-item matrix for this data has N * M = 8,532,958,530 elements – ca. 98.9% of the values are missing.
Besides ratings, the dataset contains other information, like the dates of the ratings, but we do not use any information for prediction besides the above-mentioned rating data.
Our framework for combining predictions is simple: we draw a random 1.5%-15% of probe.txt as a test set (holdout set). Our training set contains the remaining ratings from training.txt. We train all algorithms on the training set (some methods also occasionally observe the test set error to decide when to stop the optimization of weights). Then the predictions made by each algorithm for the test set are combined with linear regression on the test set. Adding selected two-way interactions between predictors to the regression gives a small improvement.
There is also a possibility of using the data without ratings (qualifying.txt), which carries some information. The article [8] suggests that using this additional data can significantly improve prediction in the Netflix Prize task.
Because the linear regression is made on a small set, the weights obtained are inaccurate. Also, using the test set for linear regression, feature selection and other purposes causes small overfitting. We can improve prediction using a cross-validation-like method: draw randomly a part of probe.txt as the test set, repeat the training and linear regression, do this a few times and average the results. However, each repetition means running all algorithms again on a massive dataset. Because training each one of our algorithms takes much time (0.5-20h), we did not perform cross-validation in the experiments described in section 4.
The 7.04% submission to the Netflix Prize is a result of partial cross-validation. We ran part of our methods a second time on a different test set and confirmed an improvement after merging the results of the two linear regressions.
In the next sections we describe the most effective predictors from our ensemble.

3. PREDICTORS

3.1 Simple predictors
In this section we describe six predictors which are used by the methods from subsections 3.2 (RSVD) and 3.5 (SVD KNN) and also in all experiments described in section 4.
For a given movie j rated by user i, the first five predictors are the empirical probabilities of each rating 1-5 for user i. The sixth predictor is the mean rating of movie j, after subtracting the mean rating of each member.
We will refer to that set of six simple predictors as "BASIC".

3.2 Regularized SVD
Regularized SVD, a technique inspired by effective methods from the domain of natural language processing [5], was proposed for collaborative filtering by Simon Funk (Brandyn Webb) [9]. Simon Funk's description [9] includes a proposition of learning rate and regularization constants, and a method of clipping predictions.
In the regularized SVD, predictions for user i and movie j are made in the following way:

    ŷij = ui^T vj    (1)

where ui and vj are K-dimensional vectors of parameters. The layer of k-th parameters of all vectors ui, vj is called the k-th feature.
The parameters are estimated by minimizing the sum of squared residuals, one feature at a time, using gradient descent with regularization and early stopping. Before training, a simple baseline prediction is subtracted from each rating – a combination of the six predictors described in subsection 3.1, with weights chosen with linear regression.

    rij = yij − ŷij
    uik += lrate * (rij * vjk − λ * uik)
    vjk += lrate * (rij * uik − λ * vjk)

where yij is the rating given by user i for movie j.
We stop training a feature when the error rate on the test set increases. After learning each feature, the predictions are clipped to the [1, 5] range.
The parameters proposed by Simon Funk are difficult to improve, so we leave them unchanged: lrate = .001, λ = .02. We choose the number of features K = 96.
We will refer to this method as "RSVD".

3.3 Improved regularized SVD
We add biases to the regularized SVD model, one parameter ci for each user and one dj for each movie:

    ŷij = ci + dj + ui^T vj    (2)

The weights ci, dj are trained simultaneously with uik and vjk:

    ci += lrate * (rij − λ2 * (ci + dj − global_mean))
    dj += lrate * (rij − λ2 * (ci + dj − global_mean))

Values of the parameters: lrate = .001, λ2 = .05, global_mean = 3.6033.
We will refer to this method as "RSVD2".
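The gradient descent updates of RSVD and RSVD2 above can be illustrated with the following self-contained sketch. The toy ratings, initialization scale, and fixed epoch count are assumptions for illustration only; the paper's one-feature-at-a-time training, early stopping on a holdout set, baseline subtraction, and final clipping are omitted:

```python
import random

# Toy (user, movie, rating) triples; illustrative stand-in for the real data.
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2)]
N, M, K = 3, 3, 4                      # users, movies, features
lrate, lam, lam2 = 0.001, 0.02, 0.05   # constants from subsections 3.2-3.3
global_mean = sum(r for _, _, r in ratings) / len(ratings)

random.seed(0)
u = [[random.gauss(0, 0.1) for _ in range(K)] for _ in range(N)]  # user features
v = [[random.gauss(0, 0.1) for _ in range(K)] for _ in range(M)]  # movie features
c = [0.0] * N                          # user biases c_i
d = [global_mean] * M                  # movie biases d_j

def predict(i, j):
    # Equation (2): biased regularized SVD prediction.
    return c[i] + d[j] + sum(u[i][k] * v[j][k] for k in range(K))

def sse():
    return sum((y - predict(i, j)) ** 2 for i, j, y in ratings)

sse_before = sse()
for epoch in range(2000):              # fixed epochs instead of early stopping
    for i, j, y in ratings:
        rij = y - predict(i, j)        # residual r_ij
        for k in range(K):
            uik = u[i][k]
            u[i][k] += lrate * (rij * v[j][k] - lam * uik)
            v[j][k] += lrate * (rij * uik - lam * v[j][k])
        c[i] += lrate * (rij - lam2 * (c[i] + d[j] - global_mean))
        d[j] += lrate * (rij - lam2 * (c[i] + d[j] - global_mean))
sse_after = sse()
```

With the paper's tiny learning rate the fit improves slowly; on the real data each feature is trained until the holdout error stops decreasing, and the final predictions are clipped to [1, 5].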

3.4 K-means
K-means and K-medians were proposed for collaborative filtering in [7].
Before applying K-means we subtract from each rating the user's mean rating. The K-means algorithm is used to divide users into K clusters Ck, minimizing the intra-cluster variance:

    Σ_{k=1}^{K} Σ_{i∈Ck} ||yi − µk||^2    (3)

where

    ||yi − µk||^2 = Σ_{j∈Ji} (yij − µkj)^2    (4)

and Ji is the set of movies rated by user i.
For each user belonging to cluster Ck the prediction for movie j is µkj.
Our predictor is the mean prediction of an ensemble of 10 runs of K-means, with K ranging from 4 to 24.
We will refer to this method as "KMEANS".

3.5 Postprocessing SVD with KNN
The following prediction method was proposed by an anonymous Netflix Prize contestant.
Let's define the similarity between movies j and j2 as the cosine similarity between the vectors vj and vj2 obtained from the regularized SVD:

    s(vj, vj2) = vj^T vj2 / (||vj|| ||vj2||)    (5)

Now we can use k-nearest neighbor prediction with the similarity s. We use prediction by the one nearest neighbor using similarity s and refer to this method as "SVD KNN".
We also obtained a good quality clustering of items using single linkage hierarchical clustering with the similarity s. Though it was not useful for improving prediction, we mention it because clustering of items can be useful in applications of recommender systems, for example to avoid filling recommendation slots with very similar items.

3.6 Postprocessing SVD with kernel ridge regression
One idea to improve SVD is to discard all weights uik after training and try to predict yij for each user i using vjk as predictors, for example using ridge regression.
Let's redefine y in this section as a vector: the i-th row of the matrix y, with missing values omitted (now y is the vector of ratings of movies rated by user i). Let X be a matrix of observations – each row of X is the normalized vector of features of one movie j rated by user i: xj = vj / ||vj||. For efficiency reasons we limit the number of observations to 500. If a user rated more than 500 movies, we use only the 500 most frequently rated movies.
We can predict y using ridge regression:

    β̂ = (X^T X + λI)^{-1} X^T y    (6)
    ŷi = xi^T β̂    (7)

An equivalent dual formulation involves the Gram matrix XX^T:

    β̂ = X^T (XX^T + λI)^{-1} y    (8)

By changing the Gram matrix to a chosen positive definite matrix K(X, X) we obtain the method of kernel ridge regression. It is equivalent to performing ridge regression in a possibly much higher dimensional space, implicitly defined by the kernel K. Predictions in this method are made in the following way:

    ŷi = K(xi, X) (K(X, X) + λI)^{-1} y    (9)

We can look for a kernel (defining similarity or distance between observations) which will result in better prediction than K(xi, xj) = xi^T xj. We obtained good results with the Gaussian kernel K(xi, xj) = exp(2(xi^T xj − 1)) and the parameter λ = .5.
For each user we perform kernel ridge regression. The observations are vj / ||vj||, for at most the first 500 most frequently rated movies rated by the user. We name the method with the Gaussian kernel "SVD KRR".

3.7 Linear model for each item
For a given item (movie) j we build a weighted linear model, using as predictors, for each user i, a binary vector indicating which movies the user rated:

    ŷij = mj + ei * Σ_{j2∈Ji} wj2    (10)

where Ji is the set of movies rated by user i, the constant mj is the mean rating of movie j, and the constant weights are ei = (|Ji| + 1)^{-1/2}. The model parameters are learned using gradient descent with early stopping.
We name this method "LM".

3.8 Decreasing the number of parameters
The regularized SVD model has O(NK + MK) parameters, where N is the number of users, M is the number of movies, and K is the number of features. One idea to decrease the number of parameters is, instead of fitting ui for each user separately, to model ui as a function of a binary vector indicating which movies the user rated. For example uik ≈ ei * Σ_{j∈Ji} wjk, where Ji is the set of movies rated by user i (possibly including movies for which we do not know the ratings, e.g. from qualifying.txt) and the constant weights are ei = (|Ji| + 1)^{-1/2}, like in the previous section. This model has O(MK) parameters:

    ŷij = ci + dj + ei * Σ_{k=1}^{K} vjk Σ_{j2∈Ji} wj2k    (11)

where Ji is the set of movies rated by user i.
The second proposed model is the following:

    ŷij = ci + dj + Σ_{k=1}^{K} vjk Σ_{j2∈Ji} vj2k    (12)

The parameters vjk and wjk are merged and there are no constant weights ei.
In both models the parameters are learned using gradient descent with regularization and early stopping, similarly to the regularized SVD.
We name the first method "NSVD1" and the second "NSVD2".
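The kernel ridge regression postprocessing of subsection 3.6 (equation (9) with the Gaussian kernel) can be sketched as follows. The random feature matrix and ratings are stand-ins for one user's normalized SVD movie vectors vj / ||vj|| and centered ratings, and a plain Gaussian-elimination solver is included so that the sketch has no dependencies:

```python
import math
import random

lam = 0.5  # ridge parameter from subsection 3.6

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def gaussian_kernel(a, b):
    # K(xi, xj) = exp(2 * (xi . xj - 1)) on normalized vectors
    dot = sum(x * y for x, y in zip(a, b))
    return math.exp(2.0 * (dot - 1.0))

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        p = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[p] = M[p], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for cc in range(col, n + 1):
                M[r][cc] -= f * M[col][cc]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][cc] * z[cc] for cc in range(r + 1, n))) / M[r][r]
    return z

random.seed(1)
X = [normalize([random.gauss(0, 1) for _ in range(4)]) for _ in range(6)]  # movie features
y = [random.gauss(0, 1) for _ in range(6)]                                 # centered ratings

# Dual coefficients: alpha = (K(X, X) + lam * I)^(-1) y, as in equation (9).
K = [[gaussian_kernel(xi, xj) for xj in X] for xi in X]
alpha = solve([[K[i][j] + (lam if i == j else 0.0) for j in range(6)]
               for i in range(6)], y)

def predict(x_new):
    # Prediction is a kernel-weighted combination of the user's known ratings.
    x_new = normalize(x_new)
    return sum(gaussian_kernel(x_new, xi) * a for xi, a in zip(X, alpha))
```

With the linear kernel K(xi, xj) = xi^T xj this reduces to ordinary ridge regression in the dual form of equation (8).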

4. RESULTS
Table 1 summarizes the results of experiments with the methods described in the previous sections.
Combining the results of the RSVD2 method with the six simple predictors called BASIC gives RMSE .9039 on the test set and .9070 (4.67% improvement over Netflix Cinematch) on qualifying.txt, as reported by the Netflix Prize evaluation system. Linear regression with all predictors from the table gives RMSE .8877 on the test set and .8911 (6.34% improvement) on qualifying.txt.
The predictors described in this paper are parts of a solution which scores .8844 on the qualifying dataset – that is a 7.04% improvement over Netflix Cinematch. The solution submitted to the Netflix Prize is the result of merging in proportion 85/15 two linear regressions trained on different training-test partitions: one linear regression with 56 predictors (most of them different variations of regularized SVD and postprocessing with KNN) and 63 two-way interactions, and the second one with 16 predictors (a subset of the predictors from the first regression) and 5 two-way interactions. In the first regression the test set is a random 15% of probe.txt, and in the second – 1.5% of probe.txt.
All experiments were done on a PC with a 2GHz processor and 1.2GB RAM. Running times varied from 45min for SVD KNN to around 20h for RSVD2.

Table 1: Linear regression results - RMSE on the test set

                     Test RMSE     Test RMSE      Cumulative
  Predictor          with BASIC    with BASIC     test RMSE
                                   and RSVD2
  BASIC              .9826         .9039          .9826
  RSVD               .9094         .9018          .9094
  RSVD2              .9039         .9039          .9018
  KMEANS             .9410         .9029          .9010
  SVD KNN            .9525         .9013          .8988
  SVD KRR            .9006         .8959          .8933
  LM                 .9506         .8995          .8902
  NSVD1              .9312         .8986          .8887
  NSVD2              .9590         .9032          .8879
  SVD KRR * NSVD1    —             —              .8879
  SVD KRR * NSVD2    —             —              .8877

5. CONCLUSIONS
We described a framework for combining predictions and described methods that, combined together, give a good prediction for the Netflix Prize dataset.
Possible further improvements of the presented solution:

• apply the cross-validation-like solution described in section 2 – repeat the calculations on different training-test partitions and merge the results,

• add different efficient predictors to the ensemble. Good candidates are methods already applied with success to collaborative filtering: Restricted Boltzmann Machines [8] and other graphical models [7].

6. ACKNOWLEDGMENTS
Thanks to Netflix for releasing their data and the organization of the Netflix Prize. Thanks to Simon Funk for sharing his approach of using regularized singular value decomposition. Also, I would like to thank Piotr Pokarowski for the course Statistics II.

7. REFERENCES
[1] K. Ali and W. van Stam. TiVo: making show recommendations using a distributed collaborative filtering architecture. In W. Kim, R. Kohavi, J. Gehrke, and W. DuMouchel, editors, KDD, pages 394–401. ACM, 2004.
[2] J. Bennett and S. Lanning. The Netflix Prize. Proceedings of KDD Cup and Workshop, 2007.
[3] J. S. Breese, D. Heckerman, and C. M. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. In G. F. Cooper and S. Moral, editors, UAI, pages 43–52. Morgan Kaufmann, 1998.
[4] K. Y. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: A constant time collaborative filtering algorithm. Inf. Retr., 4(2):133–151, 2001.
[5] G. Gorrell and B. Webb. Generalized Hebbian algorithm for incremental latent semantic analysis. Proceedings of Interspeech, 2006.
[6] G. Linden, B. Smith, and J. York. Amazon.com recommendations: Item-to-item collaborative filtering. IEEE Internet Computing, 7(1):76–80, 2003.
[7] B. Marlin. Collaborative filtering: a machine learning perspective. M.Sc. thesis, 2004.
[8] R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann Machines for collaborative filtering. Proceedings of the 24th International Conference on Machine Learning, 2007.
[9] B. Webb. Netflix update: Try this at home. http://sifter.org/~simon/journal/20061211.html, 2006.