Abstract
In this paper, we build a movie rating prediction system using training sets selected from
the MovieLens data. We apply two dimensionality reduction approaches: K-means clustering
and a latent factor model trained with Stochastic Gradient Descent (SGD). The system
predicts the rating that a user would give to a movie he/she has not yet rated. RMSE
(Root-Mean-Squared-Error) is used as the evaluation criterion to analyze the performance
of the two algorithms. We also vary the parameters of the Stochastic Gradient Descent
method to analyze the effect of each parameter on the rating results. Finally, the
performance of the two algorithms is compared and analyzed.
Introduction
Movies are becoming an increasingly important form of entertainment in people's lives.
There are many movies to choose from, and predicting which ones a person may like is an
interesting problem. In this project, we build a movie rating prediction system: once a
user has reviewed some movies, the system predicts that user's ratings for the movies
he/she has not reviewed. The result may be used for recommendations on a video website.
We use the MovieLens 100k dataset, collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 1997 to April 1998.
The data includes 100,000 ratings from 943 users on 1682 movies. 90% of the data is
used as training data to build the movie rating prediction model with the
dimensionality reduction algorithms. The other 10% is used as a test set to evaluate
the model using the RMSE criterion. The dataset records different users' ratings for
different movies (rating scores from 1 to 5). It consists of:
• 100,000 ratings (1 - 5) from 943 users on 1682 movies (90% would be 90000 ratings)
• The genres of the 1682 movies
The two data sets are stored as matrices: the ratings are denoted as R and the movie
genres are denoted as M.
[Figure: structure of the matrices R and M]
The problem addressed in this paper is how to predict a particular user's rating for a
movie that he/she may not have seen yet. We aim to estimate these missing ratings so
that they can be used to recommend personalized movies to different users.
Method
Data Pre-processing and Problem Formulation
As mentioned before, the data has 100,000 ratings in total, split into a 90% training
set and a 10% test set. The training and test sets are pre-processed as follows: users
are assigned user IDs from 1 to 943, and movies are assigned item IDs from 1 to 1682.
The data can therefore be represented as a 1682 × 943 matrix whose entries are the
corresponding ratings. This matrix is very sparse, with many missing ratings.
\[
R_{u,m} =
\begin{cases}
r_{u,m}, & \text{existing ratings} \\
0, & \text{missing ratings}
\end{cases}
\]
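As a rough illustration of this pre-processing step, the sketch below builds the movies × users matrix from (user ID, movie ID, rating) triples, with 0 marking a missing rating. The function name `build_rating_matrix` and the toy triples are assumptions made for the example; they are not taken from the MovieLens files or from our implementation.

```python
import numpy as np

def build_rating_matrix(ratings, n_movies, n_users):
    """Return an n_movies x n_users matrix; 0 marks a missing rating.

    `ratings` is an iterable of (user_id, movie_id, rating) with 1-based IDs.
    """
    R = np.zeros((n_movies, n_users))
    for user_id, movie_id, rating in ratings:
        # IDs start at 1 in the data, array indices start at 0.
        R[movie_id - 1, user_id - 1] = rating
    return R

# Hypothetical toy triples standing in for the real rating records.
toy_ratings = [(1, 1, 5), (1, 3, 3), (2, 2, 4), (3, 1, 1)]
R = build_rating_matrix(toy_ratings, n_movies=3, n_users=3)

# Fraction of entries that are missing (zero).
sparsity = 1.0 - np.count_nonzero(R) / R.size
```

For the full data set, 100,000 ratings in a matrix of 1682 × 943 = 1,586,126 cells means that only about 6.3% of the entries are observed.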
Our task is thus to fill in all the zero entries of the matrix using the training data.
Once all the zero entries are filled in, the widely used RMSE
(Root-Mean-Squared-Error) criterion is applied to the test set to evaluate the
performance of our algorithms. Specifically, the RMSE is calculated on the test set by
comparing the real ratings with our predicted ratings using the following
formula:
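\[
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2}
\]
where n is the number of test ratings, $Y_i$ is the real rating, and $\hat{Y}_i$ is the
predicted rating.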
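The RMSE criterion can be sketched in a few lines. The helper name `rmse` and the four toy ratings are assumptions chosen for illustration only.

```python
import numpy as np

def rmse(actual, predicted):
    """Root-mean-squared error between real and predicted ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Hypothetical test ratings and predictions.
actual = [4, 3, 5, 1]
predicted = [3.5, 3.0, 4.0, 2.0]
score = rmse(actual, predicted)  # errors 0.5, 0, 1, -1 give sqrt(2.25 / 4) = 0.75
```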
3. After that, we group the rows of this matrix by user ID and calculate each user's
average rating for each cluster.
4. Once the average rating of each user for each cluster has been computed, the rating
matrix R can be filled in as follows. For a user with user ID u and a movie with
movie ID m that user u has not watched before, the predicted rating R_um is the
average rating of user u on the cluster to which movie m belongs.
5. The final step is to apply the RMSE formula given above to evaluate the result.
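Steps 3 and 4 can be sketched as follows, assuming the cluster label of every movie is already available as an array `labels`; how the movies are clustered is not reproduced here. The function name and the fallback to a user's overall mean (for a user with no rating in a cluster) are assumptions of this sketch, not part of the method described above.

```python
import numpy as np

def fill_with_cluster_means(R, labels, n_clusters):
    """Fill missing entries of R (movies x users, 0 = missing) with the
    user's average rating over the cluster that the movie belongs to."""
    R = np.asarray(R, dtype=float)
    labels = np.asarray(labels)
    observed = R > 0
    filled = R.copy()

    # Assumed fallback: the user's overall mean, or the global mean if the
    # user has no ratings at all.
    counts_all = observed.sum(axis=0)
    global_mean = R[observed].mean()
    user_mean = np.where(counts_all > 0,
                         R.sum(axis=0) / np.maximum(counts_all, 1),
                         global_mean)

    for c in range(n_clusters):
        in_c = labels == c
        counts = observed[in_c].sum(axis=0)
        sums = R[in_c].sum(axis=0)
        # Per-user average rating on cluster c.
        cluster_mean = np.where(counts > 0,
                                sums / np.maximum(counts, 1),
                                user_mean)
        block = filled[in_c]
        missing = ~observed[in_c]
        block[missing] = np.broadcast_to(cluster_mean, block.shape)[missing]
        filled[in_c] = block
    return filled

# Toy example: 4 movies in 2 clusters, 2 users.
R_toy = np.array([[5, 0],
                  [0, 2],
                  [4, 0],
                  [0, 0]])
filled = fill_with_cluster_means(R_toy, labels=[0, 0, 1, 1], n_clusters=2)
```

In the toy example, the second user has no rating in the second cluster, so the sketch falls back to that user's overall mean of 2.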
where $R = Q P^T$ and
• $r_{xi}$ is the rating of user x for movie i, i.e. the entry of the rating matrix R in
row i and column x
• $q_i$ is the ith row of matrix Q
• $p_x$ is the xth column of matrix $P^T$
• k is the number of columns in matrix Q
The first idea is to apply SVD to the training matrix R to obtain P and Q. This
minimizes the SSE on the training set. However, a model trained only to minimize the
training error will over-fit and will not generalize well to the test data.
\[
R = U \Sigma V^T = Q P^T, \quad \text{so that } Q = U, \; P^T = \Sigma V^T
\]
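The factorization above can be illustrated with NumPy's `numpy.linalg.svd`. This is only a sketch on a small toy matrix: the helper name `svd_factors` and the truncation to the first k singular values are assumptions added here to show how k controls the reconstruction error.

```python
import numpy as np

def svd_factors(R, k):
    """Return Q (movies x k) and P (users x k) with Q @ P.T approximating R."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    Q = U[:, :k]                        # Q = U, truncated to k columns
    Pt = np.diag(s[:k]) @ Vt[:k, :]     # P^T = Sigma V^T, truncated to k rows
    return Q, Pt.T

# Toy 4 x 3 rating matrix with zeros for missing ratings.
R_toy = np.array([[5.0, 3.0, 0.0],
                  [4.0, 0.0, 1.0],
                  [0.0, 2.0, 5.0],
                  [1.0, 0.0, 4.0]])

Q_full, P_full = svd_factors(R_toy, k=3)   # full rank: exact reconstruction
Q2, P2 = svd_factors(R_toy, k=2)           # rank 2: approximate reconstruction

err_full = np.linalg.norm(R_toy - Q_full @ P_full.T)
err_2 = np.linalg.norm(R_toy - Q2 @ P2.T)
```

Note that the SVD treats the zeros as real ratings, which is one reason the exact fit does not generalize to the held-out data.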
To address over-fitting, intuitively we should not use a large k (the number of columns
of matrix Q), since the training matrix R has many missing ratings. We therefore
introduce regularization, which penalizes large factor values and so limits the
effective number of factors. The objective is shown below; the more data is missing
from the training matrix, the more sharply the effective k should shrink. $\lambda$ is
the regularization parameter, and $\mu$ is the learning rate used in the gradient
descent updates.
\[
\min_{P,Q} \sum_{(x,i) \in \text{training}} (r_{xi} - q_i \cdot p_x)^2
+ \lambda \Big( \sum_i \|q_i\|^2 + \sum_x \|p_x\|^2 \Big)
\]
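To minimize this objective, the Stochastic Gradient Descent method is applied:
1. Initialize P and Q by applying the SVD to the rating matrix (with missing values):
\[
R = U \Sigma V^T = Q P^T, \quad Q = U, \quad P^T = \Sigma V^T
\]
2. Then iterate over the ratings in the training matrix R and update P and Q:
\[
\varepsilon_{xi} = 2 (r_{xi} - q_i \cdot p_x)
\]
\[
q_i \leftarrow q_i + \mu (\varepsilon_{xi} \, p_x - \lambda \, q_i)
\]
\[
p_x \leftarrow p_x + \mu (\varepsilon_{xi} \, q_i - \lambda \, p_x)
\]
Here the constant factor from differentiating the regularization term is absorbed into
$\lambda$.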
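The two steps above can be sketched as follows. This is a minimal illustration on a toy matrix, not our actual implementation: the function names, the random visiting order, and the values chosen for `k`, `lr` (the learning rate μ), `lam` (the regularization parameter λ) and `n_iters` are all assumptions made for the example.

```python
import numpy as np

def sgd_factorize(R, k, lr, lam, n_iters, seed=0):
    """Factor R (movies x users, 0 = missing) as Q @ P.T with SGD."""
    rng = np.random.default_rng(seed)
    # Step 1: SVD initialization, Q = U and P^T = Sigma V^T (first k factors).
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    Q = U[:, :k].copy()
    P = (np.diag(s[:k]) @ Vt[:k, :]).T.copy()
    movies, users = np.nonzero(R)  # only existing ratings are visited
    # Step 2: repeated passes over the training ratings.
    for _ in range(n_iters):
        for idx in rng.permutation(len(movies)):
            i, x = movies[idx], users[idx]
            eps = 2.0 * (R[i, x] - Q[i] @ P[x])
            q_old = Q[i].copy()  # keep the old q_i for the p_x update
            Q[i] += lr * (eps * P[x] - lam * Q[i])
            P[x] += lr * (eps * q_old - lam * P[x])
    return Q, P

def observed_sse(R, Q, P):
    """Sum of squared errors over the existing (non-zero) ratings only."""
    mask = R > 0
    return float(((R - Q @ P.T) ** 2)[mask].sum())

R_toy = np.array([[5.0, 3.0, 0.0],
                  [4.0, 0.0, 1.0],
                  [0.0, 2.0, 5.0],
                  [1.0, 0.0, 4.0]])

# n_iters=0 returns the SVD initialization, used here as a baseline.
Q0, P0 = sgd_factorize(R_toy, k=2, lr=0.005, lam=0.01, n_iters=0)
Q, P = sgd_factorize(R_toy, k=2, lr=0.005, lam=0.01, n_iters=40)

sse_before = observed_sse(R_toy, Q0, P0)
sse_after = observed_sse(R_toy, Q, P)
```

Because Q = U has small entries while P carries the singular values, the step size that remains stable depends on the scale of P; a learning rate that works on a toy matrix may be too large for the full data set.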
The learning rate needs to be chosen carefully to obtain a good solution to this
problem. To choose it, we tried values from 0.01 to 0.1 and calculated the SSE on the
training data for each learning rate.
Having found the optimal learning rate, we apply the SGD model to the training data to
obtain P and Q, reconstruct the matrix R, and calculate the RMSE using the method
described above.
Finally, we apply different values of $\lambda$ and $k$ and calculate the SSE on both
the training and test sets. Examining these errors shows clearly that introducing
regularization reduces over-fitting, which is consistent with the intuition mentioned
before.
Results
K-means clustering Algorithm
Three clusters gave the best result, with an RMSE of 1.15.
the training error has a minimum. With the optimal learning rate, we apply the
stochastic gradient descent method described above to obtain the matrices P and Q and
reconstruct the matrix R. The resulting RMSE is 0.95 after 40 iterations.
Comparison
Clearly, the stochastic gradient descent method gave a better (lower) RMSE, which can be
explained in two ways. First, k-means clustering considered only the movie genres when
clustering the movies, so each user is predicted to give the same rating to all movies
belonging to the same cluster. In contrast, the Stochastic Gradient Descent method took
the relationship between users and movies into consideration. Second, whereas the
k-means method is parameterized by the number of clusters, in the Stochastic Gradient
Descent method the parameter k, the number of columns of matrix Q, denotes the number of
factors considered for each movie. Therefore, while k-means used only 3 clusters, the
SGD method considered 20 factors for each movie.
Conclusions
To conclude, this project has successfully computed the missing values of a very sparse
matrix using dimensionality reduction algorithms. K-means clustering and stochastic
gradient descent were applied, and the latter gave the better result. The results of
this work can be used to recommend movies to users based on the predicted rating of a
user for a particular movie. In the future, SGD (Stochastic Gradient Descent) can be
explored further by applying different parameters to optimize the results.