Mounir M. Bendouch
Erasmus Universiteit Rotterdam
Abstract
3 Data
4 Methodology
4.1 Feature Extraction
4.1.1 Natural Language Processing
4.1.2 Computer Vision
4.2 Domain Ontology
4.3 Scaling
4.4 Preparation
4.5 Similarity Model
4.5.1 Definition
4.5.2 Related Features
4.5.3 Logistic Transformation
4.5.4 Gradient
4.6 Experimental Set-up
4.6.1 Sampling
4.6.2 Optimization
4.6.3 Evaluation
5 Results
5.1 Optimization
5.2 Evaluation
6 Conclusion
Acknowledgements
I thank Flavius for his excellent supervision and effort to always review my work
carefully. Thanks to my friend Sarunas for his help and advice. I especially
thank my father for sparking my interest in mathematics, and my mother for her
amazing support.
1 Introduction
Vast amounts of information have become available through the emergence of the
Web [35]. The total amount of data has experienced an accelerating increase ever
since. For example, in just the two years leading up to 2013, the combined size of
all data on the Web grew by a factor of 10 [25]. The vastness of the Web enables
users to explore an immense variety of content, and virtually every niche and
taste for articles, movies, music, and so on has become just mouse-clicks away.
However, this abundance of choice comes at the price of information overload,
and finding the right content has become exceedingly time-consuming.
Recommender systems [21] have emerged as a solution to this problem, filling
the need to filter and deliver relevant content to the user by sorting through large
amounts of information and presenting the most interesting selection in the form
of recommendations. This goes beyond plain information retrieval systems such
as search engines because recommender systems incorporate the user's preferences,
interests, and needs into the filtering process. Based on this information,
they attempt to predict the rating or preference the user would give to each of
the unseen items under consideration, and recommend those for which this prediction
is highest. As these systems are now widely used in areas such as movies,
news, articles, and e-commerce, they have become increasingly relevant.
High performance recommender systems can be invaluable to an on-line con-
tent provider by increasing user satisfaction, because content that better matches
individual user preferences can be recommended. For advertisement-driven or
pay-per-view businesses, this can boost revenues substantially by increasing
viewing time and clicks. For subscription-based businesses, the increased satisfaction
can lead to higher popularity and loyalty.
Two different approaches to recommender systems [21] can be distinguished:
collaborative filtering and content-based filtering. These two types of systems
differ in the data and underlying assumptions they use for their predictions.
Collaborative and content-based filters can also be integrated into a single system;
these are called hybrid systems. Collaborative filtering approaches use
information on which items were selected or rated by the user in the past, and compare
this to the history of other users. The main assumption of collaborative filtering
is that if two users have similar opinions on a particular issue, they will likely
have a similar opinion on another issue. In other words, collaborative filtering
matches users with similar tastes and uses this information to make predictions.
Content-based filtering, on the other hand, uses similarities between the content
of items as opposed to similarities between different users. In other words, it uses
information about the items themselves, and assumes that users like items that
have similar contents, independent of the opinions of others. The unseen items
that are most similar to the items in the user profile are therefore recommended
to the user. The information about the item contents is typically represented in
2.1.1 TF-IDF
The most frequently used feature vector is called Term Frequency - Inverse Doc-
ument Frequency (TF-IDF) and such vectors are generally compared through
cosine distance to obtain similarities. To construct the feature vector for a
document, we calculate, for each term in the text, the term frequency (TF) and
multiply this by its inverse document frequency (IDF) across all documents. TF is
a count of each term in the document, and this count is then normalized for each
document by considering the total number of terms in that document. Therefore,
the TF-value for a specific term is an indicator of the relative importance of this
term in the document. The other value, IDF, is the inverse of the count of each
term in the total collection of documents, indicating the overall uniqueness of the
term. This value is constant over all documents and can be seen as a weight that
gives relative importance to rare terms. A pre-processing step before counting
terms is performed to remove noise and increase performance. Stop words are
removed and all other words are lemmatized, so that all words with the same
root are considered to be the same term [11].
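The construction above can be sketched in a few lines of Python; the toy corpus and helper names are ours, and the input is assumed to be already stop-word filtered and lemmatized:

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: TF*IDF} vector per document."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    idf = {t: math.log(n / df[t]) for t in df}          # inverse document frequency
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in counts.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse {term: value} vectors."""
    dot = sum(v * b[t] for t, v in a.items() if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Note that a term occurring in every document gets an IDF of zero and thus drops out of the cosine, consistent with the weighting behaviour described above.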
2.1.2 CF-IDF+
The Concept Frequency - Inverse Document Frequency (CF-IDF) is comparable
to TF-IDF, but it uses concepts instead of terms. The text is processed by a
natural language processing (NLP) engine that performs word sense disambigua-
tion, part-of-speech (POS) tagging, and tokenization to transform the text into
a collection of concept candidates. A domain ontology containing concepts and
their relationships is checked for each candidate, and if a match is found, a count
is added to that concept. Using concepts instead of terms represents the domain
semantics better because it only considers words that are relevant to the specific
domain, and it subsequently results in an observed performance improvement
over TF-IDF [11]. CF-IDF+ extends this method further by, for each concept
found in the text, adding the concepts to the vector that are directly related
in the domain ontology [7, 31]. Each type of relationship (superclass, subclass,
or instance) is given a weight to vary the overall importance of the found con-
cepts and their related concepts. These three weights are then optimized by grid
search. Including the related concepts can add more domain semantics to the
feature vector.
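For illustration, the idea of adding related concepts with per-relation weights could be sketched as below. All names are ours, and the additive combination of direct and related concept values is our simplification; the original CF-IDF+ work makes its own design choices (e.g. optimizing the three relation weights by grid search):

```python
def cf_idf_plus(concept_counts, idf, neighbors, rel_weights):
    """concept_counts: {concept: frequency} found directly in the text.
    neighbors: {concept: [(related_concept, relation_type), ...]} from the ontology.
    rel_weights: weight per relation type, e.g. superclass/subclass/instance."""
    vec = {c: f * idf.get(c, 0.0) for c, f in concept_counts.items()}
    for c, f in concept_counts.items():
        for related, rel in neighbors.get(c, []):
            # add each directly related concept, weighted by its relation type
            # (additive combination is our assumption)
            vec[related] = vec.get(related, 0.0) + rel_weights[rel] * f * idf.get(related, 0.0)
    return vec
```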
2.1.3 SF-IDF+
The same principle of CF-IDF is applied with Synset Frequency - Inverse Docu-
ment Frequency (SF-IDF) [2], by matching terms to synsets in a semantic lexicon,
in this case WordNet. This generally results in a longer vector than CF-IDF be-
cause more matches are found as WordNet is much larger than a typical domain
ontology. As CF-IDF+ adds related concepts, so does SF-IDF+ add synsets that
are directly related in WordNet [18]. There are 27 types of semantic relationships
in WordNet, and as in CF-IDF+, every type has a weight that is optimized, but
by a genetic algorithm due to the much larger search space. SF-IDF+ outper-
forms SF-IDF in [18].
2.1.4 Bing-SF-IDF+
While SF-IDF+ retrieves many synsets from the texts, semantic lexicons do not
contain named entities, leading it to consistently discard this information. Bing-
SF-IDF+ recognizes the named entities from the text and uses them in a separate
similarity measure between each pair of entities, called Bing distance [3]. This
measure is a function of three search result page counts: two counts for each
entity separately and one for a combination of the two. An optimized weight is
used to combine the Bing distance and the SF-IDF+ cosine similarity. This leads
to improved performance over SF-IDF+ [3].
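The text above specifies Bing distance only as a function of three search result page counts. A normalized web distance in the style of the normalized Google distance is one plausible formulation, sketched here purely for illustration; the exact formula used in [3] may differ:

```python
import math

def bing_distance(count_a, count_b, count_ab, total_pages):
    """Normalized web distance from three search result page counts:
    each entity alone (count_a, count_b) and both together (count_ab).
    NGD-style formulation, shown for illustration only."""
    la, lb, lab = math.log(count_a), math.log(count_b), math.log(count_ab)
    return (max(la, lb) - lab) / (math.log(total_pages) - min(la, lb))
```

Two entities that always co-occur get distance 0, while entities that rarely appear on the same page get a large distance.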
For semantics-driven recommenders that are based on concepts and their rela-
tionships, a domain ontology is needed. We will therefore also answer:
And since these recommenders have previously been applied only to small datasets:
3 Data
The problem that we focus on in this research is the large-scale recommendation
of movies, and the GroupLens Research Project at the University of Minnesota¹
makes datasets freely available for this purpose. As scaling up the approach is
part of our research question, we gather user ratings from the largest dataset
available, the MovieLens 20M² dataset, describing 5-star user rating activity
from MovieLens³, an on-line movie recommendation service. It contains a total
of 20,000,000 user ratings on a scale of 1 to 5 across 27,278 movies. It was
collected in the period from 9 January 1995 to 31 March 2015 from the
rating activity of 138,493 users, who had all rated at least 20 movies.
As semantics-driven recommender systems are content-based, they require, in
addition to the user ratings, item-level information as input for feature extraction.
The MovieLens data contains the title, year of release, and genre labels for each
movie. Identification numbers for The Internet Movie Database (IMDB)⁴ are also
attached, which allows us to augment this dataset with movie-level information
from other sources.
We use two other sources to collect movie information, and the first is the
Open Movie Database (OMDb)⁵, which freely provides an API in the form of a
RESTful Web service. As IMDB ids can be used as query keys, we can match the
information to the MovieLens movies. OMDb makes movie posters available only
to patrons, so we choose to query The Movie Database (TMDb)⁶ API, our second
source, for this purpose instead. The posters are freely available to anyone with
a free user account, and can be requested by their IMDB ids.
The combined data contains many movie-level variables, but for this research
we choose to retain only those containing substantial semantic information, as
those could be valuable for semantics-driven recommendations. We believe the
bulk of semantic information to be represented by the names of persons involved
in the movie, the genre(s), the plot, and the poster. The involved persons are the
actor(s), director(s), and writer(s). Table 1 shows the variables we use in this
research together with their descriptive statistics.
OMDb provides us with a plot for 96.51% of the movies in the MovieLens
data, containing 63 words on average. These are the only texts in our data from
which semantic features can be extracted. We therefore discard the 3.49% of
movies for which no plot is available, which leaves us with 26,327 movies. We
notice that the plots are substantially shorter than typical news articles, and this
might reduce the amount of semantic information that can be extracted from
¹ https://grouplens.org
² http://grouplens.org/datasets/movielens/20m/
³ https://movielens.org/
⁴ http://www.imdb.com/
⁵ http://www.omdbapi.com/
⁶ https://www.themoviedb.org/
the texts. The plots describe the storyline without revealing any spoilers and
are similar in their intent to a movie trailer or the description one might find on
the back of a novel. An example of a plot for the first movie in our dataset, the
animation film Toy Story (1995), illustrates this:
A little boy named Andy loves to be in his room, playing with his toys,
especially his doll named Woody. But, what do the toys do when Andy
is not with them, they come to life. Woody believes that he has life (as
a toy) good. However, he must worry about Andy's family moving,
and what Woody does not know is about Andy's birthday party. Woody
does not realize that Andy's mother gave him an action figure known
as Buzz Lightyear, who does not believe that he is a toy, and quickly
becomes Andy's new favorite toy. Woody, who is now consumed with
jealousy, tries to get rid of Buzz. Then, both Woody and Buzz are
now lost. They must find a way to get back to Andy before he moves
without them, but they will have to pass through a ruthless toy killer,
Sid Phillips.
In this specific 146-word plot, terms such as boy, toys, doll, birthday, and play
clearly contain semantic information pointing towards children's movies and
could be predictive of user preferences. Synsets can be extracted by matching
these terms to a semantic lexicon, and their related synsets would likely provide
additional value. Whether these or other terms can be matched to concepts in
a domain ontology for use in CF-IDF and related recommenders is doubtful,
as any ontology containing these concepts would likely have to be too large to
feasibly construct manually. We further notice that the named entities present in
the plots are generally names of fictional characters and would only rarely provide
substantial information in Bing distance evaluations. For example, the Bing
discard any movie without at least one director, actor, writer, and genre.
We conclude that the combined dataset contains substantial semantic infor-
mation, and after discarding all movies that have one or more missing value(s) in
any of the variables, we retain 24,477 movies to be used for this research. We can
reasonably expect these variables to be available to a recommender system and
removing said movies therefore does not impact the general applicability of our
approach. Furthermore, this affects only 0.83% of the user ratings, so it seems
that the discarded movies had few ratings to begin with. We believe that Bing
distance metrics can be applied to the persons involved with the movie, such as
the directors, actors, or writers as they are named entities and similarities among
them are relevant to user preferences, in contrast to character names from the
plots. The titles of the movies contain a negligible number of terms or synsets,
but can nevertheless be used as named entities to calculate Bing distances.
In this research, we already represent all these entities as concepts in a domain
ontology, and we also note that the large number of named entities makes the
pair-wise search queries infeasible. We therefore decide to exclude Bing distances
from our recommender system and focus on semantics from terms, concepts, and
synsets, in line with TF-IDF [11], CF-IDF(+) [7, 11, 31], and SF-IDF(+) [3].
4 Methodology
This section covers our research methods, starting with the extraction of semantic
features from both plots and posters. We then describe how we can find related
concepts without the need for an external domain ontology. Subsequently we
define the similarity model, and show how a small modification allows massive
scalability. We derive the gradient and explain its use in stochastic gradient
descent (SGD). The section is then completed with a description of our procedure
for sampling observations, training, validation, and evaluation.
algorithm, to each word. WSD addresses the problem of identifying the sense of
a word, i.e. its meaning in context. For example, the noun bank has multiple
very different meanings, and the intended meaning in a text has to be judged from
the context in which it is used. Adjusted Lesk does this by calculating a similarity
between the context (sentence) of the word in the text and the definition of each
sense of the word from the dictionary (in our case WordNet). The sense with the
highest similarity is then identified. We consider only senses that have the same
POS tag as the word from the text. Only if no such sense can be found do we consider
senses with any POS. The synset containing the identified sense of the word
is extracted.
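A minimal, dictionary-agnostic sketch of this overlap idea (a simplified Lesk; the actual implementation uses WordNet glosses, Adjusted Lesk scoring, and the POS filtering described above):

```python
def simplified_lesk(context, senses):
    """Pick the sense whose dictionary definition shares the most words with
    the context sentence. senses: {sense_id: definition string}."""
    ctx = set(context.lower().split())
    return max(senses, key=lambda s: len(ctx & set(senses[s].lower().split())))
```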
each node is fully connected to each node of the next layer. In other words, the
signals are fed to an input layer, propagated through multiple hidden layers, to an
output layer. The number of hidden layers is often referred to as the depth of the
model and ANNs with many hidden layers are also called deep neural networks
(DNNs). Because of this regular connection pattern, the weights connecting one
layer to the next can be represented in a matrix, which allows for efficient parallel
computations. We explain this by considering one layer, say layer 1, of $d_1$ nodes,
and assuming we know its vector $\vec{a}_1$ of $d_1$ output activations. We further assume
that layer 2 has $d_2$ nodes. We denote by $F_2(\cdot)$ the activation function of the nodes
in layer 2. We introduce a $d_2 \times d_1$ weights matrix $W_2$ and a bias vector $\vec{b}_2$ of length
$d_2$, both consisting of learnable parameters (weights). The output activations $\vec{a}_2$
of layer 2 are then calculated as follows:

$$\vec{a}_2 = F_2(W_2 \vec{a}_1 + \vec{b}_2)$$
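In NumPy this single-layer computation is a one-liner; the tanh default below is an arbitrary illustrative choice:

```python
import numpy as np

def layer_forward(a1, W2, b2, F2=np.tanh):
    """One fully connected layer: a2 = F2(W2 @ a1 + b2)."""
    return F2(W2 @ a1 + b2)
```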
In order to use the image matrix as input for an MLP, it needs to be flat-
tened to a vector of length 3hw, because the layers arrange the nodes in only one
dimension. This also means that the model does not scale well to high-resolution
images as the number of connections between the input layer and the fully con-
nected hidden layer explodes. Furthermore, the spatial information of the pixels
is lost and has to be implicitly learned through training.
Convolutional neural networks (CNNs) are types of feed-forward ANNs that
can make use of spatial ordering through arranging their layers of nodes in three
dimensions. Each layer can be seen as a collection of different learnable filter
kernels that are convolved across the height and width of the entire image and
extend through the full depth - which is 3 for RGB images. Each filter is applied
to the full image, with the same weights, so the number of weights are indepen-
dent of the image size. A convolutional layer contains multiple of these filters,
and each filter outputs a 2-dimensional map of the results of its convolutions over
the image. If the filters are of shape $k \times k$, we say that the receptive field is k
pixels, and usually $k \in \{3, 5, 7\}$. The resulting feature maps from the first layer
will be slightly reduced to spatial size $(h - k + 1) \times (w - k + 1)$, as the filters can
only be applied to full $k \times k$ pixel regions. If we have n filters in a layer, we can represent
the feature maps in an $n \times (h - k + 1) \times (w - k + 1)$ matrix to be used for the next
layer, where again different filters can be applied, this time with depth n. This
gradually reduces the size of the spatial activation matrix as it is propagated
through the layers. Using a stride of s in the convolutions means the filters are
applied only to every s-th pixel, so the resulting maps are reduced more rapidly, to
$((h - k)/s + 1) \times ((w - k)/s + 1)$. Another operation called zero-padding adds a
border of p pixels to each of the four sides of the spatial input. If zero-padding
is applied with $p = (k - 1)/2$ pixels on each side, the output map will remain the
same size as the input map. In between convolutional layers, max-pooling layers
can be used, which slice an $n \times h \times w$ spatial input into tiles of size $m \times m$ (commonly
$m = 2$), over which only the maximum value is propagated. The output
of a max-pooling layer is therefore of size $n \times h/m \times w/m$. These introduce the benefit
of reducing the size of the maps and providing some form of spatial invariance.
After some number of hidden convolutional and max-pooling layers, the output is
flattened to an nhw-length visual feature vector, where h and w are substantially
smaller than the original input image size. The flattened layer connects, usually
through additional fully connected layers, to the vector-valued output layer. This
output layer can for example have a softmax activation function in the case of
image classification.
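The spatial-size arithmetic above can be captured in two small helper functions (names are ours):

```python
def conv2d_out(h, w, k, s=1, p=0):
    """Spatial output size of a k x k convolution with stride s and zero-padding p."""
    return ((h - k + 2 * p) // s + 1, (w - k + 2 * p) // s + 1)

def maxpool_out(h, w, m=2):
    """Spatial output size of non-overlapping m x m max-pooling."""
    return (h // m, w // m)
```

For example, a 3 × 3 convolution with p = 1 leaves a 224 × 224 input unchanged, while 2 × 2 max-pooling halves it to 112 × 112.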
Figure 2: Crop three windows of 224 × 224, predict class (synset) probabilities
for each; feature values are the maximum probabilities of each synset.
classify an image into 1,000 categories that are each represented by a synset. In 2014,
the highest-performing submission was a 19-layer deep convolutional neural network
from the Visual Geometry Group of the University of Oxford [24], called
VGG19. On the test set, the top-5 predictions included the correct class for 81.1%
of the images. Human performance on this metric is estimated to be around
88-95% [23]. The trained parameters for this model are publicly available on the
Web page¹². VGG19's convolutional layers each have a filter size of 3 × 3, and the
input to each of those layers is zero-padded with p = 1 such that the outputs are
of equal spatial dimensions. Down-sampling occurs only through max-pooling
layers. Two fully connected layers are added and connected to a 1,000-dimensional
softmax output layer. As substantial semantic content of the posters can
be described by the objects that can be recognized from them, we can use VGG19
to extract meaningful synset vectors. The model takes a 224 × 224 colour image
as input, represented as a 224 × 224 × 3 matrix of RGB pixel values. It outputs
a vector of 1,000 probabilities, one for each synset. The posters are larger than
224 × 224 pixels and are higher than they are wide, so we first downscale them
while keeping the aspect ratio such that the width becomes 224. The height is
then still larger than 224 but never larger than 3 × 224, so we can take 3 vertically
overlapping 224 × 224 windows of the poster as inputs to ensure every part of the
image is covered (Fig. 2). We evaluate the model on each window, after which
we take the maximum of the 3 output values for each class (synset). We apply
this procedure to the posters to obtain feature vectors of 1,000 synset values.
¹² http://www.robots.ox.ac.uk/vgg/research/
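The windowing-and-maximum procedure can be sketched as follows; `predict` stands in for the VGG19 forward pass, and the helper names are ours:

```python
import numpy as np

def poster_windows(img, size=224):
    """Take 3 vertically overlapping size x size windows from an h x size x 3
    poster image, with size <= h <= 3 * size, so every row is covered."""
    h = img.shape[0]
    tops = np.linspace(0, h - size, 3).astype(int)   # evenly spaced window tops
    return [img[t:t + size] for t in tops]

def poster_features(windows, predict):
    """Element-wise maximum over the per-window class-probability vectors."""
    return np.stack([predict(w) for w in windows]).max(axis=0)
```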
where $s(\vec{m}, \vec{c}) = \vec{m} \cdot \vec{c}$ is the scoring function. As [14] first scale the embedding
Table 2: Notation (fragment). $M^a_{b(i,j)} \in \{0, 1\}$: the (i, j)-th element indicates
whether any item related to item i through relation a has concept j.
This also means the sum of row i is the number of concepts found in item i, and
the sum of column j is the number of items in which concept j is found. All these
sums are at least 1 because all concepts occur at least once and every movie has
at least one concept of each class. An example of a matrix $M_c$ in our case is
$M_{\text{director}}$, a binary 25,138 × 12,231 matrix that encodes the directors found in
each of the movies.
We further denote by $\text{concept}_{c(j)}$ the j-th concept of class c and by $\text{item}_i$ the i-th
item. We show that the k matrices $M_c$ can be used to find related
concepts for an item through relations of the form:

$$\text{item}_u \xrightarrow{\text{contains}} \text{concept}_{a(i)} \xrightarrow{\text{occurs in}} \text{item}_v \xrightarrow{\text{contains}} \text{concept}_{b(j)} \quad (3)$$
$$\forall a, b \in \text{classes}, \quad u, v \in [1, z], \quad i \in [1, n_a], \quad j \in [1, n_b]$$
Suppose $\text{item}_u$ contains a concept of class c. If we find another $\text{item}_v$ that
contains that same concept, we say that $\text{item}_v$ is related to $\text{item}_u$ through relation
c. The number of possible relations is therefore equal to the number of classes
k. For example, if we find that an actor from movie u also plays in movie v, we
call these movies related through the relation actor. Note that the relations are
bidirectional and that a movie is always related to itself.
The existence of a relation c between $\text{item}_u$ and $\text{item}_v$ is equivalent to the
existence of a $j \in [1, n_c]$ for which $M_{c(u,j)} M_{c(v,j)} = 1$. This is the case if and only if
$\sum_{j=1}^{n_c} M_{c(u,j)} M_{c(v,j)} \geq 1$. The expression $\sum_{j=1}^{n_c} M_{c(u,j)} M_{c(v,j)}$ is also the definition
of the dot product between the u-th and v-th rows of $M_c$. If we calculate the
through relation a. It encodes the related concepts of all items, since $M^a_{b(u,j)} \geq 1$
We can also express Eq. 4 as a function of two arbitrary matrices with the same
number of rows:

$$\text{getRelated}(A, B) = A A^\top B \quad (5)$$
Using the function from Eq. 5 in a nested way, we can obtain related concepts
through longer paths. For example, we can obtain concepts through the path
(notation simplified):

$$\text{item} \rightarrow \text{concept}_a \rightarrow \text{item} \rightarrow \text{concept}_b \rightarrow \text{item} \rightarrow \text{concept}_c \quad (6)$$
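Eq. 5 translates directly to a matrix product; a small sketch with binary item-concept matrices:

```python
import numpy as np

def get_related(A, B):
    """Eq. 5: A (items x class-a concepts) and B (items x class-b concepts).
    A @ A.T marks item pairs sharing a class-a concept; multiplying by B then
    collects the class-b concepts of those related items."""
    return A @ A.T @ B
```

Nesting the call reproduces longer paths such as Eq. 6.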
4.3 Scaling
When all features are extracted from the descriptions, we have to consider how to
scale them. The traditional method of scaling of TF-IDF is the Inverse Document
Frequency (IDF) as explained in Section 2.1.1, and we apply the same scaling to
terms and synsets from the plots, in line with SF-IDF(+).
The scaling of concepts is slightly different because, in contrast to CF-IDF+,
these are not extracted from texts. Frequencies are always in {0, 1} as we only
extract occurrences of concepts. We do not apply IDF scaling, as we believe that
the interpretation of the feature values deviates too much from their original
meaning in TF/SF/CF-IDF+, where they are relative frequencies from texts.
The 1,000 synset values (VGG19) and the 1,024 visual-semantic feature values
(VSE) extracted from the posters could benefit from scaling as we expect that
some features are more relevant to the content of the movies. These features
should play a larger role in the cosine distance and therefore be scaled higher. We
have little information about the relevance of each of the 1,000 synsets, and even
less so about the 1,024 visual-semantic features, which are hidden and do not have
a natural interpretation. To investigate the upper limit of benefit from scaling, we
learn 1,000 scales for the synsets and 1,024 scales for the visual-semantic features
simultaneously with optimizing the model through stochastic gradient descent.
Following the notation of the similarity model in Table 4, we denote the scale
as $\vec{c}_i$ if it applies to the i-th feature type $t_i$. So $\vec{c}_i \in \mathbb{R}^{1000}$ if $t_i = VGG19$ and
$\vec{c}_i \in \mathbb{R}^{1024}$ if $t_i = VSE$. The user-profile vector $\vec{u}_i$ and the unseen item vector
$\vec{v}_i$ are then scaled through $\vec{c}_i \circ \vec{u}_i$ and $\vec{c}_i \circ \vec{v}_i$ respectively, with $\circ$ the element-wise
product. These resulting scaled vectors are used in the cosine. We restrict $\vec{c}_i > 0$
and $\sum \vec{c}_i = 1$ to avoid the over-parametrization caused by $\cos(\alpha\vec{u}, \vec{v}) = \cos(\vec{u}, \vec{v})$
for any $\alpha > 0$. Next to the scaled vectors, we also use the unscaled original vectors in
the model for comparison.
The results of the learned visual scaling could also indicate the benefit of
better scaling of the other feature types, for which this learning procedure is too
computationally expensive. Another advantage of applying this experiment only
to the visual features is that the number of features is fixed, so the resulting
learned scaling can be reused for other movie recommendation problems.
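One way to impose the constraints $\vec{c}_i > 0$ and $\sum \vec{c}_i = 1$ during stochastic gradient descent is a softmax parametrization of the scales; this is one possible implementation, shown as a sketch, and the parametrization choice is our assumption:

```python
import numpy as np

def scaled_cosine(u, v, theta):
    """Cosine similarity after element-wise scaling by c = softmax(theta).
    The softmax enforces c > 0 and sum(c) = 1 while leaving theta
    unconstrained for gradient-based optimization."""
    c = np.exp(theta - theta.max())
    c /= c.sum()
    su, sv = c * u, c * v
    return su @ sv / (np.linalg.norm(su) * np.linalg.norm(sv))
```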
4.4 Preparation
After the features are extracted from the descriptions and appropriately scaled,
we take some steps to prepare them for use in the similarity model. For each
movie, we have a feature vector for each type of feature. We increase the
parametric freedom compared to CF-IDF+ by placing concepts of each class into
separate vectors, which allows their relative importance to be learned. So the
concepts (features) of each class are considered a different type of feature. All
feature types used in this research are summarized in Table 3.
We use the notation (Table 4) used for the similarity model in the next section.
Let us denote the number of feature types as k = 9 and the set of feature types as
t, where $t_i$ is the i-th type, for example $t_1 = \text{director}$. We represent the feature
values of type $t_i$ for item $g \in [1, z]$ in a matrix $V_i^g \in \mathbb{R}^{m_i \times n_i}$, with z the total
4.5.1 Definition
Semantics-driven recommenders such as CF-IDF+, SF-IDF+, and Bing-SF-IDF+
have in common that all three methods construct one vector of features from the
user profile and one from the unseen item under consideration. To then calculate
similarity between the user profile and the item, the cosine similarity between
these two vectors is used. For CF-IDF+, these features are computed as a func-
tion of the frequency of concepts in the text and concepts related to them in
the domain ontology. SF-IDF+ does the same with synsets and retrieves the
relations from a lexical ontology. Bing-SF-IDF+ calculates a weighted average
$$\text{sim} = \sum_{i=1}^{k} w_i s_i = \vec{w} \cdot \vec{s} \quad (7)$$

$$\cos(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|_2 \|\vec{b}\|_2} \quad (8)$$

As in our model $s_i = \cos(\vec{u}_i, \vec{v}_i)$, we substitute the definition of cosine
similarity from Eq. 8 into Eq. 7 to get:

$$\text{sim} = \sum_{i=1}^{k} w_i s_i = \sum_{i=1}^{k} w_i \cos(\vec{u}_i, \vec{v}_i) \quad (9)$$
the SF-IDF+ value of the j-th synset and $n_1$ is the number of unique synsets.
So here $t_1 = \text{synset}$ and if we assume for example that toy is the 5-th synset,
$v_{15}$ is the unseen item's SF-IDF+ value of the synset toy. This value is not only
calculated from the occurrences of toy in the unseen item's text but also from
occurrences of synsets that are related to toy, for example from the occurrence of
doll. As we consider directly found features as a specific case of relation, we can
define the first relation as direct and in the SF-IDF+ model we restrict $q_{11} = 1$.
This makes the number of relations $m_1 = 28$ because the remaining relations are
then the 27 semantic relations found in WordNet. The weights $q_{1l}$ for $l \in [2, 28]$
are then the 27 optimizable weights restricted to [0, 1]. We set our restrictions
less strict than those of the original SF-IDF+, to $q_{il} > 0$ and $\sum_{l=1}^{m_i} q_{il} = 1$, as
we want to allow any weight to be the highest. The weights of SF-IDF+ can
be expressed in our model if we divide the weights vector by its sum, and the
similarities are unaffected by this rescaling because $\cos(\alpha\vec{u}, \vec{v}) = \cos(\vec{u}, \vec{v})$ for
any positive scalar $\alpha$.
We can define $V_1$ and $f_1$ by noting that the SF-IDF+ value $v_{1j}$ is the maximum
of the direct SF-IDF value and the SF-IDF values of synsets related to synset
j, multiplied by their corresponding relation weight from $\vec{q}_1$. In other words,
$V_{1(l,j)}$ is the maximum of SF-IDF values of all synsets related to synset j by
the l-th relation. Therefore the j-th column of $V_1$, denoted $V_{1(\cdot,j)}$, consists of
these 28 SF-IDF values from related synsets, one for each type of relation. Note
that the first value in the j-th column, $V_{1(1,j)}$, is the SF-IDF value of synset j
itself. Now to replicate SF-IDF+ within our framework we also need to take a
maximum over the 28 related SF-IDF values after they have been multiplied by
their corresponding relation weights from $\vec{q}_1$. We therefore define $f_1$ as follows:

$$v_{1j} = f_1(\vec{q}_1, V_{1(\cdot,j)}) = \max_{1 \leq l \leq 28} q_{1l} V_{1(l,j)} \quad (10)$$

Note that this is equivalent to taking the largest element of the element-wise
product of $\vec{q}_1$ and $V_{1(\cdot,j)}$, denoted $(\vec{q}_1 \circ V_{1(\cdot,j)}) \in \mathbb{R}^{28}$. This obtains the SF-IDF+
value for feature j, which is the j-th element $v_{1j}$ of the feature vector $\vec{v}_1$.
The item vector $\vec{v}_1$ in the cosine similarity of CF-IDF+ can be expressed
similarly but uses concepts instead of synsets and ontology relations instead of
WordNet relations; therefore $V_1$ would consist of CF-IDF values and $\vec{r}_1$ of
the $m_1$ defined relations in the domain ontology. If we would follow this CF-IDF+/SF-IDF+
method for all features in our research, we could define $f_i$ more
generally for any feature type $t_i$ as:
generally for any feature type ti as:
vij = fi (~
qi , Vi(,j) ) = max qil Vi(l,j) j [1, ni ] (11)
1lmi
However, Eq. 11 reveals the computational bottleneck in the SF-IDF+ and CF-IDF+ models: the item vector $\vec{v}_i$ that is used in the cosine similarity
consists of maxima that can be known only by using the full matrix $V_i \in \mathbb{R}^{m_i \times n_i}$
and the parameter vector of weights $\vec{q}_i \in \mathbb{R}^{m_i}$. As we want to separate the dot-products from the parameters, we change $f_i$ in Eq. 11 to calculate $v_{ij}$ by taking
the sum instead of the maximum:
$$v_{ij} = f_i(\vec{q}_i, V_{i(\cdot,j)}) = \sum_{l=1}^{m_i} q_{il} V_{i(l,j)} = \vec{q}_i V_{i(\cdot,j)}^{\top} \quad \forall j \in [1, n_i] \qquad (12)$$
We can now see from Eq. 12 that the $j$-th element of $\vec{v}_i$ has become the matrix
multiplication of the row vector $\vec{q}_i$ and the transpose of column $j$ of $V_i$. This
means that if we represent the vector $\vec{q}_i$ as a $1 \times m_i$ matrix, we no longer need
$f_i$, as we can perform the $n_i$ multiplications that form the elements of $\vec{v}_i$ with one
matrix multiplication, using $\vec{q}_i$ and the full $m_i \times n_i$ matrix $V_i$:

$$\vec{v}_i = \vec{q}_i V_i \qquad (13)$$
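The difference between the max rule of Eq. 11 and the sum rule of Eq. 12/13 can be illustrated with a small sketch (toy numbers, not the thesis data):

```python
# Combining related feature values: the max rule of Eq. 11 (SF-IDF+/CF-IDF+
# style) versus the sum rule of Eq. 12, which reduces to one matrix product.

def combine_max(q, V):
    """v_j = max_l q_l * V[l][j]  (Eq. 11); needs the full V at similarity time."""
    n = len(V[0])
    return [max(q[l] * V[l][j] for l in range(len(q))) for j in range(n)]

def combine_sum(q, V):
    """v_j = sum_l q_l * V[l][j]  (Eq. 12), i.e. the row vector q times V."""
    n = len(V[0])
    return [sum(q[l] * V[l][j] for l in range(len(q))) for j in range(n)]

# m_i = 3 relation rows (first row = direct values), n_i = 4 features.
q = [0.5, 0.3, 0.2]
V = [[1.0, 0.0, 2.0, 0.5],
     [0.0, 1.0, 1.0, 0.0],
     [4.0, 0.0, 0.0, 1.0]]

v_max = combine_max(q, V)
v_sum = combine_sum(q, V)  # equals q @ V, so dot-products can be pre-computed
print(v_max, v_sum)
```

Only the sum rule factors into a plain matrix product, which is what enables the pre-computation discussed below.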
Notice that the matrix $V_i$ consists of (related) feature values for an unseen item,
but can be calculated for any item, including those in the user profiles. We denote
by $V_i^g$ the item feature matrix $V_i$ corresponding to the $g$-th item in a user profile,
which is a set of $p$ liked items. As the user profile's features are defined as sums
of the $p$ items' features and the weights $\vec{q}_i$ are the same for each item, the user
profile feature vector $\vec{u}_i$ can be represented as:
$$\vec{u}_i = \sum_{g=1}^{p} \vec{v}_i^{\,g} = \sum_{g=1}^{p} \vec{q}_i V_i^g = \vec{q}_i \sum_{g=1}^{p} V_i^g = \vec{q}_i U_i \qquad (14)$$

Here we have defined $U_i = \sum_{g=1}^{p} V_i^g$, which we call the user feature matrix. Using
Eq. 13 and Eq. 14 for the feature vectors $\vec{u}_i$ and $\vec{v}_i$ we can rewrite the cosine
similarity for feature type $i$ as:
$$s_i = \cos(\vec{q}_i U_i, \vec{q}_i V_i) = \frac{(\vec{q}_i U_i) \cdot (\vec{q}_i V_i)}{\|\vec{q}_i U_i\|_2 \, \|\vec{q}_i V_i\|_2} \qquad (15)$$

$$s_i = \frac{\vec{q}_i U_i (\vec{q}_i V_i)^\top}{\sqrt{\vec{q}_i U_i (\vec{q}_i U_i)^\top} \sqrt{\vec{q}_i V_i (\vec{q}_i V_i)^\top}} = \frac{\vec{q}_i (U_i V_i^\top) \vec{q}_i^{\,\top}}{\sqrt{\vec{q}_i (U_i U_i^\top) \vec{q}_i^{\,\top}} \sqrt{\vec{q}_i (V_i V_i^\top) \vec{q}_i^{\,\top}}} \qquad (16)$$
We can now rewrite Eq. 9 for our full model's similarity as:

$$\mathrm{sim} = \sum_{i=1}^{k} w_i s_i = \sum_{i=1}^{k} w_i \frac{\vec{q}_i (U_i V_i^\top) \vec{q}_i^{\,\top}}{\sqrt{\vec{q}_i (U_i U_i^\top) \vec{q}_i^{\,\top}} \sqrt{\vec{q}_i (V_i V_i^\top) \vec{q}_i^{\,\top}}} \qquad (17)$$
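A minimal sketch (toy matrices with $m_i = 2$ relations, not the thesis data) shows why Eq. 16 is cheap: once $A = U V^\top$, $B = U U^\top$ and $C = V V^\top$ are pre-computed, the part-similarity needs only $m_i$-dimensional quadratic forms in the relation weights, never the full $n_i$-dimensional feature vectors.

```python
import math

def quad(q, M):
    """Quadratic form q M q^T for a row vector q and an m x m matrix M."""
    m = len(q)
    return sum(q[a] * M[a][b] * q[b] for a in range(m) for b in range(m))

def part_similarity(q, A, B, C):
    """s_i of Eq. 16 from the pre-computed m_i x m_i matrices."""
    return quad(q, A) / (math.sqrt(quad(q, B)) * math.sqrt(quad(q, C)))

# Toy feature matrices U, V (m = 2 relations, n = 3 features) to build A, B, C.
U = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
V = [[2.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
mm = lambda X, Y: [[sum(x * y for x, y in zip(rx, ry)) for ry in Y] for rx in X]
A, B, C = mm(U, V), mm(U, U), mm(V, V)   # U V^T, U U^T, V V^T

q = [0.7, 0.3]
s = part_similarity(q, A, B, C)

# The same value from the full n-dimensional cosine of qU and qV:
qU = [q[0] * U[0][j] + q[1] * U[1][j] for j in range(3)]
qV = [q[0] * V[0][j] + q[1] * V[1][j] for j in range(3)]
dot = lambda x, y: sum(a * b for a, b in zip(x, y))
s_full = dot(qU, qV) / (math.sqrt(dot(qU, qU)) * math.sqrt(dot(qV, qV)))
print(abs(s - s_full) < 1e-12)
```

Both routes give the same cosine, but the left-hand route never touches the $n$-dimensional space.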
Since we have defined $U_i = \sum_{g=1}^{p} V_i^g$, we can show that $(U_i V_i^\top)$ and $(U_i U_i^\top)$ can
both be written as sums of matrix multiplications of the $p$ feature matrices $V_i^g$
of liked items and the feature matrix $V_i$ of the unseen item:

$$U_i V_i^\top = \Big(\sum_{g=1}^{p} V_i^g\Big) V_i^\top = \sum_{g=1}^{p} \big(V_i^g V_i^\top\big) \qquad (18)$$
$$U_i U_i^\top = \Big(\sum_{g=1}^{p} V_i^g\Big)\Big(\sum_{g=1}^{p} V_i^g\Big)^{\top} = \Big(\sum_{a=1}^{p} V_i^a\Big)\Big(\sum_{b=1}^{p} (V_i^b)^\top\Big) = \sum_{a=1}^{p} \sum_{b=1}^{p} V_i^a (V_i^b)^\top \qquad (19)$$
Equation 17 shows how $\mathrm{sim}$ is a function of $(U_i V_i^\top), (U_i U_i^\top), (V_i V_i^\top) \in \mathbb{R}^{m_i \times m_i}$
and the parameter vectors $\vec{q}_i \in \mathbb{R}^{m_i}$, $\vec{w} \in \mathbb{R}^k$. We know from Eq. 19 that $U_i U_i^\top$
and $U_i V_i^\top$ are simply sums of $V_i^a (V_i^b)^\top$ for some $a, b \in [1, z]$, which, as described
in Section 4.4, can therefore be pre-computed for all item pairs.
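The pre-computation this relies on can be sketched with hypothetical toy data: for every pair of items $(a, b)$ in a catalogue of $z$ items, store the $m \times m$ block $V^a (V^b)^\top$; at training time $U_i V_i^\top$ and $U_i U_i^\top$ are then sums of stored blocks (Eq. 18 and 19), and the $n$-dimensional features are never touched again.

```python
def mat_mul_T(X, Y):
    """X Y^T for two m x n matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(rx, ry)) for ry in Y] for rx in X]

# z = 3 toy items, each with an m = 2 x n = 4 feature matrix.
items = [
    [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 0.0, 1.0]],
    [[0.0, 2.0, 1.0, 0.0], [1.0, 0.0, 0.0, 2.0]],
    [[1.0, 1.0, 0.0, 1.0], [0.0, 0.0, 2.0, 0.0]],
]
z = len(items)

# X[a][b] holds the 2 x 2 block V^a (V^b)^T for all z * z item pairs.
X = [[mat_mul_T(items[a], items[b]) for b in range(z)] for a in range(z)]

# A user profile of items {0, 1} and unseen item 2: Eq. 18 sums p blocks.
profile = [0, 1]
UVt = [[sum(X[g][2][r][c] for g in profile) for c in range(2)] for r in range(2)]
print(UVt)
```

The same lookup-and-sum pattern yields $U_i U_i^\top$ from $p^2$ blocks and $V_i V_i^\top$ from a single diagonal block.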
Learned scaling. Note that when we learn scaling for the visual feature types
VGG19/VSE as described in Section 4.3, we insert $\vec{u}_i \to (\vec{c}_i \circ \vec{u}_i)$ and $\vec{v}_i \to (\vec{c}_i \circ \vec{v}_i)$
in the similarity model, where $\vec{c}_i \in \mathbb{R}^{n_i}$ is the learnable scaling and $\circ$ denotes the
element-wise product. Note also that $\vec{u}_i = U_i$ and $\vec{v}_i = V_i$, because the number
of relations $m_i = 1$ for these feature types. Therefore the part-similarity model
of Eq. 16 changes to:

$$s_i = \frac{\vec{q}_i \big((\vec{c}_i \circ \vec{u}_i)(\vec{c}_i \circ \vec{v}_i)^\top\big) \vec{q}_i^{\,\top}}{\sqrt{\vec{q}_i \big((\vec{c}_i \circ \vec{u}_i)(\vec{c}_i \circ \vec{u}_i)^\top\big) \vec{q}_i^{\,\top}} \sqrt{\vec{q}_i \big((\vec{c}_i \circ \vec{v}_i)(\vec{c}_i \circ \vec{v}_i)^\top\big) \vec{q}_i^{\,\top}}} \qquad (20)$$
We restrict $\sum_{l=1}^{m_i} q_{il} = 1$ and here $m_i = 1$, making $\vec{q}_i = 1$ and redundant:

$$s_i = \frac{(\vec{c}_i \circ \vec{u}_i)(\vec{c}_i \circ \vec{v}_i)^\top}{\sqrt{(\vec{c}_i \circ \vec{u}_i)(\vec{c}_i \circ \vec{u}_i)^\top} \sqrt{(\vec{c}_i \circ \vec{v}_i)(\vec{c}_i \circ \vec{v}_i)^\top}} \qquad (21)$$
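Eq. 21 can be sketched with toy vectors: the learned scaling $\vec{c}$ enters the cosine as an element-wise reweighting of both the profile vector and the item vector.

```python
import math

def scaled_cosine(c, u, v):
    """cos((c o u), (c o v)) with o the element-wise product (Eq. 21)."""
    cu = [ci * ui for ci, ui in zip(c, u)]
    cv = [ci * vi for ci, vi in zip(c, v)]
    dot = lambda x, y: sum(a * b for a, b in zip(x, y))
    return dot(cu, cv) / (math.sqrt(dot(cu, cu)) * math.sqrt(dot(cv, cv)))

u = [1.0, 0.0, 2.0]
v = [0.0, 1.0, 2.0]
print(scaled_cosine([1.0, 1.0, 1.0], u, v))  # plain cosine
print(scaled_cosine([0.1, 0.1, 2.0], u, v))  # scaling can emphasise feature 3
```

With a uniform scaling the plain cosine is recovered; a non-uniform scaling shifts the similarity toward the up-weighted features.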
Example. To see how the proposed efficiency improvement works out in practice, we revisit the SF-IDF+ model. In our case, we retrieve 18 relations from
WordNet and the total number of unique synsets (including the related ones) found
in the plots is 69,977. Every movie therefore has a feature matrix $V_1$ of size
$19 \times 69{,}977$ containing the (related) SF-IDF values for that movie. In the traditional model, we would have to multiply each row of the feature matrix with its
corresponding relation weight and take the maximum over the columns to obtain
the SF-IDF+ vector $\vec{v}_1$ of length 69,977. We use $p = 5$ items per user profile,
so we would have to repeat the above computation 5 times and calculate the
sum of the 5 vectors to obtain the SF-IDF+ user profile vector $\vec{u}_1$. To then
calculate the part-similarity $s_1$ we would calculate the 69,977-dimensional cosine
similarity $\cos(\vec{u}_1, \vec{v}_1)$. This entire operation would have to be repeated for each
observation while optimizing, and we have 1,406,976 observations (Section 4.6.1)
in our training set.

With the approach proposed in this section, we prepare $V_1^a (V_1^b)^\top$ for all combinations of movies $a, b \in [1, z]$ through a single matrix multiplication as described
in Section 4.4. Due to the efficiency of performing this in a single operation, we
obtain $X_1$ within minutes. Given $X_1$, dimensionality is reduced from 69,977 to
19, because we now only require the $V_1^a (V_1^b)^\top$, which are blocks of size $19 \times 19$ in $X_1$.
For each observation in the training set, we can obtain $V_1 V_1^\top$ for the unseen item
directly as a $19 \times 19$ block on the diagonal of $X_1$. For $U_1 V_1^\top$ and $U_1 U_1^\top$ we need
to sum $p = 5$ and $p^2 = 25$ different blocks respectively. These operations can
be performed before optimization as well, as explained in Section 4.6.1. Using
$(U_1 V_1^\top), (U_1 U_1^\top), (V_1 V_1^\top)$ as observation data, we can optimize the model of Eq.
16, which then consists of several 19-dimensional matrix multiplications.
4.5.3 Logistic Transformation

The different $s_i$ cannot be scaled freely either, which can lead to problems, again
even with high-quality predictors. As an obvious example, consider that we have
only one part-similarity, but it is perfectly predictive by correlation: $s_1 = \frac{y}{100}$.
With the restriction on scaling, it can only change the similarity by $\frac{1}{100}$ units,
even if mean-shifts are allowed.

To solve the above problems, we have to remove the restriction $\sum_{i=1}^{k} w_i = 1$
and add a learnable bias $\beta$ to the model. Now $\mathrm{sim}$ is still linear and increasing
but less restricted. To then bound the model to $[0, 1]$ we apply a logistic function
to the output of the model in Eq. 17:

$$\mathrm{sim} = \frac{1}{1 + e^{-(\beta + \sum_{i=1}^{k} w_i s_i)}} \qquad (22)$$
We can now relax the previously (Section 4.5.1) imposed $s_i > 0$ restriction because $0 \le \mathrm{sim} \le 1 \;\forall s_i \in \mathbb{R}$. Since we no longer need $s_i > 0$ we can also relax
$\vec{u}_i, \vec{v}_i > 0$ for the feature vectors, allowing us to use the visual-semantic embeddings, which are not non-negative. We nevertheless keep $w_i > 0$ because we still
desire $\frac{\partial\, \mathrm{sim}}{\partial s_i} > 0$. Note that $\mathrm{sim}$ remains an increasing function of $w_i$ and $s_i$ after
the transformation of Eq. 22.

If the $s_i$ were data inputs the model would be linear logistic regression, but $s_i$
is itself a non-linear model, following Eq. 16 or Eq. 21. We can however, in line with
classic logistic regression, use cross-entropy, also called logloss, as a loss function
over an item's observed like/dislike $y \in \{0, 1\}$ and the predicted similarity $\mathrm{sim} \in [0, 1]$ between the item and the user profile:

$$L = -y \log(\mathrm{sim}) - (1 - y) \log(1 - \mathrm{sim}) \qquad (23)$$

The similarity can therefore also be interpreted as the probability of a like given
the input data $(U_i V_i^\top), (U_i U_i^\top), (V_i V_i^\top) \;\forall i \in [1, k]$.
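Eq. 22 and Eq. 23 can be sketched with toy values: the weighted part-similarities plus a bias pass through a logistic function, and the output is scored with cross-entropy against the observed label.

```python
import math

def sim_logistic(w, s, beta):
    """sim = 1 / (1 + exp(-(beta + sum_i w_i s_i)))  (Eq. 22)."""
    z = beta + sum(wi * si for wi, si in zip(w, s))
    return 1.0 / (1.0 + math.exp(-z))

def logloss(y, sim):
    """L = -y log(sim) - (1 - y) log(1 - sim)  (Eq. 23)."""
    return -y * math.log(sim) - (1 - y) * math.log(1 - sim)

w = [2.0, 1.5]    # part-similarity weights (kept positive)
s = [0.8, -0.1]   # part-similarities, now allowed to be negative
sim = sim_logistic(w, s, beta=-1.0)
print(sim, logloss(1, sim), logloss(0, sim))
```

A similarity above 0.5 is penalized less when the item was actually liked, which is exactly the probabilistic reading given above.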
4.5.4 Gradient
Given the model definition of Eq. 22 and the cross-entropy loss of Eq. 23
we can define the objective function $J(\theta)$ to be minimized as $-y \log(\mathrm{sim}(\theta)) - (1 - y) \log(1 - \mathrm{sim}(\theta))$, where $\theta$ is the vector of parameters. To simplify the
notation in this section, we add the bias parameter $\beta$ to the $k$ weights in $\vec{w}$, so
$w_{k+1} = \beta$ and $s_{k+1} = 1$. Then the parameters $\theta$ contain the relation weights
vectors $\vec{q}_i \;\forall i \in [1, k]$, the weights vector $\vec{w}$, and if applicable the scaling $\vec{c}_i$. We
now derive the gradient of the objective function:

$$\nabla J(\theta) = -\frac{y \,\nabla \mathrm{sim}(\theta)}{\mathrm{sim}(\theta)} + \frac{(y - 1) \,\nabla \mathrm{sim}(\theta)}{\mathrm{sim}(\theta) - 1} = \frac{(y - \mathrm{sim}(\theta)) \,\nabla \mathrm{sim}(\theta)}{\mathrm{sim}(\theta)(\mathrm{sim}(\theta) - 1)} \qquad (24)$$

With $S(\theta) = \sum_{i=1}^{k+1} w_i s_i$ we can derive the gradient of $\mathrm{sim}$ using the quotient and
chain rules:

$$\nabla \mathrm{sim}(\theta) = \nabla \frac{1}{1 + e^{-S(\theta)}} = \frac{e^{S(\theta)} \,\nabla S(\theta)}{(e^{S(\theta)} + 1)^2} \qquad (25)$$
The part of the gradient with respect to the weights vector $\vec{w}$ is here called $\nabla_w$,
the parts with respect to each of the weights vectors $\vec{q}_i$ are called $\nabla_{q_i} \;\forall i \in [1, k]$,
and the part with respect to the scaling $\vec{c}_i$ is called $\nabla_{c_i}$. We first derive $\nabla_w$ by
expressing it for each element $w_i$ of $\vec{w}$:

$$\nabla_{w_i} S(\theta) = \nabla_{w_i} \sum_{j=1}^{k+1} w_j s_j = \sum_{j=1,\, j \neq i}^{k+1} 0 + \nabla_{w_i} w_i s_i = s_i \qquad (27)$$
Before we derive $\nabla_{q_i}$, we simplify our notation with the data matrices $A = U_i V_i^\top$,
$B = U_i U_i^\top$, $C = V_i V_i^\top$. Note that $\vec{q}_i$ is only learnable for feature types that use
the model for $s_i$ of Eq. 16 without a learned scaling $\vec{c}_i$. We can now write:

$$\nabla_{q_i} S(\theta) = w_i \nabla s_i = w_i \nabla \frac{\vec{q}_i A \vec{q}_i^{\,\top}}{\sqrt{\vec{q}_i B \vec{q}_i^{\,\top}} \sqrt{\vec{q}_i C \vec{q}_i^{\,\top}}} \qquad (28)$$

If we define the functions $f = \vec{q}_i A \vec{q}_i^{\,\top}$ and $g = \sqrt{\vec{q}_i B \vec{q}_i^{\,\top}} \sqrt{\vec{q}_i C \vec{q}_i^{\,\top}}$, we can use the
quotient rule:

$$\nabla_{q_i} S(\theta) = w_i \frac{g \nabla f - (\nabla g) f}{g^2} \qquad (29)$$

We derive the intermediate result for $\nabla f$:

$$\nabla f = \nabla\, \vec{q}_i A \vec{q}_i^{\,\top} = (A + A^\top) \vec{q}_i^{\,\top} \qquad (30)$$

and, using the symmetry of $B$ and $C$, for $\nabla g$:

$$\nabla g = \frac{\vec{q}_i C \vec{q}_i^{\,\top}\, B \vec{q}_i^{\,\top} + \vec{q}_i B \vec{q}_i^{\,\top}\, C \vec{q}_i^{\,\top}}{\sqrt{\vec{q}_i B \vec{q}_i^{\,\top}\, \vec{q}_i C \vec{q}_i^{\,\top}}} \qquad (31)$$
Substituting the results from Eq. 30 and 31 for $\nabla f$ and $\nabla g$ in Eq. 29 we obtain:

$$\nabla_{q_i} S(\theta) = w_i \frac{\sqrt{\vec{q}_i B \vec{q}_i^{\,\top}\, \vec{q}_i C \vec{q}_i^{\,\top}}\, (A + A^\top) \vec{q}_i^{\,\top} - \dfrac{\vec{q}_i C \vec{q}_i^{\,\top}\, B \vec{q}_i^{\,\top} + \vec{q}_i B \vec{q}_i^{\,\top}\, C \vec{q}_i^{\,\top}}{\sqrt{\vec{q}_i B \vec{q}_i^{\,\top}\, \vec{q}_i C \vec{q}_i^{\,\top}}}\, \vec{q}_i A \vec{q}_i^{\,\top}}{\vec{q}_i B \vec{q}_i^{\,\top}\, \vec{q}_i C \vec{q}_i^{\,\top}} \qquad (32)$$
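The closed-form $q$-gradient can be sanity-checked numerically (toy matrices, not thesis data): the gradient of $s_i = \vec{q} A \vec{q}^\top / (\sqrt{\vec{q} B \vec{q}^\top}\sqrt{\vec{q} C \vec{q}^\top})$ per Eq. 29-32 is compared against central finite differences.

```python
import math

A = [[2.0, 1.0], [0.5, 1.0]]   # U V^T, not symmetric in general
B = [[2.0, 0.5], [0.5, 1.0]]   # U U^T, symmetric positive definite
C = [[1.5, 0.2], [0.2, 2.0]]   # V V^T, symmetric positive definite

def quad(q, M):
    return sum(q[a] * M[a][b] * q[b] for a in range(2) for b in range(2))

def s(q):
    return quad(q, A) / (math.sqrt(quad(q, B)) * math.sqrt(quad(q, C)))

def grad_s(q):
    """Closed form: (g * (A + A^T) q - (grad g) * f) / g^2."""
    f, qB, qC = quad(q, A), quad(q, B), quad(q, C)
    g = math.sqrt(qB) * math.sqrt(qC)
    df = [sum((A[r][c] + A[c][r]) * q[c] for c in range(2)) for r in range(2)]
    Bq = [sum(B[r][c] * q[c] for c in range(2)) for r in range(2)]
    Cq = [sum(C[r][c] * q[c] for c in range(2)) for r in range(2)]
    dg = [(qC * Bq[r] + qB * Cq[r]) / g for r in range(2)]
    return [(g * df[r] - dg[r] * f) / g ** 2 for r in range(2)]

q = [0.6, 0.4]
analytic = grad_s(q)
h = 1e-6
numeric = [(s([q[0] + h, q[1]]) - s([q[0] - h, q[1]])) / (2 * h),
           (s([q[0], q[1] + h]) - s([q[0], q[1] - h])) / (2 * h)]
print(max(abs(a - n) for a, n in zip(analytic, numeric)))  # close to zero
```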
This concludes the derivation for the part-similarities $s_i$ that use the model of
Eq. 16. For the $s_i$ model of Eq. 21 with learned scaling $\vec{c}_i$, which does not have
relation weights $\vec{q}_i$, we need $\nabla_{c_i}$ instead of $\nabla_{q_i}$:

$$\nabla_{c_i} S(\theta) = w_i \nabla s_i = w_i \nabla \frac{(\vec{c}_i \circ \vec{u}_i)(\vec{c}_i \circ \vec{v}_i)^\top}{\sqrt{(\vec{c}_i \circ \vec{u}_i)(\vec{c}_i \circ \vec{u}_i)^\top} \sqrt{(\vec{c}_i \circ \vec{v}_i)(\vec{c}_i \circ \vec{v}_i)^\top}} \qquad (33)$$
For clarity we temporarily denote the $j$-th elements of the vectors $\vec{c}_i$, $\vec{u}_i$, $\vec{v}_i$ as $c_j$,
$u_j$, $v_j$ respectively, with $j \in [1, n_i]$ and $n_i$ the number of features. For $b \in [1, n_i]$
we derive the $b$-th element $\nabla_{c_b} s_i$ of $\nabla_{c_i} s_i$:

$$\nabla_{c_b} s_i = \nabla_{c_b} \frac{\sum_{j=1}^{n_i} c_j u_j c_j v_j}{\sqrt{\sum_{j=1}^{n_i} c_j u_j c_j u_j} \sqrt{\sum_{j=1}^{n_i} c_j v_j c_j v_j}} = \nabla_{c_b} \frac{\sum_{j=1}^{n_i} c_j^2 u_j v_j}{\sqrt{\sum_{j=1}^{n_i} c_j^2 u_j^2} \sqrt{\sum_{j=1}^{n_i} c_j^2 v_j^2}} \qquad (34)$$
A sum over all indices $[1, n_i]$ except the $b$-th is denoted $\sum_{j \neq b}$. We define the
functions $f$ and $g$:

$$f = \sum_{j=1}^{n_i} c_j^2 u_j v_j = c_b^2 u_b v_b + \sum_{j \neq b} c_j^2 u_j v_j$$
$$g = \sum_{j=1}^{n_i} c_j^2 u_j^2 \sum_{j=1}^{n_i} c_j^2 v_j^2 = \Big(c_b^2 u_b^2 + \sum_{j \neq b} c_j^2 u_j^2\Big)\Big(c_b^2 v_b^2 + \sum_{j \neq b} c_j^2 v_j^2\Big) \qquad (35)$$
$$\nabla_{c_b} f = 2 c_b u_b v_b + 0 = 2 c_b u_b v_b$$
$$\begin{aligned}
\nabla_{c_b} g &= 4 c_b^3 u_b^2 v_b^2 + 2 c_b u_b^2 \sum_{j \neq b} c_j^2 v_j^2 + 2 c_b v_b^2 \sum_{j \neq b} c_j^2 u_j^2 \\
&= 2 c_b \Big( u_b^2 \Big( c_b^2 v_b^2 + \sum_{j \neq b} c_j^2 v_j^2 \Big) + v_b^2 \Big( c_b^2 u_b^2 + \sum_{j \neq b} c_j^2 u_j^2 \Big) \Big) \\
&= 2 c_b \Big( u_b^2 \sum_{j=1}^{n_i} c_j^2 v_j^2 + v_b^2 \Big( c_b^2 u_b^2 + \sum_{j \neq b} c_j^2 u_j^2 \Big) \Big) \\
&= 2 c_b \Big( u_b^2 \sum_{j=1}^{n_i} c_j^2 v_j^2 + v_b^2 \sum_{j=1}^{n_i} c_j^2 u_j^2 \Big)
\end{aligned} \qquad (36)$$
Rewriting $\nabla_{c_b} s_i$ of Eq. 34 with $f$ and $g$ we obtain (notation $\sum_{j=1}^{n_i}$ simplified to $\sum$):

$$\begin{aligned}
\nabla_{c_b} s_i &= \nabla_{c_b} \frac{f}{\sqrt{g}} = \frac{2 g \nabla_{c_b} f - f \nabla_{c_b} g}{2 \sqrt{g^3}} \\
&= \frac{2 \sum c_j^2 u_j^2 \sum c_j^2 v_j^2 \; 2 c_b u_b v_b - \sum c_j^2 u_j v_j \; 2 c_b \big( u_b^2 \sum c_j^2 v_j^2 + v_b^2 \sum c_j^2 u_j^2 \big)}{2 \sqrt{\big( \sum c_j^2 u_j^2 \sum c_j^2 v_j^2 \big)^3}} \\
&= \frac{c_b \big( 2 \sum c_j^2 u_j^2 \sum c_j^2 v_j^2 \; u_b v_b - \sum c_j^2 u_j v_j \big( u_b^2 \sum c_j^2 v_j^2 + v_b^2 \sum c_j^2 u_j^2 \big) \big)}{\sqrt{\big( \sum c_j^2 u_j^2 \sum c_j^2 v_j^2 \big)^3}}
\end{aligned} \qquad (37)$$
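The scaling gradient of Eq. 37 can likewise be checked numerically with toy vectors: the closed-form $b$-th partial derivative is compared against a central finite difference.

```python
import math

u = [1.0, -0.5, 2.0]
v = [0.5, 1.0, -1.0]

def s(c):
    num = sum(cj * cj * uj * vj for cj, uj, vj in zip(c, u, v))
    gu = sum(cj * cj * uj * uj for cj, uj in zip(c, u))
    gv = sum(cj * cj * vj * vj for cj, vj in zip(c, v))
    return num / math.sqrt(gu * gv)

def grad_b(c, b):
    """Closed form of Eq. 37 for one index b."""
    Suv = sum(cj * cj * uj * vj for cj, uj, vj in zip(c, u, v))
    Su = sum(cj * cj * uj * uj for cj, uj in zip(c, u))
    Sv = sum(cj * cj * vj * vj for cj, vj in zip(c, v))
    num = c[b] * (2 * Su * Sv * u[b] * v[b]
                  - Suv * (u[b] ** 2 * Sv + v[b] ** 2 * Su))
    return num / math.sqrt((Su * Sv) ** 3)

c = [0.8, 1.2, 0.5]
h = 1e-6
for b in range(3):
    up = list(c); up[b] += h
    dn = list(c); dn[b] -= h
    numeric = (s(up) - s(dn)) / (2 * h)
    assert abs(grad_b(c, b) - numeric) < 1e-6
print("Eq. 37 matches finite differences")
```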
4.6.1 Sampling
To optimize (train) the part-similarity weights $\vec{w}$ and the relation weights
$\vec{q}_i$ we apply stochastic gradient descent (SGD) on the gradient of the similarity
model. The target similarity $y \in \{0, 1\}$ is defined as $y = I\{\text{user likes item}\}$.
An item is considered liked by a user if that user rates it with a score of 4.5 or
higher, and disliked otherwise. This results in an average proportion of 19.12%
liked items, and 20.9 liked items per user ($n_{liked}$). We shuffle the order of users in
our dataset and subsequently take the first 1,000 as the test (hold-out) set, the
following 1,000 as the validation set, and the remaining 136,493 as the training
set. The test set is used for evaluation, the training set to optimize the similarity
model, and the validation set to validate the similarity model for early stopping
while training.
An observation is a pair of user-profile and unseen item. User-profiles are
constructed by sampling $p = 5$ liked items from a user. For each observation
the matrices $U_i V_i^\top$, $U_i U_i^\top$, and $V_i V_i^\top$ are constructed from the pre-computed
data $X_i$. The $V_i V_i^\top$ are retrieved as blocks of $X_i$, while $U_i V_i^\top$ and $U_i U_i^\top$
are constructed from sums of $p$ and $p^2$ blocks respectively, as explained in Section 4.5.2.
For the training and validation sets, the unseen items are defined as all items not
in the user-profile, which is sampled from a random user. For each user-profile,
we sample a liked item or a disliked item with equal probability, such that we
obtain balanced training and validation sets with $E(y) = 0.5$. Each observation
is therefore a random user-profile and item, sampled from a random user. We
sample 100 batches of 1,024 validation observations and 1,374 training batches
of 1,024 observations, for totals of 102,400 and 1,406,976 observations respectively.
The test set should reflect a realistic recommendation setting, so we sample
the $p = 5$ user-profile items by shuffling all rated items and then iteratively
discarding the first item, adding it to the user-profile if it is liked. We stop as
soon as we have obtained $p = 5$ liked items. All discarded liked and disliked
items are then considered to be seen. In other words, we simulate the situation
in time when a recommender detects that a user has liked $p = 5$ items. We
require the unseen items to contain at least one liked and one disliked item to be
able to measure performance, which leaves us with 809 eligible user-profiles from
the 1,000 test users. We then construct observations for the user-profile with
each unseen item. For each user, we save these in a separate batch. The test
data is therefore composed of 809 batches of varying sizes, namely the number
of unseen items.
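The test-profile sampling described above can be sketched as follows (hypothetical toy ratings, not the MovieLens data): walk over the shuffled rated items, moving liked items into the profile until $p = 5$ are found; everything walked over counts as seen, and the remainder stays unseen.

```python
import random

def sample_test_profile(rated, p=5, seed=0):
    """rated: list of (item_id, liked) pairs for one user."""
    shuffled = rated[:]
    random.Random(seed).shuffle(shuffled)
    profile, seen = [], []
    for item_id, liked in shuffled:
        if len(profile) == p:
            break
        seen.append(item_id)
        if liked:
            profile.append(item_id)
    unseen = [(i, l) for i, l in shuffled if i not in seen]
    return profile, unseen

rated = [(i, i % 3 == 0) for i in range(20)]  # every 3rd item "liked"
profile, unseen = sample_test_profile(rated)
# The user is eligible only if the unseen items contain both classes.
eligible = any(l for _, l in unseen) and any(not l for _, l in unseen)
print(len(profile), eligible)
```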
4.6.2 Optimization
Stochastic gradient descent (SGD) is a stochastic approximation of gradient descent, a first-order iterative optimization algorithm. Gradient descent finds a
local minimum of an objective function $J(\theta)$ by following the steepest descent
given by the gradient $\nabla J(\theta)$. In each update iteration, the gradient is first estimated
from all observations in the training data, after which a step proportional to the
gradient at the current point is taken to update the parameter vector $\theta$. The size
of these steps is determined by the learning rate $\eta$, which has to be set
by the researcher. The update rule for gradient descent is:

$$\theta \leftarrow \theta - \eta \nabla J(\theta) \qquad (38)$$
Instead of calculating the gradient over all observations before each update, SGD
approximates the gradient by calculating it over a single observation. It passes
over the observations in the training set one by one and performs a parameter
update after each visited observation. One full pass over the training set is called
an epoch. To avoid cyclical patterns that can hinder convergence, the observations are shuffled before each epoch. Mini-batch stochastic gradient descent is
a version of this method that divides the training set into batches of fixed size
and applies the update step to each mini-batch. We hereafter refer to these
mini-batches simply as batches. With a batch size of 1 the method is therefore
equivalent to SGD, and with a batch size equal to the training set's size it is
equivalent to standard gradient descent.
Using batches provides the benefit of allowing parallel computation of the
gradients from each observation, which can lead to substantial speed improvements when implemented on parallel computing hardware such as a Graphics
Processing Unit (GPU). This requires a single batch to fit in memory, and it is
further common to use a batch size that is a multiple of 16 as this can lead to
efficiency improvements on some hardware. We choose to use batches of 1,024
observations each. As our training set is expected to be too large to load in
memory, we store the batches separately on disk and load them in memory while
training, inspired by [34].
The learning rate $\eta$ has to be chosen by the researcher and there are many
factors involved, so it requires substantial tuning to obtain desirable convergence
behaviour. If $\eta$ is too high, the algorithm keeps oscillating and overshooting the
minima. On the other hand, if $\eta$ is too low, it will take too long to descend
and could insufficiently explore the error surface. To alleviate this, learning rate
annealing can be applied, where the learning rate is initialized at a relatively
high value and is decreased at every epoch. This encourages fast exploration of
the solution space in the earlier epochs, while the decreasing learning rate allows
for finer exploitation of the found minimum later on.
Momentum can also be added to prevent oscillations and accelerate convergence [22]. This method remembers the parameter update $\Delta\theta$ from the previous
iteration and adds it, multiplied by some factor $0 < \gamma < 1$, to the update of
the next iteration. The updates therefore have a property similar to physical
momentum, as the method creates a tendency to keep the updates moving in the
same direction:

$$\Delta\theta \leftarrow -\eta \nabla J(\theta) + \gamma \Delta\theta, \qquad \theta \leftarrow \theta + \Delta\theta \qquad (39)$$
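The update rules of Eq. 38 and Eq. 39 can be sketched on a toy one-dimensional objective $J(\theta) = (\theta - 3)^2$, whose gradient is $2(\theta - 3)$:

```python
def grad_J(theta):
    return 2.0 * (theta - 3.0)

# Plain gradient descent (Eq. 38).
theta = 0.0
eta = 0.1
for _ in range(100):
    theta = theta - eta * grad_J(theta)

# Gradient descent with momentum (Eq. 39).
theta_m, delta = 0.0, 0.0
gamma = 0.9
for _ in range(100):
    delta = -eta * grad_J(theta_m) + gamma * delta
    theta_m = theta_m + delta

print(theta, theta_m)  # both approach the minimum at 3
```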
Other improvements are methods for adaptive learning rates, such as RMSProp
[28] and its successor Adam [13], which is shown to converge faster in [13].
Adaptive means that the learning rates are scaled differently (adapted) for each
parameter by taking into account the previously observed magnitudes of the
gradient for that parameter. This helps the update steps to adjust for the different
magnitudes of the partial gradients.
The similarity model is trained with SGD on the 1,374 training batches, and
after each epoch the validation error is measured on the 100 validation batches.
If this validation error improves on the best value so far, the current
parameters are saved. Early stopping is activated as soon as the validation error
has not improved over the last 5 epochs. The order of the training batches is
shuffled at every epoch. We anneal the learning rate at every epoch by dividing
the initial learning rate by the number of elapsed epochs. The stored parameters,
those that led to the lowest validation error while training, are subsequently used
for evaluation. The pseudo-code for this procedure is given in Algorithm 1.
4.6.3 Evaluation
We evaluate our method on the 809 test user-profiles sampled according to the
method of Section 4.6.1. We use the trained model to predict the similarity score
for each unseen item in a batch. The comparison between the predicted scores
and the actual likes forms the basis of performance measurement.
For each threshold $r \in \{\frac{i}{500} \mid i \in (0, 500)\}$ the unseen items for which $\mathrm{sim} > r$
are recommended. We can define $\hat{y} = I[\mathrm{sim} > r] \in \{0, 1\}$ to indicate this. From
these recommendations, we can obtain the number of true positives $TP = \sum y\hat{y}$,
false positives $FP = \sum \hat{y}(1 - y)$, true negatives $TN = \sum (1 - y)(1 - \hat{y})$, and false
negatives $FN = \sum y(1 - \hat{y})$ for each user. We can then calculate

$$Precision = \frac{TP}{TP + FP} \qquad (40)$$
Algorithm 1 Optimization
 1: procedure Optimize(η, trainBatches, valBatches)
 2:   Generate β ∼ Normal                                  ▷ Random unrestricted
 3:   Generate w_i, q⃗_i ∼ Exp ∀i ∈ [1, k]                    ▷ Random non-negative
 4:   q⃗_i ← q⃗_i / ∑ q⃗_i ∀i ∈ [1, k]                           ▷ Sum of q⃗_i restriction
 5:   θ ← {β, w⃗, q⃗_1..q⃗_k}                                   ▷ Parameters
 6:   epoch ← 1                                            ▷ Epoch counter
 7:   stop ← 1                                             ▷ Stopping counter
 8:   while stop ≤ 5 do                                    ▷ Early stopping criterion
 9:     Shuffle trainBatches
10:     η_t = η / epoch                                     ▷ Learning rate annealing
11:     for B ∈ trainBatches do                            ▷ Loop over batches
12:       θ ← θ − (η_t / 1,024) ∑_{obs=1}^{1,024} ∇L(θ, B_obs)   ▷ Gradient descent update
13:     end for
14:     ℓ ← L(θ, valBatches)                               ▷ Get average validation loss
15:     if epoch = 1 or ℓ < ℓ* then
16:       ℓ* ← ℓ                                           ▷ Update best validation loss
17:       θ* ← θ                                           ▷ Update best parameters
18:       stop ← 1                                         ▷ Reset stop counter
19:     else
20:       stop ← stop + 1                                  ▷ Increment stop counter
21:     end if
22:     epoch ← epoch + 1                                  ▷ Increment epoch counter
23:   end while
24:   return θ*                                            ▷ Return best parameters
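A compact Python rendering of Algorithm 1 (with a hypothetical toy loss and gradient standing in for the batched similarity model, plus an epoch cap not in the original listing): early stopping after 5 epochs without validation improvement, with $1/\mathrm{epoch}$ annealing.

```python
import random

def optimize(eta, train_batches, val_batches, loss, grad, theta,
             max_wait=5, max_epochs=500):
    best_loss, best_theta, epoch, wait = None, list(theta), 1, 1
    while wait <= max_wait and epoch <= max_epochs:
        random.shuffle(train_batches)
        eta_t = eta / epoch                      # learning rate annealing
        for batch in train_batches:
            g = grad(theta, batch)               # mean gradient over the batch
            theta = [t - eta_t * gi for t, gi in zip(theta, g)]
        val = sum(loss(theta, b) for b in val_batches) / len(val_batches)
        if best_loss is None or val < best_loss:
            best_loss, best_theta, wait = val, list(theta), 1
        else:
            wait += 1
        epoch += 1
    return best_theta

# Toy quadratic "model": each batch is a target t, loss = (theta_0 - t)^2.
train = [[2.9], [3.1], [3.0]]
val = [[3.0]]
loss = lambda th, b: (th[0] - b[0]) ** 2
grad = lambda th, b: [2 * (th[0] - b[0])]
theta = optimize(0.1, train, val, loss, grad, [0.0])
print(theta)
```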
5 Results
We compare the results for models that use various combinations of feature types
in Eq. 17. For the sake of brevity we refer to the combination of the 5 concept
feature types as C. TF-IDF is the baseline model with terms from plots, and is
denoted T. Our version of SF-IDF+ based on synsets from plots is called S.
VGG19 and VSE are referred to as VG and VS respectively. These abbreviations
referring to one or more feature types are the model components. Each model
can consist of one or more of these components. An overview of the feature types
belonging to each component can be found in Table 5.
Model       k*   θ**     Description
T           1    2       TF-IDF, benchmark.
C           5    18      Modified CF-IDF+.
S           1    21      SF-IDF+, synsets from plots.
C+S         6    38      C and S combined.
VG          1    2       VGG19 synsets, unscaled.
VS          1    2       Visual-Semantic Embeddings, unscaled.
VGL         1    1,002   VGG19 synsets, learned scaling.
VSL         1    1,026   Visual-Semantic Embeddings, learned scaling.
C+S+VG      7    39      C, S, and VG, unscaled.
C+S+VGL     7    1,039   C, S, and VGL, learned scaling.
C+S+VGR     7    39      C, S, and VGR, scaling transferred from VGL.
C+S+VGA     7    39      C, S, and VGA, scaling transferred from VGL.
* Number of feature types (part-similarities). ** Number of parameters.
5.1 Optimization
We implement the optimization procedure in the Python 2.7¹⁴ library Keras¹⁵,
which uses a back-end supported by Theano¹⁶. The calculations are performed
on a regular desktop PC equipped with an NVIDIA GTX1060 GPU, which enables efficient parallel computation of the gradient updates in batches of 1,024
observations. The gradients are calculated automatically by Theano using back-propagation. We minimize the overhead of on-line loading by using a solid-state
drive (SSD) and simultaneously loading the next batch while SGD is applied to
the current batch. Batches are reshuffled before every epoch. For 10 random
restarts, each model is initialized with random weights and is trained with the
optimization and early stopping procedure of Section 4.6.2 with Adam updates, using
an initial learning rate of $\eta = 10^{-2}$ and the other parameters as recommended
by the authors [13]. Together with $1/\mathrm{epoch}$ annealing (Algorithm 1) this shows
stable convergence in our experiments.
To optimize C+S+VGA and C+S+VGR, we have to first optimize the VGL
model, extract the visual scaling from the 10 restarts, and pre-compute the
VGG19 dot-products with this scaling. The optimization results are presented in
Table 7 and we find that training time is within reasonable limits, taking fewer
than 70 minutes for even the heaviest model, C+S+VGL. The impact of our
¹⁴ https://www.python.org/download/releases/2.7/
¹⁵ https://keras.io/
¹⁶ http://deeplearning.net/software/theano/
5.2 Evaluation
Using the optimized models, we can calculate the performance metrics on our
set of 809 test users. We present the results in Table 8, and it is clear that
even though we do not directly optimize for these metrics, a lower logloss results
in higher test performance. The p-values of pairwise permutation tests on the
average AUC(ROC) are presented in Table 9.
                   AUC            F1              κ
              ROC     PR     min_r  max_r   min_r  max_r
C+S+VGA      0.634   0.391   0.435  0.537   0.137  0.298
C+S+VGL      0.624   0.385   0.431  0.531   0.131  0.289
C+S+VGR      0.624   0.386   0.432  0.532   0.128  0.286
C+S+VG       0.574   0.362   0.419  0.510   0.087  0.253
VGL          0.605   0.347   0.429  0.519   0.110  0.262
VSL          0.605   0.370   0.422  0.517   0.115  0.268
VG           0.525   0.308   0.415  0.476   0.036  0.189
VS           0.508   0.299   0.415  0.472   0.018  0.176
C+S          0.570   0.361   0.419  0.509   0.083  0.251
C            0.567   0.358   0.419  0.507   0.081  0.249
T (TF-IDF)   0.535   0.324   0.413  0.479   0.041  0.200
S (SF-IDF+)  0.531   0.319   0.411  0.477   0.038  0.198
             VS    VG    S     T     C     C+S   C+S+VG  VSL   VGL   C+S+VGL
C+S+VGA*     .00   .00   .00   .00   .00   .00   .00     .00   .00   .00     (.00 vs C+S+VGR)
C+S+VGR*     .00   .00   .00   .00   .00   .00   .00     .00   .00   .50
C+S+VGL      .00   .00   .00   .00   .00   .00   .00     .00   .00
VGL          .00   .00   .00   .00   .00   .00   .00     .49
VSL          .00   .00   .00   .00   .00   .00   .00
C+S+VG       .00   .00   .00   .00   .00   .00
C+S          .00   .00   .00   .00   .10
C            .00   .00   .00   .00
T (TF-IDF)   .00   .07   .12
S (SF-IDF+)  .00   .21
VG           .00
VS
* Bonferroni: all comparisons with p < 0.01 remain significant at α/10 = 0.1%.
The synsets small, son, and name are correctly identified from the words little,
boy, and named respectively. The word loves is misidentified as a noun even
though it is clearly a verb, but far more damaging is its mapping to the synset
sexual love. This is not what the author had in mind and completely misses the
point. We further see that the word room is mapped to the wrong synset, but
the rest of the words are processed correctly. This example points to low-quality
WSD as the most likely cause of the unexpected under-performance. Although
theoretically disambiguation leads to more accurate semantics, mistakes are also
amplified. In the above case the ambiguous meanings of the raw terms love and
room would have been preferable, which might explain why TF-IDF performed
slightly better. Nonetheless, the consistently high weights of WordNet relations
shown in Table 10 do demonstrate that related synsets add value.
Concepts alone (C) are more informative than both synsets and terms, as
model C substantially (Table 8) improves over the baseline TF-IDF on all metrics.
When we look at the relation weights for concepts in Table 11, we see that for
all models the directly found concepts have the highest average optimal weight,
as we would expect. But we can infer from the maximum weights in Table
11 that there are a few optimal solutions in which one of the other relations holds
most of the weight. The different minima obtain a similar loss but can vary
substantially in parameter estimates. For indirect concepts of the actors class,
the most dominant relation is writers. This suggests that users tend to value
movies with actors who have played a role in movies that involve writers from
their user-profile. Related directors on the other hand receive the highest weight
when they are found through other directors. Related writers receive little weight
overall, regardless of the relation through which they are found.
                Relation        C     C+S   C+S+VG  C+S+VGL  C+S+VGA  C+S+VGR
Mean  Actors    direct         .268  .774  .772    .271     .932     .360
                via Actors     .000  .000  .000    .272     .000     .174
                via Directors  .000  .000  .000    .272     .000     .174
                via Writers    .700  .011  .152    .417     .003     .398
      Directors direct         .697  .643  .666    .741     .602     .620
                via Actors     .002  .000  .002    .005     .001     .005
                via Directors  .272  .318  .287    .239     .338     .330
                via Writers    .030  .039  .046    .016     .059     .045
      Writers   direct         .880  .886  .764    .888     .898     .888
                via Actors     .048  .037  .030    .045     .023     .045
                via Directors  .065  .054  .060    .054     .045     .056
                via Writers    .008  .023  .147    .013     .034     .012
Max   Actors    direct         .911  .909  .907    .923     .942     .936
                via Actors     .000  .000  .001    .965     .001     .654
                via Directors  .096  .969  .094    .079     .095     .225
                via Writers    .985  .019  .971    .977     .005     .980
      Directors direct         .710  .685  .721    .762     .704     .834
                via Actors     .004  .001  .014    .008     .006     .016
                via Directors  .298  .394  .390    .335     .578     .712
                via Writers    .107  .101  .164    .102     .384     .276
      Writers   direct         .912  .897  .920    .923     .908     .903
                via Actors     .057  .056  .055    .054     .026     .081
                via Directors  .079  .060  .077    .067     .073     .081
                via Writers    .031  .038  .907    .029     .044     .043
Comparing the visual feature models, we see that the unscaled VG outperforms
VS, which indicates that the 1,000 synset feature values that we extract from the
posters are more suitable for recommendation than the 1,024-dimensional visual-semantic embeddings. Optimized scaling results in a large
performance increase: from an AUC(ROC) of 0.508 to 0.605 for VSL and from
0.525 to 0.605 for VGL. Under learned scaling VSL rivals VGL on some metrics,
and closes the gap on AUC(ROC). These results indicate that the visual-semantic
embeddings do not improve recommender performance over the synset vectors.
In contrast to the visual-semantic features, the synsets are interpretable and
we can compare the 1,000 learned scales. We optimize the same scaling in a
full C+S+VGL model as a benchmark to compare C+S+VGA and C+S+VGR
against, and to estimate the impact of re-using or transferring learned scales across
models. The 10 synsets with the highest scale, on average over the 10 restarts
(Table 12), exhibit some consistency. We find an average pairwise correlation of
0.268 (n=45) between the visual scalings of the 10 restarts for VGL, and a higher
0.486 (n=45) for C+S+VGL. Due to the much higher dimensional space compared to the
relation weights, less stability can be expected. Nevertheless, the correlations
indicate that there is some stability across solutions, and this increases when
concepts and synsets are added. Although the number of parameters does not
increase substantially from this addition, there could be interactions or information overlap between the feature types that lead to less variation between local
minima. We find an average correlation of 0.360 (n=55) between the restarts of
the two models, which increases to 0.768 (n=1) when we compare the two
mean scalings.
VGL                        C+S+VGL
Fur coat        .01198     Book jacket       .01587
Stage           .00991     Toyshop           .01144
Web site        .00975     Bow tie           .01142
Balloon         .00935     Web site          .01142
Jigsaw puzzle   .00930     Jigsaw puzzle     .01000
Volcano         .00923     Volcano           .00973
Pick            .00907     Cinema            .00911
Toyshop         .00869     Sweatshirt        .00887
Cinema          .00861     Fountain          .00886
Bow tie         .00855     Military uniform  .00847
When the mean optimized scales of VGL are transferred to the C+S+VGR
model, it strongly outperforms its unscaled version C+S+VG and all other recommenders without learned scaling. The performance is indistinguishable from
that of C+S+VGL, which can be considered a good benchmark for the learned
scaling because this model optimized the scaling together with the rest of the
model. In our test environment, therefore, re-using the visual feature scales in
a different model does not seem to decrease performance. When we collect the
average VG scale over the 10 random restarts of VGL and transfer this to C+S+VGA,
we see that it strongly outperforms all other models (Table 8), and for each
of the other models we can reject the null hypothesis of an equal average
AUC(ROC) at the 1% level. C+S+VGA outperforms the traditional benchmark
TF-IDF by a large margin on all metrics. Average AUC(ROC) improves from
0.531 to 0.634, and AUC(PR) from 0.324 to 0.391. We improve $\min_r(F_1)$ from
0.413 to 0.435, and $\max_r(F_1)$ from 0.479 to 0.537. Kappa metrics are improved
from 0.038 to 0.137 and from 0.198 to 0.298 for $\min_r(\kappa)$ and $\max_r(\kappa)$ respectively.
Given the separately pre-trained visual scaling, we can optimize the model with
the scalable approach using pre-computed dot-products in 4-5 minutes. It is neither necessary to train the scaling together with the model as a whole, nor to
directly optimize on the final performance metrics.
6 Conclusion
In this thesis we proposed an extension to previous works on semantics-driven
recommenders. We demonstrated that these systems are broadly applicable be-
yond news recommendations, and the complex domain of movies is one example.
We found that rich semantic information can be extracted not just from articles,
but item descriptions in a much broader sense, even raw images. When a suit-
able domain ontology is unavailable or incompatible with the recommendation
system, our virtual ontology method for finding related concepts can be applied
directly and only requires the dataset itself. Compared to state-of-the-art visual-
semantic methods, synset based visual feature extraction turned out to be more
interpretable and achieved similar performance. In situations where the proper
scaling method for feature vectors is unknown, we showed that effective scales
can be found through direct optimization of the logloss. We further provided evi-
dence that these learned scales can be transferred to other models without having
to re-optimize them. Through a reformulation of how related features are com-
bined, we were able to pre-compute the computationally expensive operations of
the cosine similarities and reduced the dimensionality of the similarity model by
several orders of magnitude. The semantics-driven recommender we presented
strongly outperformed the benchmark TF-IDF on AUC(ROC), AUC(PR), $F_1$, and $\kappa$, even
though it was not directly optimized on these metrics but on a cross-entropy
loss function that allowed for efficient gradient-based methods. The proposed
scaling-up of the semantics-driven approach has allowed us to optimize these
models within minutes on consumer-grade commodity hardware.
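The pre-computation idea can be sketched briefly. The code below is illustrative only (array shapes, names, and the single weight per feature group are assumptions, not the thesis implementation): the expensive cosine similarities between a user profile and all candidate items are computed once, after which optimization only touches a small weight vector.

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_groups, dim = 100, 4, 50

# Hypothetical feature vectors: one profile vector and one item
# vector per feature group (e.g. synsets, concepts, visual synsets).
profile = rng.normal(size=(n_groups, dim))
items = rng.normal(size=(n_items, n_groups, dim))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Pre-compute once: an (n_items, n_groups) similarity matrix S.
# All high-dimensional dot products happen here, before training.
S = np.array([[cosine(items[i, g], profile[g]) for g in range(n_groups)]
              for i in range(n_items)])

# Training then reduces to learning n_groups weights over S,
# e.g. scoring items with a weighted sum through a logistic.
w = rng.normal(size=n_groups)
scores = 1 / (1 + np.exp(-(S @ w)))
print(scores.shape)
```

Because S is fixed during training, each gradient step costs O(n_items x n_groups) instead of touching the dim-sized feature vectors, which is the dimensionality reduction of several orders of magnitude referred to above.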
This research highlights that semantics-driven recommenders have many un-
explored applications and can be utilized effectively with the proposed approach,
opening the door to further extensions to other domains. There is potential for
synsets to contribute more information with an improved word sense disambigua-
tion method. The visual synsets extracted from the posters do not have to be
disambiguated but can perhaps be augmented with related synsets from Word-
Net. The convincing success of learned feature scaling introduces the possibility
of models with greater degrees of freedom, especially since the short training
time on commodity hardware means that still larger datasets can be utilized.
From our dataset, many more user profiles and training samples could have been
sampled. More semantic information, in the form of biographies, a wider variety of
images/posters, reviews, or external ontologies such as DBpedia
(http://wiki.dbpedia.org/), could be integrated within the large capacity of the
proposed framework. And although we have used the most important semantic
variables, we do not rule out the potential (domain) semantics left in some of the
discarded data, such as the country of origin. The capacity to train more
parameters could be spent on learning separate weights or scalings for lexical
categories such as nouns, verbs, adverbs, and adjectives for the synsets. Related
synsets and concepts are currently found through direct connections only, but
multi-step paths in either WordNet or the domain ontology fit in the proposed
framework and could add additional (domain) semantics. When
concepts are extracted as variables instead of from text, IDF scaling is not an
obvious choice. Relevant questions about scaling remain: we demonstrated the
large impact of optimal scaling, but it has not yet been tested for concepts or
synsets extracted from texts. The benefit of allowing a concept to belong to
multiple classes, such as a person who is the director of one movie and an actor
in another, is also left to explore. When the number of parameters is increased
to the point of overfitting, tools such as regularization, noise injection, or
random sub-sampling of features can be employed to further increase model
capacity.
References
[1] Satanjeev Banerjee and Ted Pedersen. An Adapted Lesk Algorithm for Word
Sense Disambiguation Using WordNet, pages 136–145. Springer Berlin Heidelberg,
Berlin, Heidelberg, 2002.
[2] Michel Capelle, Flavius Frasincar, Marnix Moerland, and Frederik Hogen-
boom. Semantics-based News Recommendation. In Proceedings of the
2nd International Conference on Web Intelligence, Mining and Semantics,
WIMS 2012, New York, NY, USA, 2012. ACM.
[3] Michel Capelle, Marnix Moerland, Frederik Hogenboom, Flavius Frasincar,
and Damir Vandic. Bing-SF-IDF+: A Hybrid Semantics-Driven News Rec-
ommender. In Proceedings of the 2015 ACM Symposium on Applied Com-
puting, SAC 2015, New York, NY, USA, 2015. ACM.
[4] Dan C. Ciresan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, and
Jurgen Schmidhuber. Flexible, high performance convolutional neural net-
works for image classification. In Proceedings of the Twenty-Second Interna-
tional Joint Conference on Artificial Intelligence - Volume Two, IJCAI'11,
pages 1237–1242. AAAI Press, 2011.
[5] D. Ciregan, U. Meier, and J. Schmidhuber. Multi-column deep neural net-
works for image classification. In 2012 IEEE Conference on Computer Vision
and Pattern Recognition, pages 3642–3649, June 2012.
[6] Jacob Cohen. A Coefficient of Agreement for Nominal Scales. Educational
and Psychological Measurement, 20(1):37–46, 1960.
[7] Emma de Koning. News Recommendation with CF-IDF+. B.S. Thesis,
Erasmus University Rotterdam, July 2015.
[8] Olive Jean Dunn. Multiple Comparisons Among Means. Journal of the
American Statistical Association, 56(293):52–64, 1961.
[9] M. Egmont-Petersen, D. de Ridder, and H. Handels. Image Processing with
Neural Networks – a Review. Pattern Recognition, 35(10):2279–2301, 2002.
[10] Sachin Sudhakar Farfade, Mohammad J. Saberian, and Li-Jia Li. Multi-
view Face Detection Using Deep Convolutional Neural Networks. CoRR,
abs/1502.02766, 2015.
[11] Frank Goossen, Wouter IJntema, Flavius Frasincar, Frederik Hogenboom,
and Uzay Kaymak. News Personalization Using the CF-IDF Semantic Rec-
ommender. In Proceedings of the 1st International Conference on Web In-
telligence, Mining and Semantics, WIMS 2011, New York, NY, USA, 2011.
ACM.
[12] Karen Sparck Jones. A Statistical Interpretation of Term Specificity and its
Application in Retrieval. Journal of Documentation, 28(1):11–21, 1972.
[13] Diederik P. Kingma and Jimmy Ba. Adam: a Method for Stochastic Opti-
mization. CoRR, abs/1412.6980, 2014.
[14] Ryan Kiros, Ruslan Salakhutdinov, and Richard S. Zemel. Unifying Visual-
Semantic Embeddings with Multimodal Neural Language models. CoRR,
abs/1411.2539, 2014.
[17] Masakazu Matsugu, Katsuhiko Mori, Yusuke Mitari, and Yuji Kaneda. Sub-
ject Independent Facial Expression Recognition with Robust Face Detection
using a Convolutional Neural Network. Neural Networks, 16(5–6):555–559,
2003. Advances in Neural Networks Research: IJCNN '03.
[18] Marnix Moerland, Frederik Hogenboom, Michel Capelle, and Flavius Fras-
incar. Semantics-based News Recommendation with SF-IDF+. In Proceed-
ings of the 3rd International Conference on Web Intelligence, Mining and
Semantics, WIMS 2013, New York, NY, USA, 2013. ACM.
[21] Amir Hossein Nabizadeh Rafsanjani, Naomie Salim, Atae Rezaei Aghdam,
and Karamollah Bagheri Fard. Recommendation Systems: A Review. Inter-
national Journal of Computational Engineering Research, 3(5):47–52, 2013.
[23] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh,
Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S.
Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet Large Scale Visual
Recognition Challenge. CoRR, abs/1409.0575, 2014.
[24] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks
for Large-scale Image Recognition. CoRR, abs/1409.1556, 2014.
[25] SINTEF. Big Data, for Better or Worse: 90% of world's data generated
over last two years, 2013. www.sciencedaily.com/releases/2013/05/
130522085217.htm.
[26] W3Techs. The PNG image file format is now more popular than
GIF, January 2013. https://w3techs.com/blog/entry/the_png_image_file_
format_is_now_more_popular_than_gif.
[27] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Ra-
binovich. Going Deeper with Convolutions. CoRR, abs/1409.4842, 2014.
[28] T. Tieleman and G. Hinton. Lecture 6.5 - RMSProp: Divide the gradient by
a running average of its recent magnitude. COURSERA: Neural Networks
for Machine Learning, 2012.
[29] C.J. Van Rijsbergen, S.E. Robertson, and M.F. Porter. New Models in
Probabilistic Information Retrieval. British Library research & development
report. Computer Laboratory, University of Cambridge, 1980.
[30] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show
and Tell: A Neural Image Caption Generator. In The IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), June 2015.
[31] Tim Vos. On the Recommendation of News Using the CF-IDF+ Recom-
mender. B.S. Thesis, Erasmus University Rotterdam, July 2015.
[32] P. D. Wasserman and T. Schwartz. Neural networks. II. What are they and
why is everybody so interested in them now? IEEE Expert, 3(1):10–15,
Spring 1988.
[33] Karl Weiss, Taghi M. Khoshgoftaar, and DingDing Wang. A Survey of
Transfer Learning. Journal of Big Data, 3(1):9, 2016.
[34] Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang, and Chih-Jen Lin. Large
linear classification when data cannot fit in memory. In Proceedings of the
16th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, KDD '10, pages 833–842, New York, NY, USA, 2010. ACM.
[35] Guo-Qing Zhang, Guo-Qiang Zhang, Qing-Feng Yang, Su-Qi Cheng, and
Tao Zhou. Evolution of the Internet and its Cores. New Journal of Physics,
10(12):123027, 2008.