Collaborative recommendation
The basic idea of these systems is that if users shared the same interests in the past (if they viewed
or bought the same books, for instance) they will also have similar tastes in the future. So if, for
example, user A and user B have a purchase history that overlaps strongly and user A has recently
bought a book that B has not yet seen, the basic rationale is to propose this book to B as well.
Because this selection of hopefully interesting books involves filtering the most promising ones
from a large set and because the users implicitly collaborate with one another, this technique is also
called collaborative filtering (CF).
Today, systems of this kind are in wide use and have also been extensively studied over the past
fifteen years. We cover the underlying techniques and open questions associated with collaborative
filtering in detail in the next chapter of this book. Typical questions that arise in the context of
collaborative approaches include the following:
How do we find users with similar tastes to the user for whom we need a recommendation?
How do we measure similarity?
What should we do with new users, for whom a buying history is not yet available?
How do we deal with new items that nobody has bought yet?
What if we have only a few ratings that we can exploit?
What other techniques besides looking for similar users can we use for making a prediction
about whether a certain user will like an item?
Pure CF approaches do not exploit or require any knowledge about the items themselves.
Continuing with the bookstore example, the recommender system, for instance, does not need to
know what a book is about, its genre, or who wrote it.
Content-based recommendation
In general, recommender systems may serve two different purposes.
1. They can be used to stimulate users into doing something, such as buying a specific book or
watching a specific movie.
2. Recommender systems can also be seen as tools for dealing with information overload, as these
systems aim to select the most interesting items from a larger set.
Thus, recommender systems research is also strongly rooted in the fields of information retrieval
and information filtering. In these areas, however, the focus lies mainly on the problem of
discriminating between relevant and irrelevant documents (as opposed to the artifacts such as books
or digital cameras recommended in traditional e-commerce domains).
At its core, content-based recommendation is based on the availability of (manually created or
automatically extracted) item descriptions and a profile that assigns importance to these
characteristics. If we think again of the bookstore example, the possible characteristics of books
might include the genre, the specific topic, or the author. Similar to item descriptions, user profiles
may also be automatically derived and learned either by analyzing user behavior and feedback or
by asking explicitly about interests and preferences.
In the context of content-based recommendation, the following questions must be answered:
How can systems automatically acquire and continuously improve user profiles?
How do we determine which items match, or are at least similar to or compatible with, a
user's interests?
What techniques can be used to automatically extract or learn the item descriptions to reduce
manual annotation?
Pure collaborative approaches take a matrix of given user-item ratings as the only input and
typically produce the following types of output: (a) a (numerical) prediction indicating to what
degree the current user will like or dislike a certain item and (b) a list of N recommended items.
Such a top-N list should, of course, not contain items that the current user has already bought.
User-based nearest neighbor recommendation
The first approach we discuss here is also one of the earliest methods, called user-based nearest
neighbor recommendation. The main idea is simply as follows: given a ratings database and the ID
of the current (active) user as an input, identify other users (sometimes referred to as peer users or
nearest neighbors) that had similar preferences to those of the active user in the past.
Then, for every product p that the active user has not yet seen, a prediction is computed based on
the ratings for p made by the peer users. The underlying assumptions of such methods are that (a) if
users had similar tastes in the past they will have similar tastes in the future and (b) user preferences
remain stable and consistent over time.
The Pearson correlation coefficient takes values from +1 (strong positive correlation) to -1 (strong
negative correlation).
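To make the scheme above concrete, here is a minimal Python sketch (not code from the book; the ratings dictionary, the user and item names, and the choice of k are invented for illustration). It computes Pearson similarities to the active user, keeps the k most similar peers who rated the target item, and combines their mean-adjusted ratings:

    # A minimal sketch of user-based prediction; all data and names are invented.
    from math import sqrt

    def pearson(ratings, a, b):
        """Pearson correlation between users a and b over their co-rated items."""
        common = set(ratings[a]) & set(ratings[b])
        if len(common) < 2:
            return 0.0
        mean_a = sum(ratings[a][i] for i in common) / len(common)
        mean_b = sum(ratings[b][i] for i in common) / len(common)
        num = sum((ratings[a][i] - mean_a) * (ratings[b][i] - mean_b) for i in common)
        den = sqrt(sum((ratings[a][i] - mean_a) ** 2 for i in common)) * \
              sqrt(sum((ratings[b][i] - mean_b) ** 2 for i in common))
        return num / den if den else 0.0

    def predict(ratings, active, item, k=2):
        """Combine the k most similar peers' mean-adjusted ratings for the item."""
        peers = sorted(((pearson(ratings, active, u), u)
                        for u in ratings if u != active and item in ratings[u]),
                       reverse=True)[:k]
        mean_active = sum(ratings[active].values()) / len(ratings[active])
        num = sum(sim * (ratings[u][item] - sum(ratings[u].values()) / len(ratings[u]))
                  for sim, u in peers)
        den = sum(abs(sim) for sim, u in peers)
        if not den:
            return mean_active  # no usable peers: fall back to the user's mean
        return mean_active + num / den

    ratings = {
        "Alice": {"Item1": 5, "Item2": 3, "Item3": 4, "Item4": 4},
        "User1": {"Item1": 3, "Item2": 1, "Item3": 2, "Item4": 3, "Item5": 3},
        "User2": {"Item1": 4, "Item2": 3, "Item3": 4, "Item4": 3, "Item5": 5},
    }
    print(predict(ratings, "Alice", "Item5"))  # predicted rating for the unseen item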
2.1.2 Better similarity and weighting metrics
In the example, we used Pearson's correlation coefficient to measure the similarity among users. In
the literature, other metrics, such as adjusted cosine similarity (which will be discussed later in
more detail), Spearman's rank correlation coefficient, or the mean squared difference measure have
also been proposed to determine the proximity between users. However, empirical analyses show
that for user-based recommender systems (at least for the best-studied recommendation
domains) the Pearson coefficient outperforms other measures of comparing users (Herlocker et al.
1999). For the later-described item-based recommendation techniques, however, it has been
reported that the cosine similarity measure consistently outperforms the Pearson correlation metric.
2.1.3 Neighborhood selection
The value chosen for k (the size of the neighborhood) does not influence coverage. However, the
problem of finding a good value for k still exists: When the number of neighbors k taken into
account is too high, too many neighbors with limited similarity bring additional noise into the
predictions.
When k is too small (for example, below 10 in the experiments from Herlocker et al. (1999)), the
quality of the predictions may be negatively affected. An analysis of the MovieLens dataset
indicates that in most real-world situations, a neighborhood of 20 to 50 neighbors seems
reasonable (Herlocker et al. 2002).
2.2 Item-based nearest neighbor recommendation
Although user-based CF approaches have been applied successfully in different domains, some
serious challenges remain when it comes to large e-commerce sites, on which we must handle
millions of users and millions of catalog items.
In particular, the need to scan a vast number of potential neighbors makes it impossible to compute
predictions in real time. Large-scale e-commerce sites, therefore, often implement a different
technique, item-based recommendation, which is more apt for offline preprocessing and thus
allows for the computation of recommendations in real time even for a very large rating matrix
(Sarwar et al. 2001).
The main idea of item-based algorithms is to compute predictions using the similarity between
items and not the similarity between users.
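As a rough illustration (an invented sketch, not the algorithm exactly as specified by Sarwar et al.), the following computes item-to-item cosine similarities over a small rating matrix and then weights the active user's own ratings by those similarities:

    # A compact sketch of item-based prediction using cosine similarity
    # between item rating vectors; the data here are illustrative only.
    import numpy as np

    # rows = users, columns = items; 0 means "not rated"
    R = np.array([[5, 3, 4, 4, 0],
                  [3, 1, 2, 3, 3],
                  [4, 3, 4, 3, 5],
                  [3, 3, 1, 5, 4],
                  [1, 5, 5, 2, 1]], dtype=float)

    def cosine(u, v):
        """Cosine similarity over users who rated both items."""
        mask = (u > 0) & (v > 0)
        if not mask.any():
            return 0.0
        den = np.linalg.norm(u[mask]) * np.linalg.norm(v[mask])
        return np.dot(u[mask], v[mask]) / den if den else 0.0

    def predict_item_based(R, user, item):
        """Weight the user's own ratings by item-item similarity to `item`."""
        sims = np.array([cosine(R[:, item], R[:, j]) if j != item else 0.0
                         for j in range(R.shape[1])])
        rated = R[user] > 0
        den = sims[rated].sum()
        if not den:
            return 0.0
        return np.dot(sims[rated], R[user][rated]) / den

    print(predict_item_based(R, user=0, item=4))  # rating prediction for the unseen item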
A first, and very simple, way to implement collaborative filtering with a probabilistic method is to
view the prediction problem as a classification problem, which can generally be described as the
task of assigning an object to one of several predefined categories (Tan et al. 2006). As an
example of a classification problem, consider the task of classifying an incoming e-mail message as
spam or non-spam. In order to automate this task, a function has to be developed that determines
(based, for instance, on the words that occur in the message header or content) whether the message
is classified as spam or not. The classification task can therefore be seen as the problem of
learning this mapping function from training examples. Such a function is also informally called the
classification model.
One standard technique also used in the area of data mining is based on Bayes classifiers. We show,
with a simplified example, how a basic probabilistic method can be used to calculate rating
predictions.
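A toy sketch of that idea follows; the training data, the 1-5 rating scale, and the Laplace smoothing constant are illustrative assumptions. The rating for the target item plays the role of the class label, and the user's ratings for the other items are the features:

    # Naive Bayes for rating prediction, in miniature: pick the rating (class)
    # with the highest prior * likelihood score. Data and smoothing are invented.

    # training data: each row is (ratings for items 1-3, rating for the target item)
    train = [
        ((1, 3, 3), 1),
        ((5, 1, 2), 5),
        ((4, 2, 4), 4),
        ((3, 5, 5), 3),
    ]

    def predict_nb(features, train, classes=(1, 2, 3, 4, 5), alpha=1.0):
        """Return the class (rating) with the highest naive Bayes score."""
        best, best_score = None, -1.0
        for c in classes:
            rows = [f for f, label in train if label == c]
            prior = (len(rows) + alpha) / (len(train) + alpha * len(classes))
            likelihood = 1.0
            for i, value in enumerate(features):
                matches = sum(1 for f in rows if f[i] == value)
                # Laplace smoothing (5 possible rating values) so unseen
                # feature values do not zero out the whole score
                likelihood *= (matches + alpha) / (len(rows) + alpha * 5)
            score = prior * likelihood
            if score > best_score:
                best, best_score = c, score
        return best

    # predict the active user's rating for the target item, given ratings (4, 2, 4)
    print(predict_nb((4, 2, 4), train))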
that rated items, but also because most of today's most successful online recommenders rely on these techniques. Early systems were built using memory-based neighborhood and correlation-based algorithms. Later, more complex
and model-based approaches were developed that, for example, apply techniques from various fields, such as machine learning, information retrieval, and
data mining, and often rely on algebraic methods such as SVD.
In the field of Information Retrieval, the main technology you find is search engines. Recommender
systems are the main technology you find in the field of Information Filtering.
A more appropriate question would therefore be: what is the difference between Information
Retrieval and Information Filtering, or what is the difference between Search Engines and
Recommender Systems?
Traditionally, though, I'd say that information retrieval is more closely linked to search (i.e., retrieving
information given a query), whereas recommendation usually bypasses the query - the IR research
community sometimes calls this query-less search.
Recommender system
Recommender systems or recommendation systems (sometimes replacing "system" with a
synonym such as platform or engine) are a subclass of information filtering system that seek to
predict the "rating" or "preference" that a user would give to an item.[1][2]
Recommender systems have become extremely common in recent years, and are utilized in a
variety of areas: some popular applications include movies, music, news, books, research articles,
search queries, social tags, and products in general. There are also recommender systems for
experts,[3] collaborators,[4] jokes, restaurants, garments, financial services,[5] life insurance,
romantic partners (online dating), and Twitter pages.[6]
Machine learning
Machine learning is the subfield of computer science that "gives computers the ability to learn
without being explicitly programmed" (Arthur Samuel, 1959).[1] Evolved from the study of pattern
recognition and computational learning theory in artificial intelligence,[2] machine learning
explores the study and construction of algorithms that can learn from and make predictions on
data;[3] such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions,[4]:2 through building a model from sample inputs. Machine
learning is employed in a range of computing tasks where designing and programming explicit
algorithms is unfeasible; example applications include spam filtering, optical character recognition
(OCR),[5] search engines and computer vision.
Machine learning is closely related to (and often overlaps with) computational statistics, which also
focuses on prediction-making through the use of computers. It has strong ties to mathematical
optimization, which delivers methods, theory and application domains to the field. Machine
learning is sometimes conflated with data mining,[6] where the latter subfield focuses more on
exploratory data analysis and is known as unsupervised learning.[4]:vii[7]
Within the field of data analytics, machine learning is a method used to devise complex models and
algorithms that lend themselves to prediction; in commercial use, this is known as predictive
analytics. These analytical models allow researchers, data scientists, engineers, and analysts to
"produce reliable, repeatable decisions and results" and uncover "hidden insights" through learning
from historical relationships and trends in the data.[8]
Step 4: Getting Started with Machine Learning in Python

Machine learning fundamentals. Check. Python. Check. Numpy. Check. Pandas. Check.
Matplotlib. Check.
The time has come. Let's start implementing machine learning algorithms with Python's de facto
standard machine learning library, scikit-learn.
Many of the following tutorials and exercises will be driven by the iPython (Jupyter) Notebook,
which is an interactive environment for executing Python. These iPython notebooks can optionally
be viewed online or downloaded and interacted with locally on your own computer.
iPython Notebook Overview from Stanford
Also note that the tutorials below are from a number of online sources. All Notebooks have been
attributed to the authors; if, for some reason, you find that someone has not been properly credited
for their work, please let me know and the situation will be rectified ASAP. In particular, I would
like to tip my hat to Jake VanderPlas, Randal Olson, Donne Martin, Kevin Markham, and Colin
Raffel for their fantastic freely-available resources.
Our first tutorials for getting our feet wet with scikit-learn follow. I suggest doing all of these in
order before moving to the following steps.
A general introduction to scikit-learn, Python's most-used general purpose machine learning library,
covering the k-nearest neighbors algorithm:

An Introduction to scikit-learn by Jake VanderPlas
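As a taste of what the notebook covers, a k-nearest neighbors classifier takes only a few lines in scikit-learn (a minimal sketch on the bundled iris data, not code from the tutorial itself):

    # Minimal k-nearest neighbors example on scikit-learn's bundled iris data.
    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier

    iris = load_iris()
    knn = KNeighborsClassifier(n_neighbors=5)   # classify by majority vote of 5 neighbors
    knn.fit(iris.data, iris.target)
    print(knn.predict([[5.1, 3.5, 1.4, 0.2]]))  # -> class 0 (setosa)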
A more in-depth and expanded introduction, including a starter project with a well-known dataset
from start to finish:

Example Machine Learning Notebook by Randal Olson
A focus on strategies for evaluating different models in scikit-learn, covering train/test dataset
splits:

Model Evaluation by Kevin Markham
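The core pattern the tutorial teaches looks roughly like this (a sketch using the current scikit-learn API; older notebooks may import train_test_split from a different module path):

    # Hold out a test set so the model is scored on examples it never saw.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
    model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print(accuracy_score(y_test, model.predict(X_test)))  # accuracy on unseen data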
Step 5: Machine Learning Topics with Python
With a foundation having been laid in scikit-learn, we can move on to some more in-depth
explorations of the various common, and useful, algorithms. We start with k-means clustering, one
of the most well-known machine learning algorithms. It is a simple and often effective method for
solving unsupervised learning problems:

k-means Clustering by Jake VanderPlas
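For a quick feel of the algorithm before diving into the notebook, here is a minimal k-means run on synthetic blob data (an illustrative sketch, not the tutorial's code):

    # k-means on synthetic data: no labels needed, the algorithm finds the groups.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print(km.cluster_centers_)   # the three learned centroids
    print(km.labels_[:10])       # cluster assignment of the first ten points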
Next, we move back toward classification, and take a look at one of the most historically popular
classification methods:

Decision Trees via The Grimm Scientist
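As a quick sketch of the method (not the tutorial's own code), a depth-limited tree can be fit on the iris data like so:

    # A decision tree learns a hierarchy of if/else splits over the features.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    print(tree.predict([[6.0, 2.9, 4.5, 1.5]]))  # predict the species for one flower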
From classification, we look at continuous numeric prediction:

Linear Regression by Jake VanderPlas
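A minimal regression sketch on invented data (roughly y = x) shows the continuous prediction in action:

    # Ordinary least squares on a tiny synthetic dataset.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1], [2], [3], [4], [5]], dtype=float)
    y = np.array([1.2, 1.9, 3.1, 4.2, 4.8])   # roughly y = x
    reg = LinearRegression().fit(X, y)
    print(reg.coef_, reg.intercept_)            # slope close to 1, intercept near 0
    print(reg.predict([[6.0]]))                 # a continuous numeric prediction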
We can then leverage regression for classification problems, via logistic regression:
Logistic Regression by Kevin Markham
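In miniature (the two-class iris subset is an illustrative choice, not the tutorial's dataset), logistic regression yields class probabilities rather than just labels:

    # Logistic regression: a regression model pressed into classification duty.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)
    X, y = X[y < 2], y[y < 2]               # keep two classes for a binary problem
    clf = LogisticRegression().fit(X, y)
    print(clf.predict_proba(X[:1]))          # class probabilities, not just labels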
Step 6: Advanced Machine Learning Topics with Python
We've gotten our feet wet with scikit-learn, and now we turn our attention to some more advanced
topics. First up are support vector machines, a not-necessarily-linear classifier relying on complex
transformations of data into higher dimensional space.
Support Vector Machines by Jake VanderPlas
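A minimal sketch of a kernelized SVM (the two-moons dataset is an illustrative stand-in for the tutorial's examples) shows the nonlinear classifier at work:

    # An RBF-kernel SVM implicitly maps the data into a higher-dimensional
    # space where the two interleaved "moons" become separable.
    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # not linearly separable
    svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
    print(svm.score(X, y))   # training accuracy of the kernelized classifier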
Next, random forests, an ensemble classifier, are examined via a Kaggle Titanic Competition walkthrough:
Kaggle Titanic Competition (with Random Forests) by Donne Martin
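The ensemble idea in miniature (a generic sketch on iris, leaving out the Titanic data and its feature engineering):

    # A random forest trains many decision trees on random subsets of the
    # data and features; each tree votes, and the ensemble decides.
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    print(forest.score(X_test, y_test))   # held-out accuracy of the ensemble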
Dimensionality reduction is a method for reducing the number of variables being considered in a
problem. Principal Component Analysis is a particular form of unsupervised dimensionality
reduction:
Dimensionality Reduction by Jake VanderPlas
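In scikit-learn terms, PCA looks like this (a minimal sketch on iris, not the notebook's code):

    # PCA projects 4-dimensional iris measurements onto the two directions
    # of greatest variance, reducing the number of variables to consider.
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)
    pca = PCA(n_components=2)
    X2 = pca.fit_transform(X)
    print(X2.shape)                         # (150, 2)
    print(pca.explained_variance_ratio_)    # variance captured by each component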
Before moving on to the final step, we can take a moment to consider that we have come a long
way in a relatively short period of time.
Using Python and its machine learning libraries, we have covered some of the most common and
well-known machine learning algorithms (k-nearest neighbors, k-means clustering, support vector
machines), investigated a powerful ensemble technique (random forests), and examined some
additional machine learning support tasks (dimensionality reduction, model validation techniques).
Along with some foundational machine learning skills, we have started filling a useful toolkit for
ourselves.
We will add one more in-demand tool to that kit before wrapping up.
Neural Networks and Deep Learning by Michael Nielsen
Theano
Theano is the first Python deep learning library we will look at. From the authors:
Theano is a Python library that allows you to define, optimize, and evaluate
mathematical expressions involving multi-dimensional arrays efficiently.
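That capability in miniature (a minimal sketch; the variable names are arbitrary):

    # Define a symbolic expression, let Theano optimize and compile it,
    # then evaluate it on concrete NumPy arrays.
    import numpy as np
    import theano
    import theano.tensor as T

    x = T.dmatrix('x')                       # symbolic matrix of doubles
    y = T.dmatrix('y')
    z = x + y                                # a symbolic expression, not yet computed
    add = theano.function([x, y], z)         # compile the expression graph
    print(add(np.ones((2, 2)), np.ones((2, 2))))  # -> 2x2 matrix of 2.0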
The following introductory tutorial on deep learning in Theano is lengthy, but it is quite good, very
descriptive, and heavily-commented:
Theano Deep Learning Tutorial by Colin Raffel
Caffe
The other library we will test drive is Caffe. Again, from the authors:
Caffe is a deep learning framework made with expression, speed, and modularity in
mind. It is developed by the Berkeley Vision and Learning Center (BVLC) and by
community contributors.
This tutorial is the cherry on the top of this article. While we have undertaken a few interesting
examples above, none likely compete with the following, which is implementing Google's
#DeepDream using Caffe. Enjoy this one! After understanding the tutorial, play around with it to
get your processors dreaming on their own.
Dreaming Deep with Caffe via Google's GitHub
I didn't promise it would be quick or easy, but if you put the time in and follow the above 7 steps,
there is no reason that you won't be able to claim reasonable proficiency and understanding in a
number of machine learning algorithms and their implementation in Python using its popular
libraries, including some of those on the cutting edge of current deep learning research.
Bio: Matthew Mayo is a computer science graduate student currently working on his thesis on
parallelizing machine learning algorithms. He is also a student of data mining, a data enthusiast, and
an aspiring machine learning scientist.