[Figure: a user's documents organized into Academic, Professional, and Culture categories.]
Agenda
[Figure: machine learning builds a user profile from example items such as Foundation, Jurassic Park, The Lost World, and The Difference Engine.]
Personalization
Recommenders are instances of personalization
software.
Personalization concerns adapting to the individual
needs, interests, and preferences of each user.
Machine Learning and
Personalization
Machine learning can be used to learn a model or profile of a particular user from:
Sample interaction
Rated examples
This model or profile can then be used to:
Recommend items
Filter information
Predict behavior
Collaborative Filtering
Maintain a database of many users' ratings of a variety of items.
For a given user, find other, similar users whose ratings strongly correlate with the current user's.
Recommend items rated highly by these similar users but not yet rated by the current user.
Almost all existing commercial recommenders use
this approach (e.g. Amazon).
Collaborative Filtering
[Figure: a user database holds many users' ratings of items A..Z. The active user (A = 9, B = 3, Z = 5, C unrated) is correlation-matched against the database; the closest match (A = 10, B = 4, C = 8, Z = 1) has rated item C highly, so C is extracted as a recommendation.]
Collaborative Filtering Method
Weight all users with respect to similarity with
the active user.
Select a subset of the users (neighbors) to use
as predictors.
Normalize ratings and compute a prediction from a weighted combination of the selected neighbors' ratings.
Present items with highest predicted ratings as
recommendations.
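These steps can be sketched in a few lines of Python; the rating dictionaries, the Pearson-correlation weighting, and the mean-offset normalization below are illustrative choices rather than a prescribed implementation:

```python
import math

def pearson(u, v, ratings):
    """Similarity weight between users u and v over their co-rated items."""
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0
    mean_u = sum(ratings[u].values()) / len(ratings[u])
    mean_v = sum(ratings[v].values()) / len(ratings[v])
    num = sum((ratings[u][i] - mean_u) * (ratings[v][i] - mean_v) for i in common)
    den = (math.sqrt(sum((ratings[u][i] - mean_u) ** 2 for i in common))
           * math.sqrt(sum((ratings[v][i] - mean_v) ** 2 for i in common)))
    return num / den if den else 0.0

def predict(active, item, ratings, k=10):
    """Predict the active user's rating for item from the k most similar neighbors."""
    mean_a = sum(ratings[active].values()) / len(ratings[active])
    # Steps 1-2: weight all other users who rated the item, keep the top-k as neighbors.
    neighbors = sorted(
        ((pearson(active, v, ratings), v) for v in ratings
         if v != active and item in ratings[v]),
        reverse=True)[:k]
    # Step 3: normalize ratings (subtract each neighbor's mean) and combine by weight.
    num = sum(w * (ratings[v][item] - sum(ratings[v].values()) / len(ratings[v]))
              for w, v in neighbors)
    den = sum(abs(w) for w, _ in neighbors)
    return mean_a + num / den if den else mean_a

# Example call (hypothetical data in the spirit of the figure above):
# ratings = {"active": {"A": 9, "B": 3, "Z": 5}, "u1": {"A": 10, "B": 4, "C": 8, "Z": 1}, ...}
# predict("active", "C", ratings)
```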
Types of Collaborative Filtering (CF)
Memory-based approaches (covered)
User-based CF
Item-based CF
Model-based approaches (will be covered later)
Clustering
Advantages of Collaborative Filtering
Simple algorithm
Uses the wisdom of the crowd
Can recommend items outside a user's usual content profile ("out of the box" recommendations)
High-quality recommendations (high accuracy)
Problems with Collaborative Filtering
Cold start: there need to be enough other users already in the system to find a match.
Sparsity: if there are many items to be recommended, then even with many users the user/ratings matrix is sparse, and it is hard to find users who have rated the same items.
First rater: cannot recommend an item that has not previously been rated (a problem for new items).
Popularity bias: cannot recommend items to someone with unique tastes.
Gray sheep: users whose tastes do not consistently match any group, so no neighborhood predicts well for them.
Content-Based Recommending (CBR)
Recommendations are based on information about the content of items rather than on other users' opinions.
Uses a machine learning algorithm to induce a profile of the user's preferences from examples, based on a featural description of content.
Some previous applications:
NewsWeeder (Lang, 1995)
Advantages of CBR
No cold-start or sparsity problems.
Able to recommend to users with unique tastes.
Able to recommend new and unpopular items
No first-rater problem.
Can provide explanations of recommended items by
listing content-features that caused an item to be
recommended.
Disadvantages of CBR
Requires content that can be encoded as meaningful features (a problem for multimedia data).
Users' tastes must be represented as a learnable function of these content features.
Lower-quality recommendations (lower accuracy).
Unable to exploit the quality judgments of other users.
Cannot recommend items outside the user's content profile (no "out of the box" recommendations).
Content-Based Recommending vs. Collaborative Filtering
[Figure: side-by-side comparison of content-based filtering and collaborative filtering.]
Presentations
Will start from Nov 18, 2013.
Keep an eye on email.
Worth 10% of the grade.
Who is crying?
[Figure: rated examples are fed to a machine learning learner, which induces a user profile; a predictor then applies the profile to produce a ranked list of recommendations.]
*Not part of course (only for presentations)
Combining Content and
Collaboration [1]
Content-based and collaborative methods have
complementary strengths and weaknesses.
Combine methods to obtain the best of both.
Various hybrid approaches:
Apply both methods and combine recommendations.
Use collaborative data as content.
Use content-based predictor as another collaborator.
Use content-based predictor to complete collaborative
data.
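A rough sketch of the last variant above (using a content-based predictor to complete the collaborative data); content_predict, the item list, and the dense pseudo-ratings matrix are illustrative assumptions, and predict refers to the user-based CF sketch given earlier:

```python
def content_boosted_ratings(ratings, all_items, content_predict):
    """Fill in each user's missing ratings with a content-based prediction,
    yielding a dense pseudo-ratings matrix for collaborative filtering."""
    pseudo = {}
    for user, rated in ratings.items():
        pseudo[user] = {item: rated.get(item, content_predict(user, item))
                        for item in all_items}
    return pseudo

# Collaborative filtering then runs on the completed matrix, e.g.:
#   dense = content_boosted_ratings(ratings, all_items, content_predict)
#   score = predict("alice", "Jurassic Park", dense, k=10)
```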
[Figure: a movie content database and training examples feed a content-based predictor; its output, together with the active user's ratings, goes into collaborative filtering, which produces the recommendations.]
ROC sensitivity: how well predictions help users select high-quality items (ratings ≥ 4 considered good, ratings < 4 considered bad).
[Figures: bar charts of MAE (roughly 0.9 to 1.06) and ROC-4 sensitivity (roughly 0.58 to 0.68) for the CF, Content, Naive, and CBCF algorithms.]
Information Retrieval (IR)
The indexing and retrieval of textual documents.
Web search is the best-known "killer app."
Concerned firstly with retrieving documents relevant to a query.
Concerned secondly with retrieving from large document collections efficiently.
Typical IR Task
Given:
A corpus of textual natural-language documents.
A user query in the form of a textual string.
Find:
A ranked set of documents that are relevant to the
query.
IR System
[Figure: a query string and a document corpus are input to the IR system, which outputs a ranked list of documents (1. Doc1, 2. Doc2, 3. Doc3, ...).]
Relevance
Relevance is a subjective judgment and may
include:
Being on the proper subject.
Being timely (recent information).
Being authoritative (from a trusted source).
Satisfying the goals of the user and his/her
intended use of the information (information
need).
Keyword Search
Simplest notion of relevance is that the query
string appears verbatim in the document.
Slightly less strict notion is that the words in
the query appear frequently in the document,
in any order (bag of words).
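As a toy illustration of the bag-of-words notion (scoring by summed query-word counts is just one simple choice):

```python
from collections import Counter

def bag_of_words_score(query, document):
    """Count how often the query's words occur in the document, ignoring order."""
    doc_counts = Counter(document.lower().split())
    return sum(doc_counts[w] for w in query.lower().split())

# bag_of_words_score("jurassic park", "the park in jurassic park was open")  -> 3
```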
Problems with Keywords
May not retrieve relevant documents that
include synonymous terms.
restaurant vs. café
PRC vs. China
May retrieve irrelevant documents that
include ambiguous terms.
bat (baseball vs. mammal)
Apple (company vs. fruit)
bit (unit of data vs. act of eating)
Beyond Keywords
We will cover the basics of keyword-based IR, but focus on the basic capabilities of IR rather than advanced extensions.
Intelligent IR
Taking into account the meaning of the words used.
Taking into account the order of words in the query.
Adapting to the user based on direct or indirect
feedback.
Taking into account the authority of the source (the insight Google famously exploited).
Web Search
Application of IR to HTML documents on the
World Wide Web.
Differences:
Must assemble document corpus by spidering
the web.
Documents change uncontrollably.
Can exploit the link structure of the web.
Web Search System
[Figure: a spider crawls the Web to build the document corpus; a query string and the corpus are input to the IR system, which outputs ranked pages (1. Page1, 2. Page2, 3. Page3, ...).]
Other IR-Related Tasks
Automated document categorization
Information filtering (spam filtering)
Information routing
Automated document clustering
Recommending information or products
Information extraction
Information integration
Question answering
Example
Subject: US-TN-SOFTWARE PROGRAMMER
Date: 17 Nov 1996 17:37:29 GMT
Organization: Reference.Com Posting Service
Message-ID: <56nigp$mrs@bilbo.reference.com>
SOFTWARE PROGRAMMER
Steps involved [1]
Tokenization
Stop word removal
Stemming
Indexing
Weighting schemes
Similarity Measure
Vector Space Model
Tokenization
Break text into a sequence of discrete tokens (words).
Punctuation (e-mail), numbers (2009), and case (MobileVCE vs. mobileVCE) can be a meaningful part of a token.
Simplest keyword approach:
Ignore all numbers
Ignore all punctuation
Ignore all case
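A minimal sketch of that simplest approach (the regular expression keeps only letters; everything else is treated as a separator):

```python
import re

def tokenize(text):
    """Lowercase the text, strip punctuation and digits, return word tokens."""
    text = text.lower()                      # ignore case
    text = re.sub(r"[^a-z\s]", " ", text)    # ignore punctuation and numbers
    return text.split()

# tokenize("MobileVCE released 2 updates (e-mail support).")
# -> ['mobilevce', 'released', 'updates', 'e', 'mail', 'support']
```

Note how "e-mail" is broken into "e" and "mail": exactly the kind of information loss the bullet about meaningful punctuation warns against.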
Stop Words
It is typical to exclude high-frequency words (e.g.
function words: a, the, in, to; pronouns: I, he,
she, it).
Language dependent.
Google stop words (
www.ranks.nl/resources/stopwords.html)
Customizable, domain-dependent lists
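Continuing the sketch, stop-word removal is just a filter against such a list; the tiny stop list below is illustrative, not the Google list linked above, and tokenize is the function from the previous sketch:

```python
STOP_WORDS = {"a", "the", "in", "to", "i", "he", "she", "it"}   # illustrative only

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop high-frequency function words and pronouns from a token list."""
    return [t for t in tokens if t not in stop_words]

# remove_stop_words(tokenize("The cat sat in the hat"))  -> ['cat', 'sat', 'hat']
```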
Stemming
Reduce tokens to the root form of words to recognize morphological variation.
computer, computational, and computation are all reduced to the same token, compute.
Correct morphological analysis is language-specific and can be complex.
Stemming strips off known affixes (prefixes and suffixes) in an iterative fashion.
Example: the Porter stemming algorithm.
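For illustration, NLTK ships an implementation of the Porter stemmer (this assumes NLTK is installed); note that Porter reduces this family of words to the truncated stem "comput" rather than the dictionary word "compute":

```python
from nltk.stem import PorterStemmer   # requires: pip install nltk

stemmer = PorterStemmer()
for word in ["computer", "computational", "computation"]:
    print(word, "->", stemmer.stem(word))
# computer -> comput
# computational -> comput
# computation -> comput
```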
Indexing
An inverted index maps each index term to its document frequency (df) and a postings list of (Dj, tfj) pairs:

Index term    df    (Dj, tfj)
computer       3    D7, 4
database       2    D1, 3
science        4    D2, 4
system         1    D5, 2
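A minimal sketch of building such an index from already-tokenized documents (the dictionary layout is an illustrative choice):

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """docs: {doc_id: [tokens]} -> {term: {"df": ..., "postings": [(doc_id, tf), ...]}}"""
    index = defaultdict(lambda: {"df": 0, "postings": []})
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index[term]["df"] += 1
            index[term]["postings"].append((doc_id, tf))
    return dict(index)

# build_inverted_index({"D7": ["computer"] * 4, "D1": ["database"] * 3})
# -> {'computer': {'df': 1, 'postings': [('D7', 4)]},
#     'database': {'df': 1, 'postings': [('D1', 3)]}}
```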
Term Weights: Term Frequency
More frequent terms in a document are more indicative of its topic.
fij = frequency of term i in document j
Term frequency is normalized by the most frequent term in the document: tfij = fij / maxi{fij}
Term Weights: Inverse Document Frequency
Terms that appear in many different documents are less indicative of the overall topic.
dfi = document frequency of term i = number of documents containing term i
idfi = inverse document frequency of term i = log2(N / dfi), where N is the total number of documents
An indication of a term's discrimination power.
TF-IDF Weighting
A typical combined term-importance indicator is tf-idf weighting:
wij = tfij · idfi = tfij · log2(N / dfi)
Similarity Measure
A similarity measure is a function that computes the
degree of similarity between two vectors.
The Vector-Space Model
Assume t distinct terms remain after preprocessing; call them
index terms or the vocabulary.
These orthogonal terms form a vector space.
Dimension = t = |vocabulary|
Each term, i, in a document or query, j, is given a real-valued
weight, wij.
Both documents and queries are expressed as t-dimensional vectors:
dj = (w1j, w2j, ..., wtj)
Graphic Representation
Example:
D1 = 2T1 + 3T2 + 5T3
D2 = 3T1 + 7T2 + T3
Q = 0T1 + 0T2 + 2T3
[Figure: D1, D2, and Q plotted as vectors in the three-dimensional term space with axes T1, T2, T3.]
A collection of n documents can be represented as a term-document matrix of weights:

      T1    T2    ...   Tt
D1    w11   w21   ...   wt1
D2    w12   w22   ...   wt2
 :     :     :           :
Dn    w1n   w2n   ...   wtn
Computing TF-IDF -- An Example
Given a document containing terms with given frequencies:
A(3), B(2), C(1)
Assume collection contains 10,000 documents and
document frequencies of these terms are:
A(50), B(1300), C(250)
Then:
A: tf = 3/3; idf = log2(10000/50) = 7.6; tf-idf = 7.6
B: tf = 2/3; idf = log2 (10000/1300) = 2.9; tf-idf = 2.0
C: tf = 1/3; idf = log2 (10000/250) = 5.3; tf-idf = 1.8
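The same numbers fall out of a direct implementation of the formula (a small sketch assuming the max-normalized tf defined earlier):

```python
import math

def tf_idf(freq, max_freq, df, n_docs):
    """tf-idf with max-normalized tf: (f / max_f) * log2(N / df)."""
    return (freq / max_freq) * math.log2(n_docs / df)

for term, f, df in [("A", 3, 50), ("B", 2, 1300), ("C", 1, 250)]:
    print(term, round(tf_idf(f, 3, df, 10000), 1))
# A 7.6
# B 2.0
# C 1.8
```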
Similarity Measure - Inner Product
Similarity between the vectors for a document dj and a query q can be computed as the vector inner product (a.k.a. dot product):
sim(dj, q) = dj · q = Σ (i = 1 to t) wij · wiq
where wij is the weight of term i in document j and wiq is the weight of term i in the query.
Inner Product -- Examples
Vocabulary (size of vector = size of vocabulary = 7): retrieval, database, architecture, computer, text, management, information
Binary:
D = (1, 1, 1, 0, 1, 1, 0)
Q = (1, 0, 1, 0, 0, 1, 1)
(0 means the corresponding term is not found in the document or query)
sim(D, Q) = 3
Weighted:
D1 = 2T1 + 3T2 + 5T3    D2 = 3T1 + 7T2 + 1T3
Q = 0T1 + 0T2 + 2T3
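In code, both variants are the same one-line dot product (a small sketch; the weighted similarities 10 and 2 follow directly from the vectors above):

```python
def inner_product(d, q):
    """Dot-product similarity between two term-weight vectors."""
    return sum(wd * wq for wd, wq in zip(d, q))

D  = [1, 1, 1, 0, 1, 1, 0]          # binary document vector
Q  = [1, 0, 1, 0, 0, 1, 1]          # binary query vector
print(inner_product(D, Q))          # 3

D1 = [2, 3, 5]; D2 = [3, 7, 1]; Qw = [0, 0, 2]
print(inner_product(D1, Qw), inner_product(D2, Qw))  # 10 2
```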
Cosine Similarity Measure
Cosine similarity measures the cosine of the angle between two vectors: the inner product normalized by the vector lengths.
CosSim(dj, q) = (dj · q) / (|dj| |q|) = Σ (i = 1 to t) wij · wiq / ( sqrt(Σ (i = 1 to t) wij²) · sqrt(Σ (i = 1 to t) wiq²) )
[Figure: D1, D2, and Q drawn as vectors in the term space (t1, t2, t3); each document's similarity to Q is the cosine of the angle between them.]
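Applied to the earlier example vectors (a worked check, not from the original slide): CosSim(D1, Q) = 10 / (sqrt(38) · sqrt(4)) ≈ 0.81 and CosSim(D2, Q) = 2 / (sqrt(59) · sqrt(4)) ≈ 0.13, so D1 is ranked well above D2 once vector lengths are taken into account.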
Naive Implementation of Information Retrieval (Pseudocode)
1. Convert all documents in collection D to tf-idf weighted vectors, dj, for keyword vocabulary V.
2. Convert the query to a tf-idf-weighted vector q.
3. For each dj in D, compute the score sj = cosSim(dj, q).
4. Sort documents by decreasing score.
5. Present the top-ranked documents to the user.
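A runnable sketch of this naive procedure (function and variable names are assumptions; a real system would use an inverted index instead of scoring every document):

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs, vocab):
    """Max-normalized tf-idf vector over a fixed vocabulary."""
    counts = Counter(tokens)
    max_f = max(counts.values()) if counts else 1
    return [(counts[t] / max_f) * math.log2(n_docs / df[t]) if df[t] else 0.0
            for t in vocab]

def cos_sim(a, b):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(docs, query_tokens):
    """docs: {doc_id: [tokens]}. Returns doc ids sorted by decreasing cosine score."""
    vocab = sorted({t for tokens in docs.values() for t in tokens})
    df = {t: sum(t in tokens for tokens in docs.values()) for t in vocab}
    n = len(docs)
    vectors = {d: tfidf_vector(tokens, df, n, vocab) for d, tokens in docs.items()}
    q = tfidf_vector(query_tokens, df, n, vocab)
    scores = {d: cos_sim(v, q) for d, v in vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)

# retrieve({"D1": ["jurassic", "park"], "D2": ["lost", "world"]}, ["jurassic"])
# -> ['D1', 'D2']
```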
Problems with Vector Space Model
Missing semantic information (e.g. word sense).
Missing syntactic information (e.g. phrase structure, word
order, proximity information).
Assumption of term independence (e.g. ignores synonymy).
References
1. https://www.cs.utexas.edu/~ml/publications/area/119/learning_for_recommender_systems/abstracts/
2. http://www.cs.utexas.edu/~mooney/ir-course/