An Overview of Information Retrieval
Nov. 10, 2009
Maryam Karimzadehgan
mkarimz2@illinois.edu
Department of Computer Science
University of Illinois, Urbana-Champaign

• Limitations of Database systems (Motivation for IR systems)
• Information Retrieval
– Indexing
– Similarity Measures
– Evaluation
– Other IR applications
• Web Search
– PageRank Algorithm
• News Recommender system on Facebook
[Figure: IR system diagram; labels: Representation (twice), Comparison Function, Index, Results]
11/10/2009 Introduction to Information Retrieval 9 11/10/2009 Introduction to Information Retrieval 10
Slide is from Jimmy Lin’s tutorial
[Figure: term-count table for the sample text "This is another sample document"; remaining cell values not recoverable]
• Stop words are words such as:
– a, about, above, according, across, after, afterwards, again, against, albeit, all, almost, alone, already, also, although, always, among, as, at
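As a rough illustration of how a stop list like the one above is applied before indexing, the following minimal sketch filters stop words out of a tokenized text. The function name `remove_stop_words`, the whitespace tokenizer, and the sample sentence are assumptions made for this example; the stop list contains only the words shown on the slide.

```python
# Stop list copied from the slide (the slide itself shows only a partial list).
STOP_WORDS = {
    "a", "about", "above", "according", "across", "after", "afterwards",
    "again", "against", "albeit", "all", "almost", "alone", "already",
    "also", "although", "always", "among", "as", "at",
}


def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]


# Hypothetical sample sentence, used only to exercise the function.
filtered = remove_stop_words("All documents are indexed again after a crawl")
print(filtered)  # ['documents', 'are', 'indexed', 'crawl']
```

A real indexer would usually also strip punctuation and use a much longer stop list; both are left out here to keep the sketch short.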
• Disadvantages
– No notion of ranking (exact matching only)
– All index terms have equal weight
…
[Figure: document language model for a "Food nutrition paper" with unknown word probabilities: food ?, nutrition ?, healthy ?, diet ?, …]
Which model would most likely have generated this query?
• Retrieval problem ≈ Estimation of p(wi|d)
• Smoothing is an important issue, and distinguishes different approaches
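The idea of ranking documents by how likely their language model is to have generated the query can be sketched as follows. The slide does not say which smoothing method is meant; this example assumes linear interpolation with a collection model (often called Jelinek-Mercer smoothing) as one common choice. The function name, the weight `lam`, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """Return log p(query | doc) under a smoothed unigram model.

    p(w | d) is estimated as (1 - lam) * p_ml(w | d) + lam * p(w | collection),
    which is an assumed smoothing scheme, not one named in the slides.
    """
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    score = 0.0
    for word in query:
        p_doc = doc_counts[word] / len(doc) if doc else 0.0
        p_coll = coll_counts[word] / len(collection)
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # Word unseen even in the collection: the query cannot be generated.
            return float("-inf")
        score += math.log(p)
    return score


# Toy data, invented for illustration only.
food_doc = "food nutrition healthy diet food".split()
mining_doc = "text mining data algorithms text".split()
collection = food_doc + mining_doc
query = "healthy food".split()

food_score = query_log_likelihood(query, food_doc, collection)
mining_score = query_log_likelihood(query, mining_doc, collection)
```

Without smoothing, the second document would assign probability zero to the query because it contains neither query word. With smoothing it still receives a finite score, only a lower one than the first document.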
Slide is from ChengXiang Zhai’s CS410
– Coverage of information
– Form of presentation
– Effort required/ease of use
– Time and space efficiency
– Recall
• proportion of relevant material actually retrieved
• Recall = |RelRetrieved| / |Rel in Collection|
– Precision
• proportion of retrieved material actually relevant
• Precision = |RelRetrieved| / |Retrieved|
[Figure: diagram showing the Relevant set within All docs]
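The two measures can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below is a minimal illustration; the function name and the document identifiers are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) from two collections of document ids."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    rel_retrieved = retrieved & relevant  # relevant documents that were retrieved
    precision = len(rel_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(rel_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical run: 4 documents retrieved, 3 relevant in the collection,
# 2 of the relevant ones were retrieved.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
print(precision)  # 0.5
```

Here precision is 2/4 and recall is 2/3. Returning 0.0 for an empty retrieved or relevant set is a convention chosen for this sketch.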
• Intuitions
– Links are like citations in literature
– A page that is cited often can be expected to be more useful in general
– Consider “indirect citations” (being cited by a highly cited paper counts a lot…)
– Smoothing of citations (every page is assumed to have a non-zero citation count)
[Figure: link diagram; labels: Description (“anchor text”), Hub, Authority]
Slide is from ChengXiang Zhai’s CS410
The PageRank Algorithm (Page et al. 98)

Random surfing model: At any page,
With prob. α, randomly jumping to a page
With prob. (1−α), randomly picking a link to follow.

M_ij = probability of going from d_i to d_j (“Transition matrix”)
N = # pages
I_ij = 1/N
p(d_i): PageRank score (average probability of visiting page d_i)

\[
p(d_j) = \sum_{i=1}^{N} \left[ \frac{1}{N}\alpha + (1-\alpha) M_{ij} \right] p(d_i)
\qquad
\vec{p} = \left( \alpha I + (1-\alpha) M \right)^{T} \vec{p}
\]

[Figure: link graph over pages d1, d2, d3, d4]

\[
M = \begin{bmatrix}
0 & 0 & 1/2 & 1/2 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
1/2 & 1/2 & 0 & 0
\end{bmatrix}
\]

PageRank: Example

\[
A = (1-0.2)\,M + 0.2\,I
= 0.8 \times
\begin{bmatrix}
0 & 0 & 1/2 & 1/2 \\
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
1/2 & 1/2 & 0 & 0
\end{bmatrix}
+ 0.2 \times
\begin{bmatrix}
1/4 & 1/4 & 1/4 & 1/4 \\
1/4 & 1/4 & 1/4 & 1/4 \\
1/4 & 1/4 & 1/4 & 1/4 \\
1/4 & 1/4 & 1/4 & 1/4
\end{bmatrix}
\]

\[
\begin{bmatrix} p^{n+1}(d_1) \\ p^{n+1}(d_2) \\ p^{n+1}(d_3) \\ p^{n+1}(d_4) \end{bmatrix}
= A^{T}
\begin{bmatrix} p^{n}(d_1) \\ p^{n}(d_2) \\ p^{n}(d_3) \\ p^{n}(d_4) \end{bmatrix}
=
\begin{bmatrix}
0.05 & 0.85 & 0.05 & 0.45 \\
0.05 & 0.05 & 0.85 & 0.45 \\
0.45 & 0.05 & 0.05 & 0.05 \\
0.45 & 0.05 & 0.05 & 0.05
\end{bmatrix}
\times
\begin{bmatrix} p^{n}(d_1) \\ p^{n}(d_2) \\ p^{n}(d_3) \\ p^{n}(d_4) \end{bmatrix}
\]

Initial value p(d)=1/N, iterate until converge
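The iteration can be sketched in a few lines of plain Python. This is a minimal illustration using the four-page transition matrix and α = 0.2 from the example; the function name, the convergence tolerance, and the iteration cap are assumptions, and a production implementation would use sparse matrices and handle pages without outgoing links.

```python
def pagerank(M, alpha=0.2, tol=1e-10, max_iter=1000):
    """Power iteration for p(d_j) = sum_i [alpha/N + (1 - alpha) * M[i][j]] * p(d_i)."""
    n = len(M)
    p = [1.0 / n] * n  # initial value p(d) = 1/N
    for _ in range(max_iter):
        new_p = [
            sum((alpha / n + (1 - alpha) * M[i][j]) * p[i] for i in range(n))
            for j in range(n)
        ]
        # Stop once no score changes by more than the tolerance.
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p


# Transition matrix from the example: M[i][j] is the probability of
# going from page d(i+1) to page d(j+1).
M = [
    [0.0, 0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]

scores = pagerank(M)
for name, score in zip(["d1", "d2", "d3", "d4"], scores):
    print(name, round(score, 4))
```

The scores sum to 1 because every row of M sums to 1. In this graph d3 and d4 have identical incoming links (each is reached only from d1), so they end up with the same score.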
Slide is from ChengXiang Zhai’s CS410
Applications

Examples of Text Management Applications
• Search
– Web search engines (Google, Yahoo, …)
– Library systems
– Literature/movie recommender
• Categorization
– Automatically sorting emails
– …
• Mining/Extraction
– Discovering major complaints from email in customer service
– Business intelligence
– Bioinformatics
– …
Sample Applications
1) Text Categorization
[Figure: a document drawn as a sequence of sentences (sentence 1, sentence 2, sentence 3, …); labels: "in each segment", "Doc vector", "vector n-1", "vector n", "similarity"]
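The figure labels refer to a similarity between neighbouring vectors. The slides do not preserve which similarity measure is used, so the sketch below assumes cosine similarity over raw term-frequency vectors, a common choice in IR. The function name and the sample sentences are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the term-frequency vectors of two texts."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for empty texts in this sketch
    return dot / (norm_a * norm_b)


# Hypothetical neighbouring segments.
sim_same_topic = cosine_similarity("web search ranking", "ranking web pages")
sim_disjoint = cosine_similarity("web search ranking", "healthy diet food")
```

Segments that share vocabulary score closer to 1, and segments with no terms in common score 0.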
Slide is from ChengXiang Zhai’s CS410
• Can be combined
Collaborative filtering is also called
“Recommender Systems”
Slide is from ChengXiang Zhai’s CS410
System Architecture
[Figure: architecture diagram; labels: Crawler, Indexer, Index, Date-wise Index, Meta Data (RDBMS), Register, Community, Query, Clustering, Facebook Application, Newsletter Application, (RDBMS), Main page]
Collaborative User Feedback
• Three kinds of user feedback captured
– Clickthroughs
– Explicit Ratings
– Inter-person recommendations
• They are linearly combined as follows:

Demo
• News Recommender on Facebook
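The combination formula itself is not preserved in this copy of the slides. Purely as an illustration of what a linear combination of the three feedback signals could look like, the sketch below uses hypothetical weights and assumes each signal has already been normalized to the range 0 to 1; neither the weights nor the normalization come from the source.

```python
def combined_feedback_score(clickthrough, rating, recommendation,
                            weights=(0.5, 0.3, 0.2)):
    """Weighted sum of three feedback signals.

    The weights are placeholders chosen for this example, not the values
    used by the application described in the slides.
    """
    w_click, w_rating, w_rec = weights
    return w_click * clickthrough + w_rating * rating + w_rec * recommendation


# Hypothetical article: clicked often, rated moderately, never recommended.
score = combined_feedback_score(clickthrough=0.8, rating=0.5, recommendation=0.0)
```

With these placeholder weights the score is 0.5 × 0.8 + 0.3 × 0.5 + 0.2 × 0.0 = 0.55.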
Application Information
• For more information about the application:
– http://sifaka.cs.uiuc.edu/ir/proj/rec/
• http://apps.facebook.com/news_letters/

• We are looking for motivated students to work on this application.
• Requirements:
– Database knowledge
– PHP
– Perl
• Contact me if you are interested:
– mkarimz2@illinois.edu