
An Overview of Information Retrieval

Maryam Karimzadehgan
mkarimz2@illinois.edu
Department of Computer Science
University of Illinois, Urbana-Champaign
Nov. 10, 2009

Outline
• Limitations of database systems (motivation for IR systems)
• Information Retrieval
  – Indexing
  – Similarity measures
  – Evaluation
  – Other IR applications
• Web search
  – PageRank algorithm
• News recommender system on Facebook

A (Simple) Database Example

Student Table
  Student ID | Last Name     | First Name | Department ID | email
  1          | Karimzadehgan | Maryam     | CS            | mkarimz2@uiuc.edu
  2          | Peters        | Jordan     | EE            | kj@uiuc.edu
  3          | Smith         | Chris      | CE            | sc@uiuc.edu
  4          | Smith         | John       | CLIS          | sj@uiuc.edu

Department Table
  Department ID | Department
  EE            | Electrical Engineering
  CE            | Computer Engineering
  CLIS          | Information Studies

Course Table
  Course ID | Course Name
  lbsc690   | Information Technology
  ee750     | Communication
  ce098     | Computer Architecture

Enrollment Table
  Student ID | Course ID | Grade
  1          | lbsc690   | 90
  1          | ee750     | 95
  2          | lbsc690   | 95
  2          | hist405   | 80
  3          | hist405   | 90
  4          | lbsc690   | 98

Databases vs. IR
• Format of data:
  – DB: structured data; clear semantics based on a formal model.
  – IR: mostly unstructured free text.
• Queries:
  – DB: formal (like SQL).
  – IR: often expressed in natural language (keyword search).
• Result:
  – DB: exact results.
  – IR: sometimes relevant, often not.
Short vs. Long Term Info Need
• Short-term information need (ad hoc retrieval)
  – "Temporary need"
  – Information source is relatively static
  – User "pulls" information
  – Example applications: library search, Web search
• Long-term information need (filtering)
  – "Stable need", e.g., new data mining algorithms
  – Information source is dynamic
  – System "pushes" information to the user
  – Example application: news filters

What is Information Retrieval?
• Goal: find the documents most relevant to a certain query (information need).
• Dealing with the notions of:
  – A collection of documents
  – A query (the user's information need)
  – Relevancy

What Types of Information?
• Text (documents)
• XML and structured documents
• Images
• Audio (sound effects, songs, etc.)
• Video
• Source code
• Applications/Web services

The Information Retrieval Cycle
• Source selection (choose a resource), query formulation (pose a query), search (get a ranked list), selection (pick documents), and examination of results, with query reformulation and relevance feedback closing the loop back to the query.
(Slide is from Jimmy Lin's tutorial.)
The IR Black Box
• Viewed from outside: a query and a collection of documents go in, and results come out.
(Slide is from Jimmy Lin's tutorial.)

Inside the IR Black Box
• The query and the documents each pass through a representation step (query representation, document representation); the document representations are stored in an index, and a comparison function matches the query representation against the index to produce the results.
(Slide is from Jimmy Lin's tutorial.)

Typical IR System Architecture
• Documents are tokenized and indexed into the document representation (the index); the user's query becomes a query representation; a scorer matches the two to produce results, and user judgments feed back into the query representation.
(Slide is from ChengXiang Zhai's CS410.)

1) Indexing
• Making it easier to match a query with a document.
• Query and document should be represented using the same units/terms.
• Bag-of-words representation: the document "This is a document in information retrieval" is represented by the index terms document, information, retrieval, is, this.
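A minimal sketch of this bag-of-words step; the tokenization rule here (lowercasing and splitting on non-letters) is an illustrative assumption, not the slide's exact tokenizer:

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count term frequencies."""
    terms = re.findall(r"[a-z]+", text.lower())
    return Counter(terms)

print(bag_of_words("This is a document in information retrieval"))
# Counter mapping each term (this, is, a, document, in, information, retrieval) to its
# frequency; the slide additionally drops very common words such as "a" and "in".
```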
What is a good indexing term?
• Specific (phrases) or general (single words)?
• Words with middle frequency are most useful:
  – Not too specific (low utility, but still useful!)
  – Not too general (lack of discrimination, e.g., stop words)
  – Stop-word removal is common, but rare words are kept in modern search engines.
• Stop words are words such as: a, about, above, according, across, after, afterwards, again, against, albeit, all, almost, alone, already, also, although, always, among, as, at.

Inverted Index
• Doc 1: "This is a sample document with one sample sentence."
• Doc 2: "This is another sample document."
• The dictionary stores, for each term, the number of documents containing it and its total frequency; the postings list stores the (doc id, freq) pairs:

  Term    | # docs | Total freq | Postings (doc id, freq)
  this    | 2      | 2          | (1, 1), (2, 1)
  is      | 2      | 2          | (1, 1), (2, 1)
  sample  | 2      | 3          | (1, 2), (2, 1)
  another | 1      | 1          | (2, 1)
  …       | …      | …          | …
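A minimal sketch of how such an inverted index could be built; the term-splitting rule is again an illustrative assumption:

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict doc_id -> text. Returns term -> postings list of (doc_id, freq)."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            postings[term][doc_id] = postings[term].get(doc_id, 0) + 1
    # The dictionary statistics (# docs, total freq) can be derived from the postings.
    return {term: sorted(p.items()) for term, p in postings.items()}

docs = {1: "This is a sample document with one sample sentence",
        2: "This is another sample document"}
index = build_inverted_index(docs)
print(index["sample"])   # [(1, 2), (2, 1)] -> appears in 2 docs, total frequency 3
```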

2) Tokenization/Stemming
• Stemming: mapping all inflectional forms of words to the same root form, e.g.
  – computer -> compute
  – computation -> compute
  – computing -> compute
• Porter's stemmer is popular for English.

3) Relevance Feedback
• The query is sent to the retrieval engine, which returns scored results from the document collection (d1 3.5, d2 2.4, …, dk 0.5, …); the user judges documents as relevant or non-relevant (d1 +, d2 -, d3 +, …, dk -), and the feedback component uses these judgments to produce an updated query.
(Slide is from ChengXiang Zhai's CS410.)
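A minimal sketch of the stemming step from the slide above, assuming the NLTK package is available; note that the actual Porter stemmer maps these words to the truncated stem "comput" rather than the dictionary form "compute":

```python
from nltk.stem import PorterStemmer   # assumes nltk is installed

stemmer = PorterStemmer()
for word in ["computer", "computation", "computing"]:
    print(word, "->", stemmer.stem(word))   # all three map to the same stem
```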
4) Scorer/Similarity Methods
1) Boolean model
2) Vector-space model
3) Probabilistic model
4) Language model

Boolean Model
• Each index term is either present or absent.
• Documents are either relevant or not relevant (no ranking).
• Advantages:
  – Simple
• Disadvantages:
  – No notion of ranking (exact matching only)
  – All index terms have equal weight
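A minimal sketch of Boolean retrieval over an inverted index, assuming the query is a simple conjunction of terms:

```python
def boolean_and(index, query_terms):
    """Return the set of doc ids containing ALL query terms (exact matching, no ranking)."""
    result = None
    for term in query_terms:
        docs_with_term = {doc_id for doc_id, _ in index.get(term, [])}
        result = docs_with_term if result is None else result & docs_with_term
    return result or set()

# Postings in the shape of the inverted-index example: term -> [(doc_id, freq), ...]
index = {"sample": [(1, 2), (2, 1)], "sentence": [(1, 1)], "another": [(2, 1)]}
print(boolean_and(index, ["sample", "sentence"]))   # {1}
print(boolean_and(index, ["sample", "another"]))    # {2}
```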

Vector Space Model
• Query and documents are represented as vectors of index terms.
• Similarity is calculated using the COSINE similarity between the two vectors.
  – Documents are ranked according to similarity.

TF-IDF in the Vector Space Model
• The weight of term i in document k is

  tfidf_{i,k} = f_{i,k} * log(N / df_i)

  where f_{i,k} is the TF part (frequency of term i in document k) and log(N / df_i) is the IDF part (N is the number of documents, df_i the number of documents containing term i).
• IDF: a term is more discriminative if it occurs in fewer documents.
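A minimal sketch of TF-IDF weighting and cosine ranking over a toy collection; the whitespace tokenization and the hand-built query vector are illustrative assumptions:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: dict doc_id -> list of terms. Returns doc_id -> {term: tf-idf weight}."""
    N = len(docs)
    df = Counter(t for terms in docs.values() for t in set(terms))
    return {d: {t: f * math.log(N / df[t]) for t, f in Counter(terms).items()}
            for d, terms in docs.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = {1: "this is a sample document with one sample sentence".split(),
        2: "this is another sample document".split()}
vecs = tfidf_vectors(docs)
query = {"sample": 1.0, "sentence": 1.0}   # toy query vector
print(sorted(vecs, key=lambda d: cosine(query, vecs[d]), reverse=True))
# [1, 2]: doc 1 ranks above doc 2 for this query
```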


Language Models for Retrieval
• Each document is modeled by a document language model: a text mining paper assigns high probabilities to words such as text, mining, association, clustering, … and a low probability to food, while a food nutrition paper assigns high probabilities to food, nutrition, healthy, diet, ….
• Given the query "data mining algorithms": which model would most likely have generated this query?
(Slide is from ChengXiang Zhai's CS410.)

Retrieval as Language Model Estimation
• Document ranking based on query likelihood:

  log p(q | d) = Σ_i log p(w_i | d),   where q = w_1 w_2 … w_n and p(· | d) is the document language model.

• Retrieval problem ≈ estimation of p(w_i | d).
• Smoothing is an important issue, and distinguishes different approaches.
(Slide is from ChengXiang Zhai's CS410.)
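A minimal sketch of query-likelihood scoring; the slide only says that smoothing matters, so the choice of Jelinek-Mercer (linear interpolation) smoothing and the value of λ here are illustrative assumptions:

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_terms, lam=0.5):
    """log p(q|d) with Jelinek-Mercer smoothing:
    p(w|d) = (1 - lam) * p_ml(w|d) + lam * p(w|collection)."""
    doc = Counter(doc_terms)
    collection = Counter(collection_terms)
    score = 0.0
    for w in query_terms:
        p_doc = doc[w] / len(doc_terms)
        p_coll = collection[w] / len(collection_terms)   # assumed > 0 for every query word
        score += math.log((1 - lam) * p_doc + lam * p_coll)
    return score

doc_text = "text mining association clustering paper".split()
doc_food = "food nutrition healthy diet paper".split()
collection = doc_text + doc_food
query = "mining paper".split()
print(query_likelihood(query, doc_text, collection))   # higher (less negative) score
print(query_likelihood(query, doc_food, collection))
```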

Information Retrieval Evaluation
– Coverage of information
– Form of presentation
– Effort required / ease of use
– Time and space efficiency
– Recall
  • proportion of relevant material actually retrieved
– Precision
  • proportion of retrieved material actually relevant

Precision vs. Recall
• Relative to all documents in the collection, the retrieved set and the relevant set overlap in the relevant-retrieved documents:

  Recall = |RelRetrieved| / |Rel in Collection|
  Precision = |RelRetrieved| / |Retrieved|
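A minimal sketch of computing these two measures from a retrieved set and a relevant set of document ids:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a retrieved set against a relevant set."""
    rel_retrieved = retrieved & relevant
    precision = len(rel_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(rel_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall

print(precision_recall({1, 2, 3, 4}, {2, 4, 5, 6}))   # (0.5, 0.5)
```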


Web Search – Google PageRank Algorithm

Characteristics of Web Information
• "Infinite" size
  – Static HTML pages
  – Dynamically generated HTML pages (database-backed)
• Semi-structured
  – Structured = HTML tags, hyperlinks, etc.
  – Unstructured = text
• Different formats (PDF, Word, PS, …)
• Multi-media (textual, audio, images, …)
• High variance in quality (a lot of junk)

Exploiting Inter-Document Links
• Anchor text serves as a description ("extra text"/summary) for the document it points to.
• Links indicate the utility of a document.
• Pages can act as hubs (pointing to many useful pages) or authorities (pointed to by many pages).
(Slide is from ChengXiang Zhai's CS410.)

PageRank: Capturing Page "Popularity" [Page & Brin 98]
• Intuitions:
  – Links are like citations in the literature.
  – A page that is cited often can be expected to be more useful in general.
  – Consider "indirect citations" (being cited by a highly cited page counts a lot…).
  – Smoothing of citations (every page is assumed to have a non-zero citation count).
The PageRank Algorithm (Page et al. 98)
• Random surfing model: at any page,
  – with probability α, the surfer randomly jumps to any page;
  – with probability (1 − α), the surfer randomly picks a link on the current page to follow.
• Notation: N = number of pages; M_ij = probability of going from d_i to d_j (the "transition matrix"); I_ij = 1/N; p(d_i) = PageRank score of d_i (the average probability of visiting page d_i).
• The scores satisfy

  p(d_j) = Σ_{i=1..N} [ (1/N) α + (1 − α) M_ij ] p(d_i),   i.e.   p = (αI + (1 − α) M)^T p

• Example graph: d1 → d3, d4; d2 → d1; d3 → d2; d4 → d1, d2, so

  M = [ 0    0    1/2  1/2
        1    0    0    0
        0    1    0    0
        1/2  1/2  0    0  ]

PageRank: Example
• With α = 0.2:

  A = (1 − 0.2) M + 0.2 I = 0.8 × M + 0.2 × [matrix with 1/4 in every entry]

• The iteration p^{n+1} = A^T p^n becomes

  [ p^{n+1}(d1) ]   [ 0.05  0.85  0.05  0.45 ]   [ p^n(d1) ]
  [ p^{n+1}(d2) ] = [ 0.05  0.05  0.85  0.45 ] × [ p^n(d2) ]
  [ p^{n+1}(d3) ]   [ 0.45  0.05  0.05  0.05 ]   [ p^n(d3) ]
  [ p^{n+1}(d4) ]   [ 0.45  0.05  0.05  0.05 ]   [ p^n(d4) ]

• Initial value p(d) = 1/N; iterate until convergence.
(Slide is from ChengXiang Zhai's CS410.)
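A minimal sketch of this iteration on the four-page example above, using NumPy; the convergence tolerance is an illustrative assumption:

```python
import numpy as np

def pagerank(M, alpha=0.2, tol=1e-8):
    """Iterate p = ((alpha/N) * ones + (1 - alpha) * M)^T p until convergence."""
    N = M.shape[0]
    A = (1 - alpha) * M + alpha * np.full((N, N), 1.0 / N)
    p = np.full(N, 1.0 / N)          # initial value p(d) = 1/N
    while True:
        p_next = A.T @ p
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next

M = np.array([[0.0, 0.0, 0.5, 0.5],   # d1 -> d3, d4
              [1.0, 0.0, 0.0, 0.0],   # d2 -> d1
              [0.0, 1.0, 0.0, 0.0],   # d3 -> d2
              [0.5, 0.5, 0.0, 0.0]])  # d4 -> d1, d2
print(pagerank(M))   # PageRank scores for d1..d4; they sum to 1
```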

Beyond Just Search – Information Retrieval Applications

Examples of Text Management Applications
• Search
  – Web search engines (Google, Yahoo, …)
  – Library systems
  – …
• Filtering
  – News filters
  – Spam email filters
  – Literature/movie recommenders
• Categorization
  – Automatically sorting emails
  – …
• Mining/Extraction
  – Discovering major complaints from email in customer service
  – Business intelligence
  – Bioinformatics
  – …
Sample Applications
1) Text Categorization
2) Document/Term Clustering
3) Text Summarization
4) Filtering

1) Text Categorization
• Pre-given categories and labeled document examples (categories may form a hierarchy).
• Classify new documents.
• A standard supervised learning problem: a categorization system assigns incoming documents to categories such as Sports, Business, Education, Science, ….
(Slide is from ChengXiang Zhai's CS410.)

K-Nearest Neighbor Classifier
• Keep all training examples.
• Find the k examples that are most similar to the new document ("neighbor" documents).
• Assign the category that is most common among these neighbor documents (the neighbors vote for the category).
• Can be improved by considering the distance of a neighbor (a closer neighbor has more influence).
• Technical elements ("retrieval techniques"):
  – Document representation
  – Document distance measure
(Slide is from ChengXiang Zhai's CS410.)

Example of K-NN Classifier
• With k = 1 the new document takes the category of its single nearest neighbor; with k = 4 the four nearest neighbors vote on the category.
(Slide is from ChengXiang Zhai's CS410.)
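A minimal sketch of a k-NN text classifier; using raw term counts as the document representation and cosine similarity as the distance measure, plus the toy training set, are illustrative assumptions:

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(new_doc, training, k=4):
    """training: list of (term-count Counter, label); the k nearest neighbors vote."""
    vec = Counter(new_doc.lower().split())
    neighbors = sorted(training, key=lambda ex: cosine(vec, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training = [(Counter("the team won the game".split()), "Sports"),
            (Counter("stock prices fell sharply".split()), "Business"),
            (Counter("players scored in the final game".split()), "Sports")]
print(knn_classify("the game was won by the home team", training, k=3))   # Sports
```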
Examples of Text Categorization
• News article classification
• Meta-data annotation
• Automatic email sorting
• Web page classification
(Slide is from ChengXiang Zhai's CS410.)

2) The Clustering Problem
• Group similar objects together.
• An object can be a document, a term, or a passage.
• Example: grouping a set of points into a few clusters.

Similarity-based Clustering
• Define a similarity function to measure the similarity between two objects.
• Gradually group similar objects together in a bottom-up fashion.
• Stop when some stopping criterion is met.

Examples of Doc/Term Clustering
• Clustering of retrieval results
• Clustering of documents in the whole collection
• Term clustering to define a "concept" or "theme"
(Slide is from ChengXiang Zhai's CS410.)
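A minimal sketch of the bottom-up clustering just described; the single-link merge rule and the fixed similarity threshold used as the stopping criterion are illustrative assumptions:

```python
def agglomerative_cluster(objects, similarity, threshold=0.5):
    """Bottom-up clustering: repeatedly merge the most similar pair of clusters
    (single link) until no pair is more similar than the threshold."""
    clusters = [[o] for o in objects]
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        link = lambda i, j: max(similarity(a, b) for a in clusters[i] for b in clusters[j])
        i, j = max(pairs, key=lambda ij: link(*ij))
        if link(i, j) < threshold:        # stopping criterion
            break
        clusters[i] += clusters.pop(j)
    return clusters

# Toy usage: cluster numbers by closeness.
print(agglomerative_cluster([1, 2, 10, 11], lambda a, b: 1.0 / (1 + abs(a - b)), threshold=0.4))
# [[1, 2], [10, 11]]
```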


3) Summarization - Simple Discourse Analysis

A Simple Summarization Method
• Represent each sentence of the document as a vector (vector 1, vector 2, …, vector n) and compute the similarity of each sentence vector to the overall document vector.
• The most similar sentence in each segment is selected, and the selected sentences together form the summary.
(Slide is from ChengXiang Zhai's CS410.)
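A minimal sketch of this sentence-selection idea; representing sentences by raw term counts and passing in pre-segmented sentences are illustrative assumptions:

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def summarize(segments):
    """segments: list of lists of sentences. Keep the sentence most similar
    to the overall document vector in each segment."""
    doc_vec = Counter(w for seg in segments for s in seg for w in s.lower().split())
    return [max(seg, key=lambda s: cosine(Counter(s.lower().split()), doc_vec))
            for seg in segments]
```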

Examples of Summarization
• News summaries
• Summarizing retrieval results
  – Single-document summaries
  – Multi-document summaries

4) Filtering
• Content-based filtering (adaptive filtering)
• Collaborative filtering (recommender systems)


Examples of Information Filtering
• News filtering
• Email filtering
• Movie/book recommenders such as Amazon.com
• Literature recommenders

Content-based Filtering vs. Collaborative Filtering
• Basic filtering question: will user U like item X?
• Two different ways of answering it:
  – Look at what U likes => characterize X => content-based filtering
  – Look at who likes X => characterize U => collaborative filtering
• The two can be combined.
• Collaborative filtering is also called "recommender systems".
(Slide is from ChengXiang Zhai's CS410.)

Adaptive Information Filtering
• Stable and long-term interest ("my interest: …").
• The system must make a delivery decision immediately as each document "arrives" at the filtering system.

Collaborative Filtering
• Making filtering decisions for an individual user based on the judgments of other users.
• Inferring an individual's interests/preferences from those of other, similar users.
• General idea (a sketch follows this slide):
  – Given a user u, find similar users {u1, …, um}
  – Predict u's preferences based on the preferences of u1, …, um
(Slide is from ChengXiang Zhai's CS410.)
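A minimal sketch of this user-based approach on a small rating matrix; the use of cosine similarity between rating vectors, the similarity-weighted average, and the toy ratings are illustrative assumptions:

```python
import math

def sim(u, v):
    """Cosine similarity between two users' rating dicts over their co-rated items."""
    common = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(r * r for r in u.values()))
    nv = math.sqrt(sum(r * r for r in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def predict(ratings, user, item):
    """Similarity-weighted average of the other users' ratings for the item."""
    num = den = 0.0
    for other, r in ratings.items():
        if other != user and item in r:
            s = sim(ratings[user], r)
            num += s * r[item]
            den += s
    return num / den if den else None

ratings = {"u1": {"o1": 3.0, "o2": 1.5, "on": 2.0},
           "u2": {"o1": 2.0, "o2": 1.0},
           "ui": {"o1": 1.0}}
print(predict(ratings, "ui", "o2"))   # predicted rating for the missing cell
```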
Collaborative Filtering: Assumptions
• Users with a common interest will have similar preferences.
• Users with similar preferences probably share the same interest.
  – Examples: "interest is IR" => "favors SIGIR papers"; "favors SIGIR papers" => "interest is IR".
• A sufficiently large number of user preferences is available.

Collaborative Filtering: Intuitions
• User similarity (users X and Y):
  – If X liked the movie, Y will like the movie.
• Item similarity:
  – Since 90% of those who liked Star Wars also liked Independence Day, and you liked Star Wars, you may also like Independence Day.

A Formal Framework for Rating
• Objects: O = {o1, o2, …, oj, …, on}; users: U = {u1, u2, …, ui, …, um}.
• An unknown function f: U × O → R assigns a rating X_ij = f(u_i, o_j) to each user-object pair; the slide's matrix shows a few known ratings (values such as 3, 1.5, 2, 1) with the rest marked "?".
• The task:
  – Assume f values are known for some (u, o) pairs.
  – Predict f values for the other (u, o) pairs.
  – Essentially function approximation, like other learning problems.
(Slide is from ChengXiang Zhai's CS410.)

News Recommendation on Facebook
http://sifaka.cs.uiuc.edu/ir/proj/rec/
Motivation
• A newsletter organizer collects news from the Web (www) and delivers it to interested users.

Facebook as a Medium for Recommendations
• Provides a great platform with built-in social networks.
• More than 120 million users log on to Facebook at least once each day.
• More than 95% of users have used at least one application built on the Facebook Platform.
• It is possible to build applications that deeply integrate into a user's Facebook experience:
  – FBML (Facebook Markup Language)
  – FBJS (Facebook JavaScript)
  – FQL (Facebook Query Language)
  – Facebook API

System Architecture
• A crawler stores article metadata in an RDBMS, and an indexer builds a date-wise index over the collected articles.
• The Facebook Newsletter application (backed by its own RDBMS) handles user registration and the community, queries the index, and applies clustering to build the application's main page.
Collaborative User Feedback
• Three kinds of user feedback are captured:
  – Clickthroughs
  – Explicit ratings
  – Inter-person recommendations
• They are linearly combined into a single score F_ij, which aggregates all kinds of feedback for article a_j from user u_i.

Demo
• News Recommender on Facebook
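The combination formula itself is not reproduced in the extracted text; a minimal sketch of one possible weighted linear combination, with purely hypothetical weights, is:

```python
# Hypothetical weights; the actual coefficients used by the application are not
# given in the extracted slide text.
W_CLICK, W_RATING, W_RECOMMEND = 0.2, 0.5, 0.3

def combined_feedback(click, rating, recommendation):
    """F_ij: aggregate feedback for article a_j from user u_i as a weighted sum."""
    return W_CLICK * click + W_RATING * rating + W_RECOMMEND * recommendation

print(combined_feedback(click=1.0, rating=0.8, recommendation=0.0))
```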

Application Information
• For more information about the application:
  – http://sifaka.cs.uiuc.edu/ir/proj/rec/
  – http://apps.facebook.com/news_letters/
• We are looking for motivated students to work on this application.
• Requirements:
  – Database knowledge
  – PHP
  – Perl
• Contact me if you are interested: mkarimz2@illinois.edu


Thanks

