
CS419/519

Information Filtering and Retrieval

Jon Herlocker
Dept. of Computer Science
Oregon State University
Tu/Th Day 2

Today

• Project idea presentations
• Introduction to IR
• Assignments for next week

What is Information Retrieval?

• How does it differ from databases (data retrieval)?


Review of Last Lecture

• Tour of a research information retrieval system
  – Traditional IR search engine
  – Collaborative filtering recommendations
  – Crawling, pre-processing, and indexing
  – Personalization
  – User study

Project Idea Presentations


Data Retrieval vs. Information Retrieval

 

                      Data retrieval       Information retrieval
Content               Data                 Information
Data object           Table                Document, image, other
Matching              Exact match          Partial match, best match
Items wanted          Matching             Relevant
Query language        SQL (artificial)     Natural
Query specification   Complete             Incomplete
Model                 Deterministic        Probabilistic
                      Highly structured    Less structured

Table by Xin Xao, Drexel University
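
The contrast in the table can be sketched in a few lines of Python. This is only an illustration: the `listings` records, the `documents` texts, and the word-overlap score are all assumptions made up for this example, not part of the lecture. The first function keeps a row only if every condition matches exactly; the second ranks documents by partial match.

```python
# Toy contrast between data retrieval (exact match) and
# information retrieval (partial match, ranked by a score).
# The records, documents, and scoring rule are illustrative assumptions.

listings = [
    {"id": 1, "city": "Corvallis", "type": "house"},
    {"id": 2, "city": "Salem", "type": "condo"},
    {"id": 3, "city": "Corvallis", "type": "condo"},
]

documents = {
    "d1": "charming house for sale near downtown corvallis",
    "d2": "salem condo with river view",
    "d3": "corvallis condo close to campus",
}


def data_retrieval(rows, **conditions):
    """Exact match: a row either satisfies every condition or is excluded."""
    return [r["id"] for r in rows
            if all(r.get(k) == v for k, v in conditions.items())]


def information_retrieval(docs, query):
    """Best match: score each document by how many query words it shares."""
    query_words = set(query.lower().split())
    scores = {}
    for doc_id, text in docs.items():
        overlap = len(query_words & set(text.split()))
        if overlap > 0:
            scores[doc_id] = overlap
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


exact = data_retrieval(listings, city="Corvallis", type="house")
ranked = information_retrieval(documents, "house in Corvallis")
```

Here `exact` contains only the one row that satisfies both conditions, while `ranked` lists every document with some overlap, best match first.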

Data Retrieval vs. Information Retrieval

• Information retrieval solutions may incorporate data retrieval
  – Data retrieval as a subset of information retrieval
• For this class, data retrieval alone is not interesting


Traditional Information Need Model

1. User has an information need
2. User forms a query
3. IR system makes best match with documents
4. User evaluates ranked documents
5. Is the need met?
6a. Yes -> done
6b. No -> reformulate query
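
The six steps above form a loop, which can be sketched as follows. The function name `retrieval_session`, its callback parameters, the `max_rounds` cutoff, and the toy corpus are all hypothetical, chosen only to make the control flow concrete.

```python
def retrieval_session(initial_query, search, need_is_met, reformulate,
                      max_rounds=5):
    """Run the query / evaluate / reformulate loop for at most max_rounds."""
    query = initial_query                          # step 2: user forms a query
    history = []
    for _ in range(max_rounds):
        ranked_docs = search(query)                # step 3: system matches
        history.append((query, ranked_docs))
        if need_is_met(ranked_docs):               # steps 4-5: user evaluates
            return query, ranked_docs, history     # step 6a: done
        query = reformulate(query, ranked_docs)    # step 6b: reformulate
    return query, [], history                      # user gave up


# Toy stand-ins for the system and the user (assumptions for illustration).
corpus = {"d1": "house corvallis", "d2": "condo salem"}


def toy_search(query):
    words = query.split()
    return sorted(d for d, text in corpus.items()
                  if all(w in text.split() for w in words))


def toy_need_is_met(docs):
    return "d2" in docs


def toy_reformulate(query, docs):
    return "condo"


final_query, final_docs, history = retrieval_session(
    "house", toy_search, toy_need_is_met, toy_reformulate)
```

In this toy run the first query misses the wanted document, so the user reformulates once and the second round succeeds.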

Relevance?

• What does relevance mean?
• A document might be relevant for many reasons
  – Answers a question with a fact
  – Gives part of an answer
  – Gives link to the answer
  – Gives related information
• Relevance is subjective!
  – We’ll return to this discussion later


Next

Models of information retrieval

Some Assumptions of Traditional Model

• It is possible for the user to specify their exact needs
• Document texts are functionally equivalent to information needs
  – Essentially document retrieval
• The information need remains constant throughout search process
• The user will always recognize relevant documents

(Belkin, Oddy, & Brooks)

Other Models

• Anomalous States of Knowledge (ASKs)
  – (Belkin, Oddy, and Brooks)
  – A recognized anomaly in a user’s state of knowledge that the user is not able to specify precisely
• Berry-Picking Model
  – (Bates 90)
  – Interesting information scattered like berries on bushes
  – The query is continually shifting

Takeaway Messages

• Modeling information need and user activity is complex
• Be aware of simplistic assumptions that most IR work makes


MIR – Chap. 2 - Models

• Goal is to retrieve documents that are relevant to a user’s single information need

– Based on those traditional assumptions discussed earlier

Textbook Roadmap
Models for Retrieval

• What documents best match the need described by a query?
• Modeling
  – Information need
    • From a query
  – Information content
    • From a document
  – Closeness or similarity
    • Between need and content
• Based on content analysis
  – Measurable attributes of queries and documents

Generic Document Retrieval Model

[Diagram components: Documents; Representation of Document Content; Information Need; Query Language; Representation of Information Need; Prior Knowledge & Assumptions; Ranking Algorithm; Best Matching Documents]
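
A minimal sketch of how the pieces named in this model fit together: a representation of the information need built from the query, a representation of each document's content, and a ranking algorithm that compares the two. The function names and the choice of Jaccard overlap as the similarity measure are assumptions for illustration only; the lecture does not prescribe a particular measure here.

```python
def represent(text):
    """Representation step (assumed): a lowercase bag of distinct words."""
    return frozenset(text.lower().split())


def jaccard(need, content):
    """Closeness between need and content as set overlap (an assumption)."""
    union = need | content
    return len(need & content) / len(union) if union else 0.0


def rank(query, documents, similarity=jaccard):
    """Ranking algorithm: score every document against the query."""
    need = represent(query)
    scored = [(doc_id, similarity(need, represent(text)))
              for doc_id, text in documents.items()]
    # Keep only documents with some similarity, best match first.
    scored = [(doc_id, score) for doc_id, score in scored if score > 0]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


example_docs = {
    "a": "house in corvallis",
    "b": "condo in salem",
    "c": "boat",
}
example_ranking = rank("house corvallis", example_docs)
```

Swapping `represent` or `similarity` for another function changes the retrieval model without touching the ranking loop, which is the point of separating the boxes in the diagram.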
Index Terms

• Roughly a word or phrase describing the content of a document
• Manual Indexing
  – A human reads or scans a document and assigns it index terms
  – (e.g. Library of Congress subject terms)
• Automatic Indexing
  – Full-text indexing
  – Every word in the document becomes an index term
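
Full-text indexing can be sketched as building an inverted index, a mapping from each word to the documents that contain it. The tokenizer below (lowercase, alphanumeric runs only) and the sample documents are assumptions made for this example; real systems make many more choices at this step.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map every word to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return dict(index)


docs = {
    1: "House for sale in Corvallis",
    2: "Condo for rent in Salem",
    3: "Corvallis condo, near campus",
}
index = build_inverted_index(docs)
```

After this runs, `index["corvallis"]` holds documents 1 and 3, and common words such as "for" are indexed just like any other term.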

Generic Full-Text Indexing Definitions

• $K = \{k_1, \ldots, k_t\}$
  – Set of all index terms
• $w_{i,j} > 0$
  – Weight of term $k_i$ in document $d_j$
  – $w_{i,j} = 0$ if $k_i$ is not in $d_j$
• Each document is described as a vector
  – $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$

Modern Information Retrieval, Baeza-Yates & Ribeiro-Neto
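
To make the definitions concrete, the sketch below builds the term set K from a small collection and one weight vector per document. The definitions leave the weighting scheme open, so using the raw count of each term as its weight is an assumption of this example, as are the function names.

```python
from collections import Counter


def build_vocabulary(documents):
    """K: the sorted list of all index terms in the collection."""
    terms = set()
    for text in documents:
        terms.update(text.lower().split())
    return sorted(terms)


def document_vector(text, vocabulary):
    """d_j: one weight per term in K; here the weight is the raw term count."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that do not occur, matching w_ij = 0.
    return [counts[term] for term in vocabulary]


docs = ["house house corvallis", "condo salem"]
K = build_vocabulary(docs)
vectors = [document_vector(d, K) for d in docs]
```

Every vector has length t (the size of K), and most entries are zero because a document uses only a few of the terms in the collection.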

Models Covered

• In this class:
  – Boolean Model
  – Vector Space Model
• Potentially more if time and interest permit.

Full-text Indexing Models

• Boolean model
• Vector space model
• Probabilistic model

Boolean Model

• Compute vectors for each document ($d_j$)
• If a keyword $k_i$ appears in $d_j$, then $w_{i,j}$ is 1, otherwise 0
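
A short sketch of the binary weighting described above. The function name `boolean_vector` and the four-term vocabulary are hypothetical, used only for illustration.

```python
def boolean_vector(document_terms, vocabulary):
    """Weight is 1 if the keyword appears in the document, otherwise 0."""
    present = set(document_terms)
    return [1 if term in present else 0 for term in vocabulary]


vocabulary = ["condo", "corvallis", "house", "salem"]
d1 = boolean_vector("house house corvallis".split(), vocabulary)
d2 = boolean_vector("condo salem".split(), vocabulary)
```

Note that "house" occurs twice in the first document but its weight is still 1: the Boolean model records presence, not frequency.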

Boolean Query Language

• Terms
  – Words
  – Phrases
• Operators
  – AND
  – OR
  – NOT

Rules for Boolean Logic

• De Morgan’s Laws
  – NOT (A AND B) = (NOT A) OR (NOT B)
  – NOT (A OR B) = (NOT A) AND (NOT B)
• Search for “Boolean Logic” if you want to know more
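
Both identities can be checked mechanically. The sketch below tests every truth assignment and then shows the same rule on sets of document ids, where NOT becomes a set difference against the whole collection. The sets `A`, `B`, and `all_docs` are made-up examples.

```python
from itertools import product


def de_morgan_holds():
    """Check both identities for every combination of truth values."""
    for a, b in product([False, True], repeat=2):
        if (not (a and b)) != ((not a) or (not b)):
            return False
        if (not (a or b)) != ((not a) and (not b)):
            return False
    return True


# The same rule on result sets: NOT X is "all documents minus X".
all_docs = {1, 2, 3, 4}
A = {1, 2}   # documents matching term A (illustrative)
B = {2, 3}   # documents matching term B (illustrative)

not_a_and_b = all_docs - (A & B)
rewritten = (all_docs - A) | (all_docs - B)
```

The set form is what matters in retrieval: a query can be rewritten with these rules without changing which documents it returns.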

Pseudo Boolean Notation

• House +Corvallis
  – (Corvallis AND House) OR Corvallis
• +House +Condo Corvallis Salem
  – (House AND Condo AND Corvallis) OR (House AND Condo AND Salem) OR (House AND Condo)

Example Boolean Queries

• House
• House AND Corvallis
• House OR Corvallis
• (House OR Condo) AND Corvallis
• House AND Oregon AND NOT Eugene
• (House OR Condo) AND Corvallis AND NOT Eugene
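
Queries like these map directly onto set operations over an inverted index: AND is intersection, OR is union, and NOT is a difference against the full collection. The index contents below are invented for illustration, and `docs_with` is a hypothetical helper.

```python
# Term -> ids of documents containing it (made-up data for illustration).
index = {
    "house": {1, 2, 5},
    "condo": {3, 4},
    "corvallis": {1, 3},
    "oregon": {1, 2, 3, 4, 5},
    "eugene": {2, 4},
}
all_docs = {1, 2, 3, 4, 5}


def docs_with(term):
    """Documents containing the term; an unknown term matches nothing."""
    return index.get(term.lower(), set())


# House AND Corvallis
house_and_corvallis = docs_with("House") & docs_with("Corvallis")

# House OR Corvallis
house_or_corvallis = docs_with("House") | docs_with("Corvallis")

# (House OR Condo) AND Corvallis
either_in_corvallis = (docs_with("House") | docs_with("Condo")) & docs_with("Corvallis")

# House AND Oregon AND NOT Eugene
house_oregon_not_eugene = (docs_with("House") & docs_with("Oregon")
                           & (all_docs - docs_with("Eugene")))
```

Each result is an unordered set, which reflects the fact that a pure Boolean query says which documents match but nothing about their order.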

Informal Pseudo Boolean Notation

• Evolved in web search engines
• +house +Corvallis
  – house AND Corvallis
• +house +Oregon -Eugene
  – house AND Oregon AND NOT Eugene
• House +Corvallis
  – ?
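
One way to read this notation is sketched below, following the "(Corvallis AND House) OR Corvallis" expansion given on the Pseudo Boolean Notation slide: terms marked `+` are required, terms marked `-` are excluded, and unmarked terms are optional. Treating a query with no `+` terms as an OR over its optional terms is an assumption of this sketch, and real search engines differ in the details. Both function names are hypothetical.

```python
def parse_pseudo_boolean(query):
    """Split a query into (required, optional, excluded) sets of terms."""
    required, optional, excluded = set(), set(), set()
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.add(token[1:].lower())
        elif token.startswith("-") and len(token) > 1:
            excluded.add(token[1:].lower())
        else:
            optional.add(token.lower())
    return required, optional, excluded


def matches(doc_terms, query):
    """Decide whether a document (a collection of terms) satisfies the query."""
    required, optional, excluded = parse_pseudo_boolean(query)
    terms = {t.lower() for t in doc_terms}
    if terms & excluded:
        return False
    if required:
        # Optional terms do not filter once something is required.
        return required <= terms
    # Assumption: with no required terms, any optional term is enough.
    return bool(optional & terms)
```

Under this reading, `House +Corvallis` matches any document containing "Corvallis", with or without "House". An engine could still use the optional term to order the matches.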

Ordering of Retrieved Documents

• Pure Boolean has no order
  – All returned documents are equally relevant
• In reality, different approaches can be taken
  – Chronologically
  – Order by number of times a specified term occurs
  – Other approaches – get further and further away from Boolean.
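
As a sketch of the second approach, the code below first selects documents with a Boolean AND and then orders the matches by how often one specified term occurs. The sample documents and function names are assumptions for illustration.

```python
from collections import Counter

docs = {
    "d1": "house house house corvallis",
    "d2": "house corvallis corvallis",
    "d3": "condo corvallis",
}


def boolean_and(docs, terms):
    """Unordered Boolean step: keep documents containing every term."""
    return [d for d, text in docs.items()
            if all(t in text.split() for t in terms)]


def order_by_term_count(docs, matching_ids, term):
    """Order the matches by occurrences of one term, most frequent first."""
    counts = {d: Counter(docs[d].split())[term] for d in matching_ids}
    return sorted(matching_ids, key=lambda d: (-counts[d], d))


matched = boolean_and(docs, ["house", "corvallis"])
ordered = order_by_term_count(docs, matched, "corvallis")
```

Both matched documents satisfy the query equally in Boolean terms. The ordering comes entirely from the extra counting step, which is already a move away from the pure model.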

Boolean Searching

• Upsides
  – Easy to implement
  – Simple queries are easy
  – Query language gives significant control over results
• Downsides
  – Binary relevance decision
    • No ordering criteria
    • Usually too much or too little
  – Syntax can be complex

Boolean model – Data Retrieval or Information Retrieval?

• Very close – hard to distinguish
• Differences
  – Enormous number of attributes
  – A document only has values for a few of those attributes
  – Inefficient to store and search using traditional data retrieval methods
  – Ordering may still be important

Who Uses Boolean Searches

• Everybody until about ten to fifteen years ago
• Even now, many commercial systems (library catalogs, abstracting services, etc.)


Proximity operators

• NEAR

• WITHIN
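
The slide names these operators without defining them, and their exact semantics vary between systems. The sketch below assumes one common reading of NEAR: two terms match if they occur within a given number of word positions of each other, in either order. It relies on a positional index, which stores where each term occurs and not just which documents contain it. All names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map term -> {doc_id: [word positions where the term occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def near(index, term_a, term_b, max_distance):
    """Documents where the two terms occur within max_distance positions."""
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    result = set()
    # Only documents containing both terms can qualify.
    for doc_id in postings_a.keys() & postings_b.keys():
        if any(abs(pa - pb) <= max_distance
               for pa in postings_a[doc_id]
               for pb in postings_b[doc_id]):
            result.add(doc_id)
    return result


docs = {
    1: "house for sale in corvallis",
    2: "corvallis has a farmers market and one old house",
}
positional_index = build_positional_index(docs)
close_matches = near(positional_index, "house", "corvallis", 4)
loose_matches = near(positional_index, "house", "corvallis", 10)
```

Both documents contain both words, so a plain AND would return both. The distance limit is what lets the tighter query keep only the document where the words appear close together.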