
INFORMATION RETRIEVAL

UNIT 4: WEB SEARCH – PART 3

By
A. Bhuvaneshwari
810016104018
B.E. CSE – IV Year – ‘A’ Section

Unit 4
• Collaborative Filtering
   - user-user collaborative filtering
   - item-item collaborative filtering
• Content-based recommendation
• Invisible web
• Snippet generation
Collaborative Filtering
Similarity of Users: Example

     HP1   HP2   HP3   TW    SW1   SW2   SW3
A    4     .     .     5     1     .     .
B    5     5     4     .     .     .     .
C    .     .     .     2     4     5     .
D    .     3     .     .     .     .     3

(. = not rated)
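Collaborative Filtering (contd.)
• In the above example:
   A, B, C, D are users
   HP1, HP2, HP3, TW, SW1, SW2, SW3 are movies
   Ratings are on a scale of 0 to 5.
• Consider users A, B, C, D with rating vectors rA, rB, rC and rD.
• We need a similarity measure between two users: sim(A,B)
• It should capture the intuition that sim(A,B) > sim(A,C)
  (A and B both rated HP1 highly, whereas A and C disagree on the movies they both rated)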
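Option 1: Jaccard Similarity
Formula (rA is treated as the set of movies that A rated):
$\mathrm{sim}(A,B) = |r_A \cap r_B| \,/\, |r_A \cup r_B|$
$\mathrm{sim}(A,C) = |r_A \cap r_C| \,/\, |r_A \cup r_C|$
sim(A,B) = 1/5 = 0.2
sim(A,C) = 2/4 = 0.5
sim(A,B) < sim(A,C), which contradicts the intuition
Problem:
Ignores the rating values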
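The Jaccard computation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the name `jaccard` and the dictionary `rated` are hypothetical, and the set of movies assumed for user C is chosen to be consistent with the value sim(A,C) = 2/4 quoted above.

```python
# Movies each user has rated (rating values are deliberately dropped).
# The movie set for C is an assumption consistent with sim(A,C) = 2/4.
rated = {
    "A": {"HP1", "TW", "SW1"},
    "B": {"HP1", "HP2", "HP3"},
    "C": {"TW", "SW1", "SW2"},
}


def jaccard(x: set, y: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


sim_ab = jaccard(rated["A"], rated["B"])  # 1 shared movie out of 5
sim_ac = jaccard(rated["A"], rated["C"])  # 2 shared movies out of 4
print(sim_ab, sim_ac)
```

Because only set membership is used, a 1-star and a 5-star rating of the same movie count as full agreement, which is the problem noted above.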
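Option 2: Cosine Similarity
Formula:
$\mathrm{sim}(A,B) = \cos(r_A, r_B) = \dfrac{r_A \cdot r_B}{\lVert r_A \rVert \, \lVert r_B \rVert}$
A = [4 0 0 5 1 0 0]
B = [5 5 4 0 0 0 0]
C = [0 0 0 2 4 5 0]
sim(A,B) = 0.38
sim(A,C) = 0.32
sim(A,B) > sim(A,C)
Problem:
Treats missing ratings as negative (a missing rating is stored as 0, the lowest value on the scale)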
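As a rough check of the cosine figures, the sketch below computes the similarities directly from the rating vectors, with missing ratings stored as 0. The helper name `cosine` is hypothetical, and the vector for C is an assumption consistent with sim(A,C) = 0.32.

```python
import math


def cosine(x, y):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Columns: HP1 HP2 HP3 TW SW1 SW2 SW3; a missing rating is stored as 0.
r_a = [4, 0, 0, 5, 1, 0, 0]
r_b = [5, 5, 4, 0, 0, 0, 0]
r_c = [0, 0, 0, 2, 4, 5, 0]  # assumed positions for C's ratings

sim_ab = cosine(r_a, r_b)
sim_ac = cosine(r_a, r_c)
print(round(sim_ab, 2), round(sim_ac, 2))
```

The ordering now matches the intuition, but the gap is small because every unrated movie is counted as a rating of 0.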
Option 3: Centered Cosine Similarity
• Normalize ratings by subtracting the row mean (computed over the rated movies only)
• Example:
• Mean for row 1 (A) = (4+5+1)/3 = 10/3
• Mean for row 2 (B) = (5+5+4)/3 = 14/3
• Mean for row 3 (C) = (2+4+5)/3 = 11/3
• Mean for row 4 (D) = (3+3)/2 = 6/2 = 3
Option 3: Centered Cosine Similarity (contd.)

Given table:

     HP1   HP2   HP3   TW    SW1   SW2   SW3
A    4     .     .     5     1     .     .
B    5     5     4     .     .     .     .
C    .     .     .     2     4     5     .
D    .     3     .     .     .     .     3

Normalized table:

     HP1   HP2   HP3   TW    SW1   SW2   SW3
A    2/3   .     .     5/3   -7/3  .     .
B    1/3   1/3   -2/3  .     .     .     .
C    .     .     .     -5/3  1/3   4/3   .
D    .     0     .     .     .     .     0

(. = not rated)
Option 3: Centered Cosine Similarity (contd.)
• The sum of the centered ratings in any row is zero.
• “0” now represents the user's average rating.
• Positive values mean the user liked the movie more than average.
• Negative values mean the user liked the movie less than average.
• sim(A,B) = cos(rA, rB) = 0.09
• sim(A,C) = cos(rA, rC) = -0.56
• sim(A,B) > sim(A,C)
• Missing ratings are treated as “average” (0 after centering)
• Also known as “Pearson correlation”
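The centering step and the resulting similarities can be reproduced with the sketch below. It is only an illustration under stated assumptions: `center` and `cosine` are hypothetical helper names, missing ratings are represented by `None`, and the positions of C's ratings are assumed so that they agree with sim(A,C) = -0.56.

```python
import math


def center(row):
    """Subtract the mean of the rated entries; missing entries become 0."""
    rated = [r for r in row if r is not None]
    mean = sum(rated) / len(rated)
    return [0.0 if r is None else r - mean for r in row]


def cosine(x, y):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


N = None  # marker for "not rated"
# Columns: HP1 HP2 HP3 TW SW1 SW2 SW3
ratings = {
    "A": [4, N, N, 5, 1, N, N],
    "B": [5, 5, 4, N, N, N, N],
    "C": [N, N, N, 2, 4, 5, N],  # assumed positions for C's ratings
}
centered = {user: center(row) for user, row in ratings.items()}

sim_ab = cosine(centered["A"], centered["B"])
sim_ac = cosine(centered["A"], centered["C"])
print(round(sim_ab, 2), round(sim_ac, 2))
```

After centering, A and C come out as negatively correlated, which reflects their opposite opinions on TW and SW1.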
Rating Predictions
Item-Item Collaborative Filtering
Example – Rating Predictions
Item-Item CF (contd.)
• Here we use Pearson correlation as the similarity:
1) Subtract the mean rating m_i from each movie i (i = 1, 2, …, 6)
   Example:
   m1 = (1+3+5+5+4)/5 = 3.6
   row 1 = {-2.6, 0, -0.6, 0, 0, 1.4, 0, 0, 1.4, 0, 0.4, 0}
2) Compute cosine similarities between the rows.
• Neighbour selection:
   N = 2
   Identify the N movies most similar to movie 1 that were rated by user 5
Item-Item CF (contd.)

Cosine similarity:
S(1,2) = -0.18   S(1,3) = 0.41
S(1,4) = -0.10   S(1,5) = -0.31
S(1,6) = 0.59
Weighted average (movies 3 and 6 are the two nearest neighbours; user 5 rated them 2 and 3):
r15 = (0.41*2 + 0.59*3) / (0.41 + 0.59)
r15 = 2.59 ≈ 2.6
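The neighbour selection and weighted average can be sketched as follows. The function `predict_rating` is a hypothetical helper, and `user5_ratings` contains only the two ratings that appear in the formula above; any other ratings by user 5 are not given in the text and are left out.

```python
def predict_rating(similarities, user_ratings, n=2):
    """Similarity-weighted average over the n most similar items the user rated."""
    candidates = [
        (sim, user_ratings[item])
        for item, sim in similarities.items()
        if item in user_ratings
    ]
    top = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:n]
    total = sum(sim for sim, _ in top)
    if total == 0:
        return None  # no usable neighbours
    return sum(sim * rating for sim, rating in top) / total


# Similarity of movie 1 to each other movie (values from the slide).
sims_to_movie1 = {2: -0.18, 3: 0.41, 4: -0.10, 5: -0.31, 6: 0.59}
# Ratings by user 5: only the two that the worked formula uses.
user5_ratings = {3: 2, 6: 3}

r15 = predict_rating(sims_to_movie1, user5_ratings)
print(round(r15, 2))
```

Filtering on `item in user_ratings` before taking the top N matters: a highly similar movie is useless as a neighbour if the user never rated it.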
Content-based Recommendations
Plan of Action
Item Profiles
• For each item, create an item profile
• A profile is a set of features
   Movies: actor, director, …
   People: set of friends
Text Features:
• Profile = set of “important” words in the item (document)
• How to pick important words?
   Using TF-IDF (Term Frequency * Inverse Document Frequency)
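To make the TF-IDF idea concrete, here is a small sketch that picks the highest-scoring words of each document as its profile. It assumes one common variant (term frequency normalized by the most frequent word in the document, and IDF = log2(N / document frequency)); the slides do not say which variant is intended. The function `tf_idf_profiles`, the parameter `top_k`, and the sample documents are all hypothetical.

```python
import math
from collections import Counter


def tf_idf_profiles(docs, top_k=3):
    """Return, for each non-empty document, its top_k words by TF-IDF score."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    profiles = []
    for tokens in tokenized:
        counts = Counter(tokens)
        max_count = max(counts.values())
        scores = {
            word: (count / max_count) * math.log2(n_docs / df[word])
            for word, count in counts.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda word: (-scores[word], word))
        profiles.append(ranked[:top_k])
    return profiles


docs = [
    "the wizard casts a spell at the school",
    "the rebel pilot flies a ship",
    "the wizard duels the dark wizard",
]
profiles = tf_idf_profiles(docs, top_k=2)
print(profiles)
```

A word such as "the" occurs in every document, so its IDF is 0 and it never enters a profile, however frequent it is.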
User Profiles
• The user has rated items with profiles i1, i2, …, in
• Simple:
   (weighted) average of the rated item profiles
• Variant:
   Normalize the weights using the user's average rating
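Both ways of building a user profile can be sketched as below. This is one possible reading of "(weighted) average": item features are assumed to be Boolean, and each feature's weight is the average (optionally mean-centered) rating of the rated items that have that feature. The function `user_profile` and the sample data are hypothetical.

```python
def user_profile(item_profiles, ratings, normalize=False):
    """Per-feature average rating over the rated items that have the feature."""
    mean = sum(ratings.values()) / len(ratings)
    n_features = len(next(iter(item_profiles.values())))
    totals = [0.0] * n_features
    counts = [0] * n_features
    for item, rating in ratings.items():
        # Variant: subtract the user's average rating before averaging.
        weight = rating - mean if normalize else rating
        for k, has_feature in enumerate(item_profiles[item]):
            if has_feature:
                totals[k] += weight
                counts[k] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


# Hypothetical data: features are [actor A, actor B].
item_profiles = {
    "m1": [1, 0], "m2": [1, 0],
    "m3": [0, 1], "m4": [0, 1], "m5": [0, 1],
}
ratings = {"m1": 3, "m2": 5, "m3": 1, "m4": 2, "m5": 4}

simple = user_profile(item_profiles, ratings)
normalized = user_profile(item_profiles, ratings, normalize=True)
print(simple, normalized)
```

With normalization, a feature the user tends to rate below their own average gets a negative weight, which the simple average cannot express.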
Example 1: Boolean Utility Matrix
Example 2: Star Ratings
INVISIBLE WEB

• What we access every day through popular search engines such as
Google and Yahoo is referred to as the “Surface Web”:
its contents are indexed by search engines.
• The invisible web is the part of the World Wide Web whose contents are not
indexed by standard web search engines.
• It is also called the “deep web” or “hidden web”.
Invisible web (contd.)
• It comprises:
   - Private websites, such as sites that require passwords and logins
   - Limited-access content sites
   - Unlinked content: pages that no other page links to, which
     prevents web crawlers from reaching them
   - Scripted content: pages accessible only through JavaScript,
     as well as content loaded using Flash
Invisible web search engines

• Popular invisible web search engines:
   - “Clusty” is a meta search engine
   - “Complete Planet” searches more than 70,000 databases and specialty
     search engines found only in the invisible web
   - Info Mine
   - Internet Archive
   - Digital Librarian
   - The Internet Public Library
Invisible web search tools
• Here is a small sample of invisible web search
tools (directories, engines) to help you find invisible content:
   - Books Online: The Online Books Page
   - Finance and Investing: Bankrate.com
   - International: International Databases
   - Economic and Job Data: FreeLunch.com
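Invisible web (contd.)
• Surface web:
   Publicly visible websites that are indexed by standard search engines and used in daily life.
   Example: pages found through Google
• Deep web:
   Content that standard search engines do not index, typically because it sits behind a login or a query form. Accessing it is not illegal in itself.
   Example: account pages on government sites
• Dark web:
   The part of the deep web that is intentionally hidden and can be reached only with special software.
   Example: sites reached through Tor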
Snippet Generation
• A snippet is a short summary of the document, designed
to let the user decide its relevance.
• A snippet is a query-dependent document summary.
• Snippet generation has the following parts:
   - Key Term Extraction
   - Sentence Extraction
   - Top Sentence Identification
   - Snippet Unit Selection
   - Snippet Generation
Snippet generation module
Snippet generation (contd.)
KEY TERM EXTRACTION:
   Query Term Extraction
   Title Words Extraction
   Meta Keyword Extraction
SENTENCE EXTRACTION:
   Text Filtering
   Sentence Extraction
Snippet generation (contd.)
• TOP SENTENCE IDENTIFICATION:
1) All the extracted sentences are searched for the keywords,
i.e., query terms, title words and meta keywords.
2) Each extracted sentence is given a weight according to the keywords
found in it, and the sentences are ranked by that weight.
3) This module has 2 sub-modules:
   Weight Assigning
   Sentence Ranking
Snippet generation (contd.)
• Weight Assigning:
This sub-module calculates the weight of each sentence in the
document. A sentence weight has 3 basic components:
   query term dependent score
   title word dependent score
   meta keyword dependent score
These 3 components are calculated and added to get the final
weight of the sentence.
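A minimal sketch of the weight calculation is given below. The slides do not define how each of the three scores is computed, so this example assumes that each score is simply the number of distinct keywords of that kind found in the sentence. The function name `sentence_weight` and the sample keyword sets are hypothetical.

```python
import re


def sentence_weight(sentence, query_terms, title_words, meta_keywords):
    """Sum of three keyword-overlap scores (assumed: counts of distinct matches)."""
    words = set(re.findall(r"\w+", sentence.lower()))
    query_score = len(words & query_terms)
    title_score = len(words & title_words)
    meta_score = len(words & meta_keywords)
    return query_score + title_score + meta_score


weight = sentence_weight(
    "Collaborative filtering recommends items using ratings.",
    query_terms={"collaborative", "filtering"},
    title_words={"filtering", "recommendation"},
    meta_keywords={"ratings"},
)
print(weight)
```

A word that is both a query term and a title word ("filtering" here) contributes to both scores, so the three components are not mutually exclusive.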
Snippet generation (contd.)
• Sentence Ranking:
After the weights of all the sentences in the document are calculated,
the sentences are sorted in descending order of weight.
If two or more sentences get equal weight, they are sorted
in ascending order of their positional value,
i.e., the sentence number in the document.
The top 3 ranked sentences are then taken for snippet generation.
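The ranking rule, including the tie-break on position, fits in a single sort key. The sketch below is illustrative only: `top_sentences` is a hypothetical name, and sentences are identified by their 0-based position in the document.

```python
def top_sentences(weights, k=3):
    """Return the positions of the k best sentences.

    weights[i] is the weight of sentence i in document order. Sorting on
    (-weight, position) gives descending weight, then ascending position.
    """
    order = sorted(range(len(weights)), key=lambda pos: (-weights[pos], pos))
    return order[:k]


weights = [2, 5, 0, 5, 3]
top = top_sentences(weights)
print(top)
```

Sentences 1 and 3 tie on weight 5, so the earlier one is ranked first.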
Snippet generation (contd.)
SNIPPET UNIT SELECTION:
1) Snippet unit extraction
2) Weight assigning
3) Snippet unit ranking
1) Snippet unit extraction:
• If the total length of the top 3 ranked sentences of the
document is larger than the maximum length of the snippet, all
3 sentences are split into snippet units.
• The sentences are split into snippet units at
brackets, semicolons, commas, etc.
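Snippet unit extraction can be sketched with a regular-expression split. The assumptions here are that length is measured in characters and that the delimiters are round and square brackets, semicolons and commas; the "etc." in the text is not expanded further. The function `split_into_units` is a hypothetical name.

```python
import re


def split_into_units(sentences, max_length):
    """Split the top sentences into snippet units only if they are too long."""
    total = sum(len(s) for s in sentences)
    if total <= max_length:
        return list(sentences)  # fits as is, no splitting needed
    units = []
    for sentence in sentences:
        # Split at brackets, semicolons and commas.
        for part in re.split(r"[()\[\];,]", sentence):
            cleaned = part.strip(" .")
            if cleaned:
                units.append(cleaned)
    return units


sentences = [
    "Snippets are short, query-dependent summaries (shown under each result).",
    "They help users judge relevance; fast.",
]
units = split_into_units(sentences, max_length=40)
print(units)
```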
Snippet generation (contd.)
2) Weight Assigning:
The weight of each extracted snippet unit has to be calculated to identify
the most relevant and most important snippet units.
3) Snippet unit ranking:
The snippet unit ranking sub-module provides the ranked list of snippet units.
SNIPPET GENERATION:
This module generates the snippet from the snippet units.
It selects the ranked snippet units until the maximum
length of the snippet has been reached.
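The final assembly step can be read as a greedy loop over the ranked units. In the sketch below, `build_snippet` and the `" ... "` separator are assumptions, length is measured in characters, and the loop stops at the first unit that would exceed the limit.

```python
def build_snippet(ranked_units, max_length, separator=" ... "):
    """Append ranked units in order until the next one would exceed max_length."""
    snippet = ""
    for unit in ranked_units:
        candidate = unit if not snippet else snippet + separator + unit
        if len(candidate) > max_length:
            break  # maximum snippet length reached
        snippet = candidate
    return snippet


ranked_units = [
    "query-dependent summaries",
    "Snippets are short",
    "shown under each result",
]
snippet = build_snippet(ranked_units, max_length=50)
print(snippet)
```

An alternative design would skip an oversized unit and keep trying shorter ones; stopping at the first overflow follows the wording "until the maximum length of the snippet has been reached" more literally.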
