
Frontiers of

Computational Journalism
Columbia Journalism School
Week 3: Filtering Algorithms
September 26, 2016

This class

The need for filters


Comment Ranking
Newsblaster
User-item recommendation
New York Times recommendation system

The Need for Filters

Journalism as a cycle
[Diagram: the journalism cycle. Data, Reporting, Filtering, and Effects feed into one another, with CS techniques applicable at each stage and users consuming the filtered output.]

[Diagram: many candidate stories (×); filtering selects which ones reach the user, leaving the rest as stories not covered.]

Each day, the Associated Press publishes:


~10,000 text stories
~3,000 photographs
~500 videos
+ radio, interactive

More video on YouTube than produced by TV networks during the entire 20th century

Google now indexes more web pages than there are people in the world

400,000,000 tweets per day

An estimated 130,000,000 books ever published

10,000 legally-required reports filed by U.S. public companies every day

All New York Times articles ever = 0.06 terabytes
(13 million stories, 5k per story)

"It's not information overload, it's filter failure."
- Clay Shirky

Comment Ranking

Filtering Comments

Thousands of comments, what are the good ones?

Comment voting

Problem: putting the comments with the most votes at the top doesn't work. Why?

Reddit Comment Ranking (old)

Up/down votes plus time decay

Reddit Comment Ranking (new)


Hypothetically, suppose all N users voted on the comment, and v out of N up-voted. Then we could sort by the proportion p = v/N of upvotes.

N = 16
v = 11
p = 11/16 = 0.6875

Reddit Comment Ranking


Actually, only n users out of N vote, giving an observed proportion p̂ = v/n that only approximates p.

n = 3
v = 1
p̂ = 1/3 = 0.333

Reddit Comment Ranking


[Example: one comment has true p = 0.6875 but observed p̂ = 0.333; another has true p = 0.1875 but observed p̂ = 0.75, so sorting by observed proportion ranks them backwards.]

Limited sampling can rank comments wrongly when we don't have enough data.

Random error in sampling


If we observe a proportion p̂ of upvotes from n random users, what is the distribution of the true proportion p?

[Figure: sampling distribution of the observed proportion when the true p = 0.5.]

Confidence interval
Given the observed p̂, an interval that the true p has a given probability of lying inside.

Rank comments by the lower bound of the confidence interval.
An analytic solution for this confidence interval is known as the Wilson score interval.

p̂ = observed proportion of upvotes
n = how many people voted
z = how certain we want to be that the observed p̂ is close to the true p
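A minimal sketch of the lower bound, computed directly from the observed proportion, n, and z (this is the standard Wilson score interval formula; z = 1.96 corresponds to roughly 95% confidence):

```python
from math import sqrt

def wilson_lower_bound(upvotes, n, z=1.96):
    """Lower bound of the Wilson score interval for the true upvote
    proportion, given `upvotes` out of `n` total votes."""
    if n == 0:
        return 0.0                      # no votes yet: rank at the bottom
    p_hat = upvotes / n                 # observed proportion
    centre = p_hat + z * z / (2 * n)
    spread = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (centre - spread) / (1 + z * z / n)
```

Sorting by this score pushes down comments whose high observed proportion rests on only a handful of votes: 11 of 16 upvotes scores above 1 of 3, even though both have a plausible-looking ratio.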

Newsblaster

System Description
Scrape → Cluster events → Cluster topics → Summarize

Scrape
Handcrafted list of source URLs (news front pages)
and links followed to depth 4
Then extract the text of each article

Text extraction from HTML

Ideal world: HTML5 article tags

"The article element represents a component of a page that consists of a self-contained composition in a document, page, application, or site and that is intended to be independently distributable or reusable, e.g. in syndication."
- W3C Specification

The dismal reality of text extraction

Every site is a beautiful flower.

Text extraction from HTML


Newsblaster paper:
For each page examined, if the amount of text in the
largest cell of the page (after stripping tags and links) is
greater than some particular constant (currently 512
characters), it is assumed to be a news article, and this text
is extracted.
(At least it's simple. This was 2002. How often does this work now?)
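A rough sketch of that 2002 heuristic (not the actual Newsblaster code), using only the standard library:

```python
from html.parser import HTMLParser

class LargestCellExtractor(HTMLParser):
    """Collects the text of each <td> cell and remembers the largest one.
    Nested tables are not handled carefully; neither were many 2002 pages."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.buf = []        # text chunks of the current cell
        self.largest = ""    # text of the largest cell seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.buf = []

    def handle_endtag(self, tag):
        if tag == "td" and self.in_cell:
            self.in_cell = False
            text = " ".join(self.buf).strip()
            if len(text) > len(self.largest):
                self.largest = text

    def handle_data(self, data):
        if self.in_cell:
            self.buf.append(data.strip())

def extract_article(html, threshold=512):
    """Return the largest cell's text if it exceeds `threshold` characters
    (the paper's constant), else None: the page is assumed not to be an article."""
    parser = LargestCellExtractor()
    parser.feed(html)
    return parser.largest if len(parser.largest) > threshold else None
```

On modern, table-free layouts this heuristic finds nothing, which is exactly the point of the question above.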

Text extraction from HTML


Now there are multiple services/APIs to do this, e.g. readability.com

Cluster Events

Cluster Events
Surprise!
encode articles into feature vectors
cosine distance function
hierarchical clustering algorithm

Different clustering algorithms


Partitioning
o keep adjusting clusters until convergence
o e.g. K-means

Agglomerative hierarchical
o start with leaves, repeatedly merge clusters
o e.g. single-link (MIN) and complete-link (MAX) approaches

Divisive hierarchical
o start with root, repeatedly split clusters
o e.g. binary split

But news is an on-line problem...


Articles arrive one at a time, and must be clustered immediately.
Can't look forward in time, can't go back and reassign.
Greedy algorithm.

Single pass clustering


put first story in its own cluster
repeat:
    get next story S
    look for a cluster C with distance(S, C) < T
    if found:
        put S in C
    else:
        put S in a new cluster
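The pseudocode above as a runnable sketch; the distance function and threshold T are supplied by the caller, and single-link distance to a cluster is an assumption here:

```python
def single_pass_cluster(stories, distance, T):
    """Greedy single-pass clustering: each story joins the nearest
    existing cluster if it is within threshold T, else starts a new one.
    `distance` is any story-to-story metric, e.g. cosine distance
    on TF-IDF vectors."""
    clusters = []                       # each cluster is a list of stories
    for s in stories:
        best, best_d = None, T
        for c in clusters:
            d = min(distance(s, member) for member in c)   # single link
            if d < best_d:
                best, best_d = c, d
        if best is not None:
            best.append(s)
        else:
            clusters.append([s])        # no cluster close enough
    return clusters
```

For example, one-dimensional "stories" with absolute difference as the distance and T = 1.5 group into runs of nearby values.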

Evaluating clusterings
When is one clustering better than another?
Ideally, we'd like a quantitative metric.

This is possible if we have training data: human-generated clusters, available from the TDT2 corpus (topic detection and tracking).

Error with respect to hand-generated clusters from training data

Now sort events into categories


Categories:
U.S., World, Finance, Science and Technology, Entertainment, Sports.

Primitive operation: what topic is this story in?

TF-IDF, again
Each category has pre-assigned TF-IDF coordinate.
Story category = closest point.
[Diagram: the latest story's TF-IDF vector falls closest to the World category point rather than the Finance category point.]
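A minimal sketch of the nearest-category step; the category names and vectors below are illustrative stand-ins for the real pre-assigned TF-IDF coordinates:

```python
import numpy as np

def nearest_category(story_vec, category_vecs):
    """Assign a story to the category whose pre-assigned vector has the
    highest cosine similarity to the story's vector.
    `category_vecs` maps category name -> numpy vector."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(category_vecs, key=lambda name: cos(story_vec, category_vecs[name]))
```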

Cluster summarization
Problem: given a set of documents, write a sentence
summarizing them.
Difficult problem. See references in the Newsblaster paper, and more recent techniques.

Is Newsblaster really a filter?


After all, it shows all the news...
Differences from Google News?

Personalization!
Not every person needs to see the same news.

User-item Recommendation

User-item matrix

Stores the rating of each user for each item. Could also be a binary variable recording whether the user clicked, liked, starred, shared, purchased...

User-item matrix

No content analysis. We know nothing about what is in each item.
Typically very sparse: a user hasn't watched even 1% of all movies.
The filtering problem is guessing the unknown entries in the matrix. High guessed values are things the user would want to see.

Filtering process

How to guess unknown rating?


Basic idea: suggest similar items.
Similar items are rated in a similar way by many different users.
Remember, a rating could be a click, a like, a purchase.
o Users who bought A also bought B...
o Users who clicked A also clicked B...
o Users who shared A also shared B...

Similar items

Item similarity
Cosine similarity!
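A minimal sketch of item-item cosine similarity on a dense user-item matrix; real systems use sparse storage, and the matrix below is illustrative:

```python
import numpy as np

def item_cosine_similarity(R, i, j):
    """Cosine similarity between item columns i and j of a user-item
    matrix R (rows = users, 0 = no rating)."""
    a, b = R[:, i], R[:, j]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0
```

Two items rated identically by the same users get similarity 1; items rated by disjoint sets of users get similarity 0.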

Other distance measures


Adjusted cosine similarity
Subtracts each user's average rating, to compensate for general enthusiasm ("most movies suck" vs. "most movies are great").
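Written out, with U_ij denoting the set of users who rated both items and r̄_u user u's average rating (notation assumed here):

```latex
\mathrm{sim}(i,j) =
  \frac{\sum_{u \in U_{ij}} (r_{u,i}-\bar r_u)\,(r_{u,j}-\bar r_u)}
       {\sqrt{\sum_{u \in U_{ij}} (r_{u,i}-\bar r_u)^2}\;
        \sqrt{\sum_{u \in U_{ij}} (r_{u,j}-\bar r_u)^2}}
```

This is plain cosine similarity applied after centering each user's ratings on their own mean.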

Generating a recommendation

Average of the user's ratings of similar items, weighted by item similarity.
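A sketch of that weighted average; both dicts, keyed by hypothetical item ids, are illustrative stand-ins for the real neighbourhood data:

```python
def predict_rating(user_ratings, similarities):
    """Predict a user's rating of a target item as the average of their
    ratings of neighbouring items, weighted by item-item similarity.
    user_ratings[k]  = user's rating of neighbour item k
    similarities[k]  = similarity between item k and the target item"""
    num = sum(similarities[k] * r for k, r in user_ratings.items())
    den = sum(abs(similarities[k]) for k in user_ratings)
    return num / den if den else 0.0
```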

Matrix factorization recommender

Note: only sum over observed ratings rij.
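The usual regularized objective, where u_i and v_j are the user and item topic vectors; the regularization weight λ is an assumption of this sketch, and the sum runs only over observed ratings r_ij:

```latex
\min_{U,V} \sum_{(i,j)\ \text{observed}} \bigl(r_{ij} - u_i^{\top} v_j\bigr)^2
  \;+\; \lambda \Bigl(\sum_i \lVert u_i \rVert^2 + \sum_j \lVert v_j \rVert^2\Bigr)
```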

Matrix factorization plate model


[Plate diagram: for each of i users a topic vector u (with a hyperparameter for variation in user topics), and for each of j items a topic vector v (with a hyperparameter for variation in item topics); together they generate the user's rating of the item.]

New York Times recommender

Combining collaborative filtering and topic modeling

Content modeling - LDA


[Plate diagram of LDA: for each of D docs, a topic mixture (governed by a topic concentration parameter) assigns a topic to each of the N words in the doc; each of K topics is a distribution over words (governed by a word concentration parameter), which generates each word in the doc.]

Collaborative Topic Modeling


[Plate diagram of collaborative topic modeling: the content side is LDA (topic concentration, topics in doc, topic for word, word in doc, over K topics); the collaborative side adds per-user topic vectors (with per-user variation) and a per-doc weight of user topics, which together with the user's selections generate the user rating of each doc.]

[Figure: results comparing the content-only model with the content + social model.]
