Computational Journalism
Columbia Journalism School
Week 4: Algorithmic Filtering
October 2, 2015
Journalism as a cycle
[Figure: cycle diagram. Labels: Data, Reporting, Filtering, User, Effects; several points in the cycle are marked "CS".]
[Figure: filtering diagram. Recoverable labels: "filtering", "User".]
- Clay Shirky
System Description
Scrape
Cluster events
Cluster topics
Summarize
Scrape
Handcrafted list of source URLs (news front pages)
and links followed to depth 4
Then extract the text of each article
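The scrape step described above can be sketched as a breadth-first crawl that stops at depth 4, followed by a crude text extraction. This is only an illustration under stated assumptions: `crawl`, `extract_text`, and the `fetch` callback are hypothetical names, the example site is a made-up in-memory dictionary, and a real system would need politeness delays, error handling, and much better article extraction.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of a page as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def crawl(seed_urls, fetch, max_depth=4):
    """Breadth-first crawl. Returns {url: html} for every page that is
    at most max_depth links away from a seed URL.
    fetch(url) is a caller-supplied function returning HTML or None."""
    seen = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    pages = {}
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        parser = LinkParser()
        parser.feed(html)
        parser.close()
        for href in parser.links:
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages


# Made-up site: a chain of pages, each linking to the next.
FAKE_SITE = {
    "http://news.example/": '<p>Front page</p><a href="/p1">1</a>',
    "http://news.example/p1": '<p>Story one</p><a href="/p2">2</a>',
    "http://news.example/p2": '<p>Story two</p><a href="/p3">3</a>',
    "http://news.example/p3": '<p>Story three</p><a href="/p4">4</a>',
    "http://news.example/p4": '<p>Story four</p><a href="/p5">5</a>',
    "http://news.example/p5": '<p>Story five</p>',
}

pages = crawl(["http://news.example/"], FAKE_SITE.get, max_depth=4)
articles = {url: extract_text(html) for url, html in pages.items()}
```

Passing `fetch` in as a parameter keeps the crawl logic separate from the network layer, so the depth limit can be checked without touching a real site.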
Cluster Events
Surprise!
encode articles into feature vectors
cosine distance function
hierarchical clustering algorithm
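The first two ingredients above can be sketched as follows. The slide does not say which features are used, so this example assumes TF-IDF weights over lowercase word tokens; `tokenize`, `tfidf_vectors`, and `cosine_distance` are illustrative names, and the three documents are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Encode each document as a sparse {term: weight} dictionary,
    with weight = term count * log(N / document frequency)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors (1.0 if either is all zeros)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


docs = [
    "Stocks fall on Wall Street",
    "Wall Street stocks rally",
    "Earthquake hits coastal city",
]
vecs = tfidf_vectors(docs)
d_similar = cosine_distance(vecs[0], vecs[1])    # two market stories
d_different = cosine_distance(vecs[0], vecs[2])  # market story vs. earthquake
```

Two articles about the same event share many weighted terms and so end up at a small cosine distance, which is what the clustering step relies on.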
Agglomerative hierarchical
o start with leaves, repeatedly merge clusters
o e.g. MIN (single-link) and MAX (complete-link) approaches
Divisive hierarchical
o start with root, repeatedly split clusters
o e.g. binary split
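A minimal sketch of the agglomerative approach, assuming a precomputed symmetric distance matrix and a fixed target number of clusters as the stopping rule (both are choices made here for illustration). The `linkage` argument switches between MIN (single-link) and MAX (complete-link); the function name and the toy data are hypothetical.

```python
def agglomerative(dist, num_clusters, linkage=min):
    """Start with one cluster per item (the leaves) and repeatedly merge
    the two closest clusters until num_clusters remain.
    linkage=min gives MIN (single-link); linkage=max gives MAX (complete-link)."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Cluster-to-cluster distance is the min (or max) over all
                # pairs of members, depending on the linkage.
                d = linkage(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Toy data: five points on a line, with absolute difference as the distance.
points = [0, 1, 2, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]

single_link = agglomerative(dist, 2, linkage=min)
complete_link = agglomerative(dist, 2, linkage=max)
```

On this well-separated toy data both linkages agree. They can differ on elongated or noisy data, where MIN tends to chain clusters together and MAX favors compact ones. The naive pair search here is slow for large collections and is written for clarity only.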
Evaluating clusterings
When is one clustering better than another?
Ideally, we'd like a quantitative metric.
This is possible if we have training data, i.e. human-generated
clusters.
Available from the TDT2 corpus (Topic Detection and
Tracking).
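One way to turn "better" into a number, given human-generated clusters, is to compare which pairs of stories each clustering puts together. The sketch below computes pairwise precision, recall, and F1. It is one common choice, not necessarily the metric used in the lecture or in the TDT evaluations, and the function names and labels are made up for illustration.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Set of index pairs (i, j), i < j, that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_scores(predicted, reference):
    """Compare a predicted clustering to a human reference clustering.
    Both arguments are lists giving a cluster label for each story."""
    pred = same_cluster_pairs(predicted)
    ref = same_cluster_pairs(reference)
    agreed = len(pred & ref)
    precision = agreed / len(pred) if pred else 1.0
    recall = agreed / len(ref) if ref else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


human = ["a", "a", "a", "b", "b"]   # reference clusters for five stories
system = [0, 0, 1, 1, 1]            # algorithm output for the same stories
scores = pairwise_scores(system, human)
```

Because only co-membership is compared, the label names themselves do not matter: a clustering that groups the stories exactly as the humans did scores 1.0 on all three measures.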
TF-IDF, again
Each category has a pre-assigned TF-IDF coordinate.
Story category = closest category point.
[Figure: the latest story plotted between the "world" category point and the "finance" category point.]
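The "closest point" rule can be sketched as a nearest-category lookup. This example assumes closeness is measured by cosine similarity, which the slide does not state; the category coordinates and the story vector are invented numbers, and `classify` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(story_vec, category_vecs):
    """Return the name of the category whose coordinate is closest to the story."""
    return max(
        category_vecs,
        key=lambda name: cosine_similarity(story_vec, category_vecs[name]),
    )


# Invented TF-IDF coordinates for two categories.
categories = {
    "world": {"election": 1.0, "treaty": 0.8, "minister": 0.6},
    "finance": {"stocks": 1.0, "bank": 0.8, "market": 0.7},
}
latest_story = {"bank": 0.5, "market": 0.9, "minister": 0.1}
label = classify(latest_story, categories)
```

The story shares most of its weight with the finance coordinate, so it is assigned there even though it also mentions a minister.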
Cluster summarization
Problem: given a set of documents, write a sentence
summarizing them.
This is a difficult problem. See the references in the Newsblaster paper,
and more recent techniques.
Personalization!
Not every person needs to see the same news.
r(S,U,{P},{B}) in [0...1]
Editors
Editors are filters. They decide what stories to run, and
how prominent to make them.
How do they choose?
Information diet
"The holy grail in this model, as far as I'm
concerned, would be a Firefox plugin that would
passively watch your websurfing behavior and
characterize your personal information
consumption. Over the course of a week, it might
let you know that you hadn't encountered any
news about Latin America, or remind you that a
full 40% of the pages you read had to do with
Sarah Palin. It wouldn't necessarily prescribe
changes in your behavior, simply help you monitor
your own consumption in the hopes that you
might make changes."
- Ethan Zuckerman,
Playing the Internet with PMOG