Computational Journalism
Columbia Journalism School
Week 3: Filtering Algorithms
September 26, 2016
This class
Journalism as a cycle
[Figure: cycle diagram linking Data, Reporting, User, Effects and Filtering, with "CS" marked at several points in the cycle]
[Figure: filtering diagrams; labels: User, items marked x, filtering]
Comment Ranking
Filtering Comments
Comment voting
Up down votes
plus time decay
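As a rough illustration of "up down votes plus time decay", the sketch below divides net votes by a power of the comment's age. The function name `decayed_score`, the offset of 2 hours and the `gravity` exponent are assumptions made for this example; the slides do not specify a formula.

```python
def decayed_score(upvotes: int, downvotes: int, age_hours: float,
                  gravity: float = 1.5) -> float:
    """Net votes divided by a power of the comment's age.

    The offset (2.0) and exponent (gravity) are illustrative
    assumptions, not values taken from the lecture.
    """
    net = upvotes - downvotes
    return net / (age_hours + 2.0) ** gravity


# Hypothetical comments: one old and popular, one new and modest.
comments = [
    {"id": "old_popular", "up": 50, "down": 5, "age_hours": 48.0},
    {"id": "new_modest", "up": 8, "down": 1, "age_hours": 1.0},
]

ranked = sorted(
    comments,
    key=lambda c: decayed_score(c["up"], c["down"], c["age_hours"]),
    reverse=True,
)
```

With these made-up numbers the newer comment ranks first, because the older comment's larger vote total is outweighed by its age.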
p = 0.333
p = 0.6875
Confidence interval
Given an observed p, the interval that the true p has a
specified probability (for example, 95%) of lying inside.
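One common way to compute such an interval for an upvote proportion is the Wilson score interval. The sketch below is an assumption about which interval to use, since the slides do not name one. The vote counts (1 of 3 and 11 of 16) are hypothetical, chosen only because they reproduce the proportions 0.333 and 0.6875 shown above.

```python
import math
from typing import Tuple


def wilson_interval(upvotes: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 for roughly 95%)."""
    if total == 0:
        return (0.0, 1.0)  # no votes: the proportion is completely unknown
    p = upvotes / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return (center - half, center + half)


# Hypothetical counts matching the proportions on the slide.
few_votes = wilson_interval(1, 3)      # p = 0.333
more_votes = wilson_interval(11, 16)   # p = 0.6875
```

Sorting comments by the lower bound of the interval, rather than by the raw p, keeps a comment with very few votes from outranking one with a long, consistent voting record.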
Newsblaster
System Description
Scrape
Cluster events
Cluster topics
Summarize
Scrape
Handcrafted list of source URLs (news front pages)
and links followed to depth 4
Then extract the text of each article
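The scrape step can be sketched as a breadth-first crawl that stops following links at a fixed depth. The `crawl` function, the injected `fetch_links` callable and the toy link graph are all assumptions for illustration; a real crawler would fetch pages over HTTP and then extract the article text.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], List[str]],
          max_depth: int = 4) -> Dict[str, int]:
    """Breadth-first crawl; returns each visited URL mapped to its depth."""
    depth_of = {url: 0 for url in seeds}
    queue = deque(depth_of)
    while queue:
        url = queue.popleft()
        if depth_of[url] >= max_depth:
            continue  # pages at the depth limit are kept but not expanded
        for link in fetch_links(url):
            if link not in depth_of:
                depth_of[link] = depth_of[url] + 1
                queue.append(link)
    return depth_of


# Toy link graph standing in for real pages (hypothetical names).
TOY_WEB = {"front": ["a", "b"], "a": ["c"], "c": ["d"], "d": ["e"], "e": ["f"]}
visited = crawl(["front"], lambda url: TOY_WEB.get(url, []), max_depth=4)
```

In this toy graph, page "e" sits at depth 4 and is visited, while "f" lies one link further and is never reached.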
Cluster Events
Surprise!
encode articles into feature vectors
cosine distance function
hierarchical clustering algorithm
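The first two ingredients can be sketched in plain Python: TF-IDF feature vectors and a cosine distance between them. The weighting used here (raw term count times log(N/df)) is one common variant and an assumption; the slides do not say which TF-IDF formula Newsblaster uses. The three tiny documents are made up.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors: term count * log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: count * math.log(n / df[t]) for t, count in tf.items()})
    return vectors


def cosine_distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means no shared terms."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical tokenized articles.
docs = [
    "storm hits coast".split(),
    "storm floods coast towns".split(),
    "markets rally on earnings".split(),
]
vecs = tfidf_vectors(docs)
d_storms = cosine_distance(vecs[0], vecs[1])
d_unrelated = cosine_distance(vecs[0], vecs[2])
```

The two storm stories end up closer to each other than either is to the markets story, which shares no terms with them.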
Agglomerative hierarchical
  o start with leaves (single documents), repeatedly merge the closest clusters
  o e.g. MIN (single-link) and MAX (complete-link) approaches
Divisive hierarchical
  o start with root (all documents in one cluster), repeatedly split clusters
  o e.g. binary split
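A minimal sketch of the agglomerative approach, with the MIN and MAX variants selected by passing Python's built-in `min` or `max` as the linkage. Stopping once the closest pair exceeds a distance `threshold` is an assumption made here so that the example returns flat clusters; the function name and the toy points are also hypothetical.

```python
from typing import Callable, List


def agglomerative(dist: List[List[float]], threshold: float,
                  linkage: Callable = min) -> List[List[int]]:
    """Merge the closest pair of clusters until none is within threshold.

    linkage=min gives the MIN approach, linkage=max gives the MAX approach.
    """
    clusters = [[i] for i in range(len(dist))]  # start with leaves
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break  # assumed stopping rule, not from the slides
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy data: four points on a line, distance = absolute difference.
points = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(p - q) for q in points] for p in points]

min_clusters = agglomerative(dist, threshold=1.5, linkage=min)
max_clusters = agglomerative(dist, threshold=1.5, linkage=max)
```

On this data the MIN approach chains the first three points into one cluster, while the MAX approach stops after merging the first two, because the farthest members of the candidate merge are 2.0 apart.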
Evaluating clusterings
When is one clustering better than another?
Ideally, we'd like a quantitative metric.
This is possible if we have training data, i.e. human-generated
clusters.
Available from the TDT2 corpus (Topic Detection and
Tracking)
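One quantitative metric that compares a clustering against human-generated clusters is the Rand index: the fraction of document pairs on which the two clusterings agree (both put the pair together, or both keep it apart). This is offered as one possible metric, not necessarily the one used with TDT2; the labels below are made up.

```python
from itertools import combinations
from typing import Sequence


def rand_index(predicted: Sequence[int], reference: Sequence[int]) -> float:
    """Fraction of item pairs on which two clusterings agree.

    Each sequence gives a cluster label per document; needs at least 2 documents.
    """
    pairs = list(combinations(range(len(predicted)), 2))
    agree = sum(
        (predicted[i] == predicted[j]) == (reference[i] == reference[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical human labels and two algorithm outputs for five documents.
human = [0, 0, 0, 1, 1]
perfect = [7, 7, 7, 3, 3]    # same grouping, different label names
imperfect = [0, 0, 1, 1, 1]  # third document placed in the wrong cluster

score_perfect = rand_index(perfect, human)
score_imperfect = rand_index(imperfect, human)
```

Because the metric only looks at pairs, the label names themselves do not matter, so `perfect` scores 1.0 even though its labels differ from the human ones.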
TF-IDF, again
Each category has a pre-assigned TF-IDF coordinate.
Story category = closest point.
[Figure: the latest story plotted relative to the "world" category and "finance" category points]
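The "story category = closest point" rule can be sketched as a nearest-neighbor lookup under cosine similarity. The category coordinates and the story vector below are invented weights for illustration, and `closest_category` is a hypothetical helper name.

```python
import math
from typing import Dict


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_category(story: Dict[str, float],
                     categories: Dict[str, Dict[str, float]]) -> str:
    """Return the name of the category whose coordinate is nearest the story."""
    return max(categories, key=lambda name: cosine_similarity(story, categories[name]))


# Hypothetical pre-assigned TF-IDF coordinates for two categories.
CATEGORIES = {
    "world": {"election": 1.2, "treaty": 0.9, "border": 0.7},
    "finance": {"stocks": 1.1, "earnings": 1.0, "bank": 0.8},
}
latest_story = {"bank": 0.9, "earnings": 0.6, "election": 0.2}
assigned = closest_category(latest_story, CATEGORIES)
```

Here the story shares most of its weight with the finance coordinate, so it is assigned to "finance" despite the small overlap with "world".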
Cluster summarization
Problem: given a set of documents, write a sentence
summarizing them.
Difficult problem. See references in Newsblaster
paper, and more recent techniques.
Personalization!
Not every person needs to see the same news.
User-item Recommendation
User-item matrix
Filtering process
Similar items
Item similarity
Cosine similarity!
Generating a recommendation
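The steps above (user-item matrix, item similarity by cosine, generating a recommendation) can be sketched as simple item-based filtering. The rating matrix is made up, 0 is assumed to mean "unrated", and scoring a candidate by a similarity-weighted sum of the user's ratings is one simple choice among several, not necessarily the scheme shown in the lecture.

```python
import math
from typing import List

# Rows are users, columns are items; 0 means unrated (hypothetical data).
RATINGS = [
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
    [5, 0, 0, 0],
]


def item_vector(ratings: List[List[int]], item: int) -> List[int]:
    """Column of the user-item matrix: every user's rating of one item."""
    return [row[item] for row in ratings]


def cosine_similarity(a: List[int], b: List[int]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(ratings: List[List[int]], user: int, top_n: int = 1) -> List[int]:
    """Rank unrated items by similarity to items the user already rated."""
    n_items = len(ratings[0])
    scores = {}
    for candidate in range(n_items):
        if ratings[user][candidate] != 0:
            continue  # only recommend items the user has not rated
        scores[candidate] = sum(
            cosine_similarity(item_vector(ratings, candidate),
                              item_vector(ratings, rated)) * ratings[user][rated]
            for rated in range(n_items)
            if ratings[user][rated] != 0
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


suggestion = recommend(RATINGS, user=4)
```

The last user has rated only item 0, so the suggestion is item 1, the item whose rating column is most similar to item 0's.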
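[Figure: user-item rating model; labels: i users, j items, u, user rating of item, variation in user topics]
[Figure: topic model diagram; labels: word in doc, N words in doc, words in topics, D docs, word concentration parameter, K topics]
[Figure: combined model; labels: variation in per-user topics, user rating of doc, K topics]
[Figure: comparison labeled "content only" and "content + social"]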