
Frontiers of Computational Journalism
Columbia Journalism School
Week 4: Algorithmic Filtering
October 2, 2015

Journalism as a cycle
[Figure: cycle diagram with labels Data, Reporting, Filtering, User, Effects; "CS" appears at several points]

[Figure: stories not covered (marked x), filtering, User]

Each day, the Associated Press publishes:


~10,000 text stories
~3,000 photographs
~500 videos
+ radio, interactive

more video on YouTube than produced by TV networks during
entire 20th century

Google now indexes more web pages than there are people in the world
400,000,000 tweets per day
estimated 130,000,000 books ever published

10,000 legally required reports filed by U.S. public companies every day

All New York Times articles ever = 0.06 terabytes
(13 million stories, 5k per story)
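The estimate above is simple arithmetic and can be checked directly. A minimal sketch, assuming "5k per story" means roughly 5,000 bytes of text and using decimal terabytes (both are assumptions, the slide does not say):

```python
# Back-of-envelope check of the archive size quoted above.
stories = 13_000_000        # story count from the slide
bytes_per_story = 5_000     # "5k per story", read here as 5,000 bytes (assumption)

total_bytes = stories * bytes_per_story
total_terabytes = total_bytes / 1e12   # decimal terabytes (assumption)

print(f"{total_bytes / 1e9:.0f} GB = {total_terabytes:.3f} TB")
```

Either way the whole archive lands in the 0.06 to 0.07 TB range, small enough to fit on a laptop.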

"It's not information overload, it's filter failure."
- Clay Shirky

System Description
Scrape
Cluster events
Cluster topics
Summarize

Scrape
Handcrafted list of source URLs (news front pages)
and links followed to depth 4
Then extract the text of each article
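A depth-limited crawl like the one described can be sketched as a breadth-first search. This is an illustration only, not the Newsblaster code: `crawl`, `get_links`, and the toy link table are hypothetical names, and an in-memory dictionary stands in for real HTTP fetches.

```python
from collections import deque

def crawl(seed_urls, get_links, max_depth=4):
    """Breadth-first crawl from seed_urls, following links up to max_depth hops."""
    seen = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen

# Toy in-memory "web" standing in for real page fetches (hypothetical URLs).
TOY_WEB = {
    "front": ["a1", "a2"],
    "a1": ["a3"],
    "a3": ["a4"],
    "a4": ["a5"],
    "a5": ["a6"],
}
pages = crawl(["front"], lambda u: TOY_WEB.get(u, []))
```

Here `a6` sits five hops from the front page, so it is never reached.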

Text extraction from HTML

Ideal world: HTML5 article tags

"The article element represents a component of a page that consists
of a self-contained composition in a document, page, application,
or site and that is intended to be independently distributable or
reusable, e.g. in syndication."
- W3C Specification
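In that ideal world, extraction reduces to reading the text inside `article` tags. A minimal sketch using Python's standard `html.parser`; the class name, function name, and sample page are made up for illustration:

```python
from html.parser import HTMLParser

class ArticleText(HTMLParser):
    """Collect text that appears inside <article> elements."""
    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting level of open <article> tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())

def extract_article_text(html):
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

sample = ("<html><body><nav>Home | Sports</nav>"
          "<article><h1>Headline</h1><p>Body text.</p></article>"
          "<footer>Ads</footer></body></html>")
text = extract_article_text(sample)
```

Navigation and footer text are dropped because they sit outside the `article` element.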

The dismal reality of text extraction

Every site is a beautiful flower.

Text extraction from HTML


Newsblaster paper:
For each page examined, if the amount of text in the
largest cell of the page (after stripping tags and links) is
greater than some particular constant (currently 512
characters), it is assumed to be a news article, and this text
is extracted.
(At least it's simple. This was 2002. How often does this work
now?)
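The heuristic quoted above is easy to sketch. This is a reading of the description, not the original implementation: "cell" is taken to mean an HTML `td` element, text inside links is skipped, and nested tables are handled only crudely (text counts toward the innermost open cell). All names are hypothetical.

```python
from html.parser import HTMLParser

MIN_CHARS = 512  # threshold quoted in the Newsblaster description

class CellText(HTMLParser):
    """Collect the text of each table cell, skipping text inside links."""
    def __init__(self):
        super().__init__()
        self.open_cells = []   # stack of text buffers for currently open <td> cells
        self.cells = []        # finished cell texts
        self.link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.open_cells.append([])
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.open_cells:
            self.cells.append(" ".join(self.open_cells.pop()))
        elif tag == "a" and self.link_depth > 0:
            self.link_depth -= 1

    def handle_data(self, data):
        if self.open_cells and self.link_depth == 0 and data.strip():
            self.open_cells[-1].append(data.strip())

def extract_article(html, min_chars=MIN_CHARS):
    """Return the largest cell's text if it exceeds min_chars, else None."""
    parser = CellText()
    parser.feed(html)
    parser.close()
    largest = max(parser.cells, key=len, default="")
    return largest if len(largest) > min_chars else None

page = ("<table><tr><td><a href='/'>Home</a> menu</td>"
        "<td>" + "News sentence. " * 40 + "</td></tr></table>")
article = extract_article(page)
```

A page laid out without tables yields no cells at all, which is one reason the rule ages badly.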

Text extraction from HTML

There are now multiple services/APIs to do this, e.g.
readability.com

Cluster Events

Surprise!
encode articles into feature vectors
cosine distance function
hierarchical clustering algorithm
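The first two ingredients can be sketched in a few lines. This assumes TF-IDF weighting for the feature vectors and a pre-tokenized input; the function names and the three toy headlines are illustrative only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into sparse TF-IDF dicts (term -> weight)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means same direction, 1 means no shared terms."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

docs = [
    "senate passes budget bill".split(),
    "senate debates budget bill".split(),
    "team wins championship game".split(),
]
vecs = tfidf_vectors(docs)
d_related = cosine_distance(vecs[0], vecs[1])
d_unrelated = cosine_distance(vecs[0], vecs[2])
```

The two budget stories end up closer to each other than either is to the sports story, which shares no terms with them.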

Different clustering algorithms


Partitioning
o keep adjusting clusters until convergence
o e.g. K-means

Agglomerative hierarchical
o start with leaves, repeatedly merge clusters
o e.g. MIN and MAX approaches

Divisive hierarchical
o start with root, repeatedly split clusters
o e.g. binary split
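To make the agglomerative case concrete, here is a naive sketch that merges the two closest clusters until none are within a threshold. MIN is read as single-link and MAX as complete-link. The stopping threshold, the function name, and the 1-D toy data are assumptions for illustration.

```python
def agglomerative(points, distance, threshold, linkage=min):
    """Merge the two closest clusters until none are closer than threshold.

    linkage=min gives the MIN (single-link) rule, linkage=max the MAX
    (complete-link) rule for the distance between two clusters.
    """
    clusters = [[p] for p in points]   # start with one leaf per point
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break                      # remaining clusters are too far apart
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# 1-D toy data; absolute difference stands in for cosine distance.
points = [0.0, 1.0, 2.0, 10.0]
dist = lambda a, b: abs(a - b)
single = agglomerative(points, dist, threshold=1.5, linkage=min)
complete = agglomerative(points, dist, threshold=1.5, linkage=max)
```

With the same threshold, MIN chains 0, 1, and 2 into one cluster, while MAX refuses the second merge because 0 and 2 are too far apart.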

But news is an on-line problem...


Articles arrive one at a time, and must be clustered immediately.
Can't look forward in time, can't go back and reassign.
Greedy algorithm.

Single pass clustering



put first story in its own cluster
repeat
    get next story S
    look for cluster C with distance < T
    if found
        put S in C
    else
        put S in new cluster
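The pseudocode above translates almost line for line into Python. Two details are assumptions, since the slide does not specify them: the distance from a story to a cluster is taken as the distance to its nearest member, and when several clusters qualify the closest one wins. Jaccard distance on word sets stands in for cosine distance on feature vectors.

```python
def single_pass_cluster(stories, distance, threshold):
    """Greedy online clustering: each story is assigned once, in arrival order."""
    clusters = []
    for story in stories:
        best, best_d = None, threshold
        for cluster in clusters:
            # Assumption: cluster distance = distance to its nearest member.
            d = min(distance(story, member) for member in cluster)
            if d < best_d:
                best, best_d = cluster, d
        if best is not None:
            best.append(story)
        else:
            clusters.append([story])   # also handles the very first story
    return clusters

# Toy stories as word sets; Jaccard distance stands in for cosine distance.
def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)

stream = [
    {"senate", "budget", "vote"},
    {"team", "wins", "final"},
    {"senate", "budget", "bill"},
]
clusters = single_pass_cluster(stream, jaccard_distance, threshold=0.6)
```

The third story joins the first one's cluster. Nothing is ever reassigned, so the result depends on arrival order.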

Evaluating clusterings
When is one clustering better than another?
Ideally, we'd like a quantitative metric.
This is possible if we have training data = human-generated
clusters.
Available from the TDT2 corpus (Topic Detection and
Tracking)

Error with respect to hand-generated clusters from training data
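One way to score a clustering against hand-made clusters is pairwise precision and recall: of the story pairs the algorithm put together, how many did the humans also put together, and vice versa. This is a common choice offered as an illustration; it is not necessarily the error measure plotted here, and the data below is invented.

```python
from itertools import combinations

def same_cluster_pairs(clusters):
    """All unordered pairs of items that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(cluster, 2))
    return pairs

def pairwise_scores(predicted, reference):
    """Pairwise precision and recall of predicted clusters against reference ones."""
    pred, ref = same_cluster_pairs(predicted), same_cluster_pairs(reference)
    agree = len(pred & ref)
    precision = agree / len(pred) if pred else 1.0
    recall = agree / len(ref) if ref else 1.0
    return precision, recall

reference = [["a", "b", "c"], ["d", "e"]]   # hypothetical human-made clusters
predicted = [["a", "b"], ["c", "d", "e"]]   # hypothetical algorithm output
precision, recall = pairwise_scores(predicted, reference)
```

Sweeping the distance threshold T and plotting these scores shows the trade-off between splitting events apart and lumping them together.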

Now sort events into categories


Categories:
U.S., World, Finance, Science and Technology, Entertainment, Sports.

Primitive operation: what topic is this story in?

TF-IDF, again
Each category has a pre-assigned TF-IDF coordinate.
Story category = closest point.
[Figure: points labeled "world category", "latest story", and "finance category"]
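The "closest point" rule can be sketched as a nearest-neighbor lookup over category vectors. Cosine similarity is assumed as the closeness measure, and the category weights below are invented for illustration; real coordinates would come from labeled stories.

```python
import math

def cosine_similarity(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def closest_category(story_vec, category_vecs):
    """Return the category whose vector is most similar to the story's."""
    return max(category_vecs,
               key=lambda name: cosine_similarity(story_vec, category_vecs[name]))

# Hypothetical TF-IDF coordinates, one sparse vector per category.
CATEGORIES = {
    "World":   {"minister": 0.8, "border": 0.6, "treaty": 0.5},
    "Finance": {"shares": 0.9, "earnings": 0.7, "bank": 0.5},
    "Sports":  {"season": 0.8, "coach": 0.6, "score": 0.6},
}
story = {"bank": 0.7, "earnings": 0.4, "minister": 0.1}
category = closest_category(story, CATEGORIES)
```

The story mentions a minister once, but its weight on banking terms pulls it toward Finance.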

Cluster summarization
Problem: given a set of documents, write a sentence
summarizing them.
Difficult problem. See references in Newsblaster paper,
and more recent techniques.

Is Newsblaster really a filter?


After all, it shows all the news...
Differences with Google News?

Personalization!
Not every person needs to see the same news.

Filter design problem


Formally, given
U = user preferences, history, characteristics
S = current story
{P} = results of function on previous stories
{B} = background world knowledge (other users?)
Define

r(S,U,{P},{B}) in [0...1]

relevance of story S to user U
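To make the signature concrete, here is a deliberately crude toy instance of r. It scores a story by the share of its terms that appear in the user's interests and ignores {P} and {B} entirely. Everything about it (names, term-set representation, scoring rule) is an assumption for illustration, not a recommended filter.

```python
def relevance(story_terms, user_interests, previous_scores=(), background=None):
    """Toy r(S, U, {P}, {B}): share of the story's terms the user cares about.

    previous_scores ({P}) and background ({B}) are accepted to mirror the
    formal signature but are ignored by this toy scorer.
    """
    if not story_terms:
        return 0.0
    return len(story_terms & user_interests) / len(story_terms)

user = {"budget", "senate", "climate"}          # hypothetical U
story = {"senate", "budget", "vote", "delay"}   # hypothetical S
score = relevance(story, user)
```

The result always falls in [0, 1], as the definition requires. A scorer this narrow is also a direct route to the echo-chamber problem discussed next.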

What makes a filtering algorithm "good"?

Editors
Editors are filters. They decide what stories to run, and
how prominent to make them.
How do they choose?

The Echo Chamber


"[Echo chambers are] those Internet spaces where like-minded
people listen only to those people who already agree with them.
...
While most of us had assumed that the Internet would increase
the diversity of opinion, the echo chamber meme says the Net
encourages groups to form that increase the homogeneity of
belief. This isn't simply a factual argument about the topography
carved by traffic and links. A 'tut, tut' has been appended: See,
you Web idealists have been shown up; humankind's social
nature sucks, just as we always told you!"
- David Weinberger, "Is there an echo in here?"

Graph of political book sales during 2008 U.S. election, by orgnet.org


From Amazon "users who bought X also bought Y" data.

Retweet network of political tweets.


From Conover et al., "Political Polarization on Twitter"

Instagram co-tag graph, highlighting three distinct topical communities: 1) pro-Israeli
(Orange), 2) pro-Palestinian (Yellow), and 3) Religious / Muslim (Purple)
Gilad Lotan, Betaworks

The Filter Bubble


"What people care about politically, and what they're motivated to do
something about, is a function of what they know about and what they
see in their media. ... People see something about the deficit on the
news, and they say, 'Oh, the deficit is the big problem.' If they see
something about the environment, they say the environment is a big
problem.
This creates this kind of a feedback loop in which your media influences
your preferences and your choices; your choices influence your media;
and you really can go down a long and narrow path, rather than
actually seeing the whole set of issues in front of us."
- Eli Pariser,
"How do we recreate a front-page ethos for a digital world?"

The (Algorithmic) Filter Bubble


If we try to present stories that the user will want to
click on... do we end up only telling people what they
want to hear?
If an algorithm only shows us things our friends like, will
we ever see anything that challenges us?

Filter design problem, restated


When should a user see a story?
Aspects to this question:
normative
    personal: what I want
    societal: emergent group effects
UI
    how do I tell the computer what I want?
technical
    constrained by algorithmic possibility
economic
    cheap enough to deploy widely

Information diet
"The holy grail in this model, as far as I'm
concerned, would be a Firefox plugin that would
passively watch your websurfing behavior and
characterize your personal information
consumption. Over the course of a week, it might
let you know that you hadn't encountered any
news about Latin America, or remind you that a
full 40% of the pages you read had to do with
Sarah Palin. It wouldn't necessarily prescribe
changes in your behavior, simply help you monitor
your own consumption in the hopes that you
might make changes."

- Ethan Zuckerman,
"Playing the Internet with PMOG"
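The reporting half of such a plugin is straightforward once each page has a topic label. The hard part is the labeling, which this sketch simply assumes has been done. The function name and the sample week of browsing are invented.

```python
from collections import Counter

def diet_report(page_topics, watched_topics):
    """Share of pages per topic, plus watched topics never encountered."""
    counts = Counter(page_topics)
    total = len(page_topics)
    shares = {topic: n / total for topic, n in counts.items()} if total else {}
    missing = [t for t in watched_topics if t not in counts]
    return shares, missing

# Hypothetical week of browsing, one topic label per page read.
week = ["Sarah Palin"] * 4 + ["Sports"] * 3 + ["Finance"] * 3
shares, missing = diet_report(week, ["Latin America", "Finance"])
```

The report only describes consumption. Consistent with the quote, it prescribes nothing.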
