Algorithmic Filtering. Computational Journalism Week 4

Frontiers of
Computational Journalism
Columbia Journalism School
Week 4: Algorithmic Filtering
October 2, 2015
Journalism as a cycle
[Figure: cycle diagram. Labels: Data, Reporting, Filtering, User, Effects, CS]
[Figure: diagram. Labels: stories not covered (marked x), filtering, User]
Each day, the Associated Press publishes:

~10,000 text stories
~3,000 photographs
~500 videos
+ radio, interactive
more video on YouTube than produced by TV networks during entire 20th century
Google now indexes more web pages than there are people in the world
400,000,000 tweets per day
estimated 130,000,000 books ever published
10,000 legally-required reports filed by U.S. public companies every day
All New York Times articles ever = 0.06 terabytes
(13 million stories, 5k per story)
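The estimate can be checked with a few lines of arithmetic. This sketch assumes "5k" means 5,000 bytes and that a terabyte is 10^12 bytes; both are assumptions, not stated on the slide. Under them the total comes out at about 0.065 terabytes, in line with the slide's rounded figure.

```python
# Back-of-the-envelope check of the archive size quoted above.
stories = 13_000_000        # from the slide
bytes_per_story = 5_000     # assumption: "5k" = 5,000 bytes of text

total_bytes = stories * bytes_per_story
terabytes = total_bytes / 1e12   # assumption: 1 TB = 10**12 bytes

print(f"{total_bytes:,} bytes = {terabytes:.3f} TB")
```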
"It's not information overload, it's filter failure."
- Clay Shirky
System Description
Scrape
Cluster events
Cluster topics
Summarize
Scrape
Handcrafted list of source URLs (news front pages)
and links followed to depth 4
Then extract the text of each article
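The scraping step can be sketched as a breadth-first crawl with a depth limit. This is a minimal illustration, not the Newsblaster code: `crawl`, `get_links`, and the toy link graph are hypothetical names, and a real crawler would fetch pages over HTTP and parse their links where this one looks them up in a dictionary.

```python
from collections import deque

def crawl(seed_urls, get_links, max_depth=4):
    """Visit the seed pages, then follow links up to max_depth hops away."""
    seen = set(seed_urls)
    queue = deque((url, 0) for url in seed_urls)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited

# Toy link graph standing in for real pages (hypothetical page names).
TOY_WEB = {
    "front": ["a1", "a2"],
    "a1": ["b1"],
    "b1": ["c1"],
    "c1": ["d1"],
    "d1": ["e1"],  # "e1" is 5 hops from "front", so it is never visited
}

pages = crawl(["front"], lambda url: TOY_WEB.get(url, []))
print(pages)
```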
Text extraction from HTML
Ideal world: HTML5 article tags
"The article element represents a component of a page that consists
of a self-contained composition in a document, page, application,
or site and that is intended to be independently distributable or
reusable, e.g. in syndication."
- W3C Specification
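In the ideal case, extraction reduces to collecting the text inside the article element. The sketch below uses only Python's standard `html.parser` module; the class and function names are illustrative, and it assumes the page really does wrap its story in an `<article>` tag.

```python
from html.parser import HTMLParser

class ArticleText(HTMLParser):
    """Collect text that appears inside <article>, ignoring script and style."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting level inside <article>
        self.skip = 0     # nesting level inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1
        elif tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth:
            self.depth -= 1
        elif tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if self.depth and not self.skip and data.strip():
            self.chunks.append(data.strip())

def extract_article(html):
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```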
The dismal reality of text extraction
Every site is a beautiful flower.

Newsblaster paper:
For each page examined, if the amount of text in the
largest cell of the page (after stripping tags and links) is
greater than some particular constant (currently 512
characters), it is assumed to be a news article, and this text
is extracted.
(At least it's simple. This was 2002. How often does this work
now?)
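The quoted heuristic is simple enough to sketch directly. The version below is one reading of it, not the Newsblaster implementation: it treats "cell" as a `<td>` or `<th>` element, assigns text to the innermost open cell, and interprets "stripping links" as dropping the text inside `<a>` tags. All names are illustrative.

```python
from html.parser import HTMLParser

MIN_ARTICLE_CHARS = 512  # the constant quoted from the Newsblaster paper

class CellText(HTMLParser):
    """Gather the text of each table cell, ignoring text inside links."""

    def __init__(self):
        super().__init__()
        self.open_cells = []  # stack of text buffers, one per open cell
        self.cells = []       # text of every finished cell
        self.in_link = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.open_cells.append([])
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.open_cells:
            self.cells.append(" ".join(self.open_cells.pop()))
        elif tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        if self.open_cells and not self.in_link and data.strip():
            self.open_cells[-1].append(data.strip())

def extract_if_article(html, min_chars=MIN_ARTICLE_CHARS):
    """Return the largest cell's text if it exceeds min_chars, else None."""
    parser = CellText()
    parser.feed(html)
    parser.close()
    largest = max(parser.cells, key=len, default="")
    return largest if len(largest) > min_chars else None
```

A page laid out without tables yields no cells at all, which is one reason to doubt how often this still works.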
There are now multiple services/APIs to do this, e.g. readability.com
Cluster Events
Surprise!
encode articles into feature vectors
cosine distance function
hierarchical clustering algorithm
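The first two ingredients can be sketched in a few lines. TF-IDF is assumed here as the feature encoding (the slides refer to it later); the tokenizer is a bare whitespace split, and the function names are illustrative, not taken from Newsblaster.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors

def cosine_distance(a, b):
    """1 - cosine similarity of two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an all-zero vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

docs = [
    "senate passes budget bill",
    "senate budget vote delayed",
    "team wins championship game",
]
vecs = tfidf_vectors(docs)
# The two budget stories share terms; the sports story shares none.
print(cosine_distance(vecs[0], vecs[1]), cosine_distance(vecs[0], vecs[2]))
```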
Different clustering algorithms

Partitioning
o keep adjusting clusters until convergence
o e.g. K-means
Agglomerative hierarchical
o start with leaves, repeatedly merge clusters
o e.g. MIN and MAX approaches
Divisive hierarchical
o start with root, repeatedly split clusters
o e.g. binary split
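As an illustration of the agglomerative family, the sketch below merges the two closest clusters until `k` remain, with the MIN or MAX rule passed in as the `linkage` argument. One-dimensional numbers stand in for article vectors, and the function name and stopping rule are assumptions made for the example.

```python
def agglomerative(points, k, linkage=min):
    """Start with one cluster per point and merge the closest pair until k remain.

    linkage=min is the MIN (single-link) rule: cluster distance is the
    distance between the closest members. linkage=max is the MAX
    (complete-link) rule: distance between the farthest members.
    """
    clusters = [[p] for p in points]  # start with leaves

    def cluster_distance(a, b):
        return linkage(abs(x - y) for x in a for y in b)

    while len(clusters) > k:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        i, j = min(
            pairs,
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [0, 1, 2, 3, 5, 7.5]
print(agglomerative(points, 2, linkage=min))
print(agglomerative(points, 2, linkage=max))
```

On these points the two rules disagree about where 5 belongs: MIN chains it onto the long left-hand cluster, while MAX pairs it with 7.5.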
But news is an on-line problem...

Articles arrive one at a time, and must be clustered
immediately.
Can't look forward in time, can't go back and
reassign.
Greedy algorithm.
Single pass clustering

put first story in its own cluster
repeat
    get next story S
    look for cluster C with distance < T
    if found
        put S in C
    else
        put S in new cluster
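The pseudocode above translates almost line for line into Python. Two details are not specified on the slide and are assumptions here: the distance from a story to a cluster is taken as the distance to its nearest member, and when several clusters are within T the closest one wins. The Jaccard word-overlap distance is only a stand-in for whatever distance the real system uses.

```python
def jaccard_distance(a, b):
    """1 - (shared words / all words); a stand-in distance for the example."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 0.0
    return 1.0 - len(wa & wb) / len(wa | wb)

def single_pass_cluster(stories, distance, threshold):
    """Greedy on-line clustering: each story is placed once, never reassigned."""
    clusters = []
    for story in stories:
        best, best_d = None, threshold
        for cluster in clusters:
            # Assumption: story-to-cluster distance = distance to nearest member.
            d = min(distance(story, member) for member in cluster)
            if d < best_d:
                best, best_d = cluster, d
        if best is not None:
            best.append(story)        # put S in C
        else:
            clusters.append([story])  # put S in new cluster
    return clusters

headlines = [
    "senate passes budget bill",
    "team wins championship game",
    "senate budget bill passes vote",
]
print(single_pass_cluster(headlines, jaccard_distance, threshold=0.5))
```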

Evaluating clusterings
When is one clustering better than another?
Ideally, we'd like a quantitative metric.
This is possible if we have training data, i.e. human-generated
clusters.
Available from the TDT2 corpus (Topic Detection and
Tracking)
Error with respect to hand-generated clusters from training data
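One simple way to score a clustering against hand-generated clusters is to compare which pairs of stories end up together. The sketch below computes pairwise precision and recall; it is offered as an illustration of the idea and is not necessarily the error measure used with TDT2 or in the Newsblaster paper.

```python
from itertools import combinations

def same_cluster_pairs(clusters):
    """All unordered pairs of items that share a cluster."""
    return {frozenset(pair) for c in clusters for pair in combinations(c, 2)}

def pairwise_scores(predicted, reference):
    """Precision and recall of predicted same-cluster pairs against the reference."""
    pred = same_cluster_pairs(predicted)
    ref = same_cluster_pairs(reference)
    agree = len(pred & ref)
    precision = agree / len(pred) if pred else 1.0
    recall = agree / len(ref) if ref else 1.0
    return precision, recall

reference = [["a", "b", "c"], ["d", "e"]]   # human-generated clusters
predicted = [["a", "b"], ["c", "d", "e"]]   # algorithm output
print(pairwise_scores(predicted, reference))
```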
Now sort events into categories

Categories:
U.S., World, Finance, Science and Technology, Entertainment, Sports.
Primitive operation: what topic is this story in?
TF-IDF, again
Each category has a pre-assigned TF-IDF coordinate.
Story category = closest point.
[Figure labels: "World" category, latest story, "Finance" category]
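The "closest point" rule can be sketched as a nearest-neighbor lookup over category vectors. The category coordinates and the story vector below are invented numbers for illustration only, and cosine distance is assumed as the measure of closeness.

```python
import math

def cosine_distance(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical pre-assigned category coordinates (term -> TF-IDF weight).
CATEGORY_POINTS = {
    "World": {"election": 0.6, "minister": 0.5, "border": 0.4},
    "Finance": {"stocks": 0.7, "bank": 0.5, "earnings": 0.4},
    "Sports": {"game": 0.6, "team": 0.6, "season": 0.3},
}

def categorize(story_vector, category_points=CATEGORY_POINTS):
    """Story category = the category whose point is closest to the story."""
    return min(
        category_points,
        key=lambda name: cosine_distance(story_vector, category_points[name]),
    )

latest_story = {"bank": 0.5, "earnings": 0.3, "minister": 0.1}
print(categorize(latest_story))
```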
Cluster summarization
Problem: given a set of documents, write a sentence
summarizing them.
Difficult problem. See references in Newsblaster paper,
and more recent techniques.
Is Newsblaster really a filter?

After all, it shows all the news...
Differences with Google News?
Personalization!
Not every person needs to see the same news.
Filter design problem

Formally, given
U = user preferences, history, characteristics
S = current story
{P} = results of function on previous stories
{B} = background world knowledge (other users?)
Define
r(S,U,{P},{B}) in [0...1]
relevance of story S to user U
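To make the signature concrete, here is a deliberately naive instance of such a function. Everything about it is an assumption for illustration: U is reduced to a set of interest terms, {B} to a single popularity share among other users, the 0.7/0.3 weights are arbitrary, and {P} is accepted but ignored.

```python
def relevance(story_terms, user_interests, previous_scores, popularity):
    """Toy r(S, U, {P}, {B}) returning a value in [0, 1].

    story_terms     -- S: list of terms in the current story
    user_interests  -- U: set of terms the user is known to care about
    previous_scores -- {P}: earlier results of this function (unused in this toy)
    popularity      -- {B}: share of other users who read the story, in [0, 1]
    """
    if not story_terms:
        return 0.0
    matched = sum(1 for t in story_terms if t in user_interests)
    interest = matched / len(story_terms)
    score = 0.7 * interest + 0.3 * popularity  # arbitrary illustrative weights
    return max(0.0, min(1.0, score))

r = relevance(["budget", "senate", "vote", "bill"], {"senate", "budget"}, [], 0.5)
print(r)
```

Even this toy shows where design choices enter: how much weight goes to the user's own interests versus what everyone else is reading.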
What makes a filtering algorithm "good"?
Editors
Editors are filters. They decide what stories to run, and
how prominent to make them.
How do they choose?
The Echo Chamber

"[Echo chambers are] those Internet spaces where like-minded
people listen only to those people who already agree with them.
...
While most of us had assumed that the Internet would increase
the diversity of opinion, the echo chamber meme says the Net
encourages groups to form that increase the homogeneity of
belief. This isn't simply a factual argument about the topography
carved by traffic and links. A 'tut, tut' has been appended: See,
you Web idealists have been shown up: humankind's social
nature sucks, just as we always told you!"
- David Weinberger, "Is there an echo in here?"
Graph of political book sales during 2008 U.S. election, by orgnet.org

From Amazon "users who bought X also bought Y" data.
Retweet network of political tweets.
From Conover et al., "Political Polarization on Twitter"
Instagram co-tag graph, highlighting three distinct topical communities: 1) pro-Israeli (Orange), 2) pro-Palestinian (Yellow), and 3) Religious / Muslim (Purple)
Gilad Lotan, Betaworks
The Filter Bubble

"What people care about politically, and what they're motivated to do
something about, is a function of what they know about and what they
see in their media. ... People see something about the deficit on the
news, and they say, 'Oh, the deficit is the big problem.' If they see
something about the environment, they say the environment is a big
problem.
This creates this kind of a feedback loop in which your media influences
your preferences and your choices; your choices influence your media;
and you really can go down a long and narrow path, rather than
actually seeing the whole set of issues in front of us.
- Eli Pariser,
How do we recreate a front-page ethos for a digital world?
The (Algorithmic) Filter Bubble

If we try to present stories that the user will want to
click on... do we end up only telling people what they
want to hear?
If an algorithm only shows us things our friends like, will
we ever see anything that challenges us?
Filter design problem, restated

When should a user see a story?
Aspects to this question:
normative
o personal: what I want
o societal: emergent group effects
UI
o how do I tell the computer what I want?
technical
o constrained by algorithmic possibility
economic
o cheap enough to deploy widely
Information diet
"The holy grail in this model, as far as I'm
concerned, would be a Firefox plugin that would
passively watch your websurfing behavior and
characterize your personal information
consumption. Over the course of a week, it might
let you know that you hadn't encountered any
news about Latin America, or remind you that a
full 40% of the pages you read had to do with
Sarah Palin. It wouldn't necessarily prescribe
changes in your behavior, simply help you monitor
your own consumption in the hopes that you
might make changes."
- Ethan Zuckerman,
"Playing the Internet with PMOG"
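The reporting half of such a tool is easy to sketch once each visited page has a topic label. The labelling itself is the hard part and is simply assumed here; `diet_report` and the sample week are hypothetical.

```python
from collections import Counter

def diet_report(page_topics, watched_topics):
    """Summarize a browsing history given one topic label per page read.

    Returns (shares, missing): the fraction of pages per topic, and the
    watched topics that never appeared.
    """
    counts = Counter(page_topics)
    total = len(page_topics)
    shares = {topic: n / total for topic, n in counts.items()} if total else {}
    missing = [topic for topic in watched_topics if topic not in counts]
    return shares, missing

# A made-up week of reading: one label per page.
week = ["Sarah Palin"] * 4 + ["Sports"] * 3 + ["Technology"] * 3
shares, missing = diet_report(week, ["Latin America", "Sports"])
print(shares, missing)
```

As in the quote, the report only describes consumption; it does not prescribe what to read next.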
