About me ...

: Computer Science

: Information Systems
: Computer Science / Cognitive Science
: Artificial Intelligence
: Business Science
: Economics

Motivation / Executive summary

"Dem Volk aufs Maul sehen" (listen to how ordinary people actually talk):

Agenda
Concepts
From online buzz to text mining
Text mining: first steps
Text mining: going deeper
Closing the loop: from blogs to blogs
The representativeness challenge
So ...

What's a blog?

a (more or less) frequently updated publication on the Web, sorted in (usually reverse) chronological order of the constituent blog posts.

The content may reflect any interests including personal, journalistic or corporate.

Usually textual, but multimedia forms exist (photoblog, vblog, ...)

Blogs and other social media (Web 2.0)

Annotation platforms
(e.g., del.icio.us)

Wikis
(e.g., Wikipedia)
Social network sites
(e.g., MySpace)

Microblogging
(e.g., Twitter)

Blogs
(e.g., Livejournal; Huffington Post)
Sharing / linking by:
Hyperlinks, comments, blogroll, trackback links

Blogs and other social media, and some of their origins in older media

Computer-supported cooperative work

Bookmarks

Chatrooms
Annotation platforms
(e.g., del.icio.us)

Wikis
(e.g., Wikipedia)

Usenet

Social network sites
(e.g., MySpace)
Dating sites

www.dmoz.org

Microblogging
(e.g., Twitter)

Blogs
(e.g., Livejournal; Huffington Post)
Sharing / linking by:
Hyperlinks, comments, blogroll, trackback links

Diaries

(Often political) journalism

PR; press releases

What's market research?

identification, collection, analysis, and dissemination of information

for the purpose of assisting management in decision making related to the identification and solution of problems and opportunities in marketing

Traditional methods of (consumer) market research

Based on questioning:

Focus groups, surveys, questionnaires, ...

Based on observations:

Ethnographic studies - observe social phenomena in their natural setting - observations can occur cross-sectionally or longitudinally - examples include product-use analysis and computer cookie traces.

Experimental techniques - create a quasi-artificial environment to try to control spurious factors, then manipulate at least one of the variables - examples include purchase laboratories and test markets.

Capturing online buzz:
Bursty communication activities
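
One simple way to make "bursty" concrete is to flag periods whose post count lies far above the average. The sketch below is only an illustration under that assumption; the function name `find_bursts`, the factor `k`, and the sample counts are invented here, and the tools discussed in this talk may detect bursts quite differently.

```python
from statistics import mean, stdev


def find_bursts(counts, k=2.0):
    """Return indices of periods whose count exceeds mean + k * standard deviation."""
    threshold = mean(counts) + k * stdev(counts)
    return [i for i, c in enumerate(counts) if c > threshold]


# Hypothetical number of blog posts per day mentioning some topic.
daily_posts = [12, 15, 11, 14, 13, 90, 16, 12]
print(find_bursts(daily_posts))
```

A fixed global threshold like this ignores trends and seasonality, so it should be read as a starting point rather than a robust detector.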

Comparing search volume, news and blogs

The idea of text mining ...


... is to go beyond frequency-counting
... is to go beyond the search-for-documents framework
... is to find patterns (of meaning) within and across documents
(yes, there is text mining behind some of the things the above tools do!)

The steps of text mining


1. Application understanding
2. Corpus generation
3. Data understanding
4. Text preprocessing
5. Search for patterns / modelling
   Topical analysis
   Sentiment analysis / opinion mining
6. Evaluation
7. Deployment

Application understanding; Corpus generation

What is the question?

What is the context?

What could be interesting sources, and where can they be found?

Crawl

Use a search engine and/or archive

Google Blog Search

Technorati

Blogdigger

...

Preprocessing (1)
Data cleaning

Goal: get clean ASCII text

Remove HTML markup, pictures, advertisements, ...

Automate this: wrapper induction
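
A minimal sketch of the cleaning step, assuming Python's standard-library `html.parser` is sufficient for the page at hand. The class `TextExtractor`, the function `clean_html`, and the sample page are made up for illustration; real blog pages usually need site-specific rules (or wrapper induction, as noted above) to drop advertisements and navigation.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style content (hypothetical helper)."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def clean_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # Goal stated on the slide: clean ASCII text, so drop non-ASCII characters.
    return text.encode("ascii", errors="ignore").decode("ascii")


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>My blog</h1><p>Great <b>camera</b>!</p>"
    "<img src='ad.png'></body></html>"
)
print(clean_html(sample))
```

Dropping non-ASCII characters is lossy for most languages other than English; transliteration or keeping Unicode may be the better choice depending on the application.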

Preprocessing (2)
Further text preprocessing

Goal: get processable lexical / syntactical units

Tokenize (find word boundaries)

Lemmatize / stem

e.g., buyers, buyer → buyer / buyer, buying, ... → buy

Remove stopwords

Find Named Entities (people, places, companies, ...); filtering

Resolve polysemy and homonymy: word sense disambiguation; synonym unification

Part-of-speech tagging; filtering of nouns, verbs, adjectives, ...

...

Most steps are optional and application-dependent!

Many steps are language-dependent; coverage of non-English varies

Free and/or open-source tools or Web APIs exist for most steps
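
The first three steps above (tokenize, stem, remove stopwords) can be sketched in a few lines of plain Python. Everything here is a simplifying assumption: the stopword list is a tiny hand-picked sample, and `crude_stem` is a toy suffix stripper, not the Porter stemmer or a real lemmatizer.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "and", "of", "is", "are", "this", "to", "in"}
# Crude suffix rules, checked in this order; NOT the Porter stemmer.
SUFFIXES = ("ing", "ers", "er", "s")


def tokenize(text):
    """Lowercase the text and extract runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stopwords, then stem what remains."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOPWORDS]


print(preprocess("The buyers are buying this camera"))
```

On this input the sketch maps both "buyers" and "buying" to "buy", mirroring the example on the slide; on other words such naive suffix stripping will over- or under-stem.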

Preprocessing (3)
Creation of text representation

Goal: a representation that the modelling algorithm can work on

Most common forms: A text as

a set or (more usually) bag of words / vector-space representation: term-document matrix with weights reflecting occurrence, importance, ...

a sequence of words

a tree (parse trees)
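
To illustrate the bag-of-words form, here is a small sketch that builds a weighted matrix from already tokenized documents. The weighting used, term frequency times log(N / document frequency), is one common TF.IDF variant chosen here as an assumption; the function name and sample documents are invented. The matrix is stored with one row per document, i.e. the transpose of a term-document layout.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """docs: list of token lists. Returns (vocabulary, matrix) with one row
    per document and one TF.IDF weight per vocabulary term."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(tok for doc in docs for tok in set(doc))
    matrix = []
    for doc in docs:
        tf = Counter(doc)  # bag of words: order is discarded, counts are kept
        row = [tf[term] * math.log(n_docs / df[term]) for term in vocab]
        matrix.append(row)
    return vocab, matrix


docs = [["camera", "picture", "picture"], ["camera", "battery"]]
vocab, matrix = term_document_matrix(docs)
print(vocab)
for row in matrix:
    print([round(w, 3) for w in row])
```

Note that "camera" gets weight 0 in both rows because it occurs in every document, which is exactly the "importance" effect the weighting is meant to capture.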

Recall text data pre-processing ...

An important part of preprocessing: Named-entity recognition (1)

An important part of preprocessing: Named-entity recognition (2)

Technique: Lexica, heuristic rules, syntax parsing

Re-use lexica and/or develop your own

configurable tools such as GATE

A challenge: multi-document named-entity recognition

See proposal in Subašić & Berendt (Proc. ICDM 2008)
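
A toy sketch of the first two techniques named above, lexicon lookup plus a heuristic rule (syntax parsing is left out). The lexicon entries, the labels, and the capitalization rule are all assumptions made for illustration; this is not how GATE or any specific tool works.

```python
import re

# Hypothetical lexicon; in practice it would be re-used or built per domain.
LEXICON = {
    "Dell": "COMPANY",
    "Michelle Obama": "PERSON",
    "Italy": "PLACE",
}

# Heuristic rule: a run of capitalized words not in the lexicon is a candidate.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def find_entities(text):
    """Return (surface form, label) pairs in order of appearance: lexicon hits,
    plus capitalized runs that are not sentence-initial, labelled UNKNOWN."""
    found = []
    covered = []  # character spans already claimed by a lexicon entry
    for name, label in LEXICON.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((m.start(), name, label))
            covered.append((m.start(), m.end()))
    for m in CAPITALIZED_RUN.finditer(text):
        overlaps = any(m.start() < end and start < m.end() for start, end in covered)
        before = text[: m.start()].rstrip()
        sentence_initial = m.start() == 0 or before.endswith((".", "!", "?"))
        if not overlaps and not sentence_initial:
            found.append((m.start(), m.group(), "UNKNOWN"))
    return [(name, label) for _, name, label in sorted(found)]


text = "Yesterday Dell recalled laptops. Reporters asked Michelle Obama about Kodak."
print(find_entities(text))
```

The capitalization heuristic is English-specific and would, for example, fail on German, where all nouns are capitalized.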

The simplest form of content analysis is based on NER

More about named entities: co-occurrence

Source: Discussion boards, similar to blogs but (more) clearly communication-related

Co-occurrence of brands and attributes

More advanced text modelling: Summarization of time-indexed documents

Recall Michelle Obama

Google Trends, Blogpulse etc. associate documents / document sets with bursts

But: this means the user has to read the documents!

Can we do better and create a concise summary of what was discussed in that period?

Can we allow the user to ask for as much detail as s/he is interested in?

Yes, with STORIES

Salient story elements


1. Identify content-bearing terms (e.g., the 150 top-TF.IDF terms over the whole corpus)
2. Split whole corpus T by atomic time period (e.g., week)
3. For each time period (atomic or moving-average): compute the weights for corpus t for this period
   Weight = support of co-occurrence of 2 content-bearing terms w1, w2 in t
          = (# articles from t containing both w1, w2 in window) / (# all articles in t)
4. Threshold
   Number of occurrences of co-occurrence(w1, w2) in t ≥ θ1 (e.g., 5)
   Time-relevance TR of co-occurrence(w1, w2)
          = support(co-occurrence(w1, w2)) in t / support(co-occurrence(w1, w2)) in T ≥ θ2 (e.g., 2)
5. Thresholds are set dynamically + interactively by the user

Story elements = relationships = all these edges
Story basics = terms = all nodes connected by these edges
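
The steps above can be sketched as follows. This is an illustrative reading of the slide, not the authors' implementation: the co-occurrence window is assumed to be the whole document, the content-bearing terms are passed in rather than selected by TF.IDF, `period_docs` is assumed to be a subset of `all_docs`, and the names `cooccurrence_counts` and `story_stage` are invented here.

```python
from itertools import combinations


def cooccurrence_counts(docs, terms):
    """For each pair of content-bearing terms, count the documents containing both.
    Assumption: the co-occurrence window is the whole document."""
    counts = {}
    for doc in docs:
        present = sorted(terms & set(doc))
        for pair in combinations(present, 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


def story_stage(period_docs, all_docs, terms, theta1=5, theta2=2.0):
    """Return (edges, nodes) of the story graph for one period t.
    edges maps (w1, w2) -> time relevance TR; period_docs must be part of all_docs."""
    in_t = cooccurrence_counts(period_docs, terms)
    in_T = cooccurrence_counts(all_docs, terms)
    edges = {}
    for pair, n_t in in_t.items():
        support_t = n_t / len(period_docs)
        support_T = in_T[pair] / len(all_docs)
        tr = support_t / support_T
        # Keep the edge only if both thresholds from step 4 are met.
        if n_t >= theta1 and tr >= theta2:
            edges[pair] = tr
    nodes = {w for pair in edges for w in pair}
    return edges, nodes


# Hypothetical two-week corpus of tokenized articles.
terms = {"child", "missing", "police", "suspect"}
week1 = [["child", "missing", "police"]] * 3
week2 = [["police", "suspect"], ["police", "suspect", "child"], ["weather"]]
corpus = week1 + week2

edges, nodes = story_stage(week2, corpus, terms, theta1=2, theta2=1.5)
print(edges, nodes)
```

In the example, only the pair (police, suspect) survives for week 2: it occurs in two of that week's three articles and is twice as frequent there as in the corpus overall. Re-running `story_stage` with different thresholds corresponds to the interactive adjustment in step 5.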

Salient story stages, and story evolution

6. Story stage = the story graph made of basics and elements in t
7. Story evolution = how story stages evolve over the t in T

An event: a missing child

A central figure emerges in the police investigations

Uncovering more details

Uncovering more details

An eventless time

The story and the underlying documents

Navigating between documents; relating different source types to one another

A simple form of opinion mining: Feature-based Summary

Source: Product reviews, similar to blogs but (more) clearly product-related

(Hu and Liu, Proc. SIGKDD'04)

Review:

GREAT Camera., Jun 3, 2004
Reviewer: jprice174 from Atlanta, Ga.

I did a lot of research last year before I bought this camera... It kinda hurt to leave behind my beloved nikon 35mm SLR, but I was going to Italy, and I needed something smaller, and digital. The pictures coming out of this camera are amazing. The 'auto' feature takes great pictures most of the time. And with digital, you're not wasting film if the picture doesn't come out.
...

Summary:

Feature1: picture

Positive: 12
The pictures coming out of this camera are amazing.
Overall this is a good camera with a really good picture clarity.

Negative: 2
The pictures come out hazy if your hands shake even for a moment during the entire process of taking a picture.
Focusing on a display rack about 20 feet away in a brightly lit room during day time, pictures produced by this camera were blurry and in a shade of orange.

Feature2: battery life
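
A toy sketch of how such a summary could be assembled once features and opinion words are known. It is deliberately much simpler than the cited approach: the feature keywords and the positive/negative word lists are hand-picked assumptions, and a sentence's polarity is just the difference between positive and negative word counts.

```python
import re
from collections import defaultdict

# Hypothetical, hand-picked word lists; a real system would mine or import them.
FEATURES = {"picture": {"picture", "pictures"}, "battery life": {"battery"}}
POSITIVE = {"amazing", "great", "good"}
NEGATIVE = {"hazy", "blurry", "bad"}


def feature_summary(review_sentences):
    """Group opinion sentences by product feature and polarity."""
    summary = defaultdict(lambda: {"positive": [], "negative": []})
    for sentence in review_sentences:
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        score = len(words & POSITIVE) - len(words & NEGATIVE)
        if score == 0:
            continue  # no opinion word, or a tie: skipped in this sketch
        polarity = "positive" if score > 0 else "negative"
        for feature, keywords in FEATURES.items():
            if words & keywords:
                summary[feature][polarity].append(sentence)
    return dict(summary)


sentences = [
    "The pictures coming out of this camera are amazing.",
    "The pictures come out hazy if your hands shake even for a moment "
    "during the entire process of taking a picture.",
    "I was going to Italy, and I needed something smaller, and digital.",
]
summary = feature_summary(sentences)
for feature, groups in summary.items():
    print(feature, "positive:", len(groups["positive"]), "negative:", len(groups["negative"]))
```

Word-list matching like this misses negation ("not good") and implicit features, which is why the published methods are considerably more involved.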

An application: Crisis PR Step 1: Use blogs to observe public discussions

Detect products about which there is controversial discussion

sentiment mining from text

and/or

use the structure of blogs (e.g., structure of blog post + comments; Mishne & Glance, Proc. WWW 2006)

and/or

discussion in the mainstream media (may be later though)

An application: Crisis PR Step 2: Use blogs to communicate facts + own concerns

Example: Dell's exploding laptops (product recall and aftermath)

Dell launched a blog at that time (much maligned at first, but they learned ...)

Evaluation of all English-language consumer commentary on the Web before and after (methodology based on Reichheld 1996, The Loyalty Effect):
