The overwhelming number of Web videos returned by search engines makes effective browsing and search a challenging task.
Rather than a conventional ranked list, it becomes necessary to organize the retrieved videos in alternative ways. In this article,
we explore topic mining and the organization of retrieved web videos into semantic clusters. We present a framework
for clustering-based video retrieval and build a visualization user interface. A hierarchical topic structure is exploited to encode
the characteristics of the retrieved video collection, and a semi-supervised hierarchical topic model is proposed to guide the topic
hierarchy discovery. Carefully designed experiments on a web-scale video dataset collected from video sharing websites validate
the proposed method and demonstrate that clustering-based video retrieval is a practical way to help users browse effectively.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—
Clustering
General Terms: Algorithms, Design, Experimentation, Performance
Additional Key Words and Phrases: Hierarchical topic model, search result clustering, semisupervised learning, social media,
topic mining, video retrieval
ACM Reference Format:
Sang, J. and Xu, C. 2011. Browse by chunks: Topic mining and organizing on web-scale social media. ACM Trans. Multimedia
Comput. Commun. Appl. 7S, 1, Article 30 (October 2011), 18 pages.
DOI = 10.1145/2037676.2037687 http://doi.acm.org/10.1145/2037676.2037687
1. INTRODUCTION
With the development of multimedia technology and increasing proliferation of social media in Web 2.0,
an overwhelming volume of professional and user-generated videos has been posted to video sharing
websites. YouTube,1 one of the most popular video sharing websites, announced that its users upload
about 65,000 new videos and view more than 100 million videos each day. To detect and track hot events
or topics, more and more people prefer to search and watch videos on the web, which is timely and
1 http://www.youtube.com.
This work was supported by the National Natural Science Foundation of China (Grant No. 90920303) and 973 Program (Project
No. 2012CB316304).
Authors’ address: J. Sang and C. Xu (corresponding author): National Lab of Pattern Recognition, Institute of Automation, CAS,
Beijing 100190 China; email: {jtsang, csxu}@nlpr.ia.ac.cn.
ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 7S, No. 1, Article 30, Publication date: October 2011.
Fig. 1. An example page from YouTube for the query ‘9/11 attack’. 7,800 videos are returned. Alternative search options are also
shown.
convenient. With the explosion of shared videos, there is heavy demand for an effective way for users to
retrieve and access videos of interest. The goal of this work is to offer a novel topic mining
and organizing solution and to build a visualization user interface that displays topics as hierarchical
semantic clusters, helping users browse the retrieved videos and locate interesting ones.
Conventional video search engines order the retrieved videos according to their relevance to the
query. When a user issues a query, search engines return a ranked list including hundreds or thousands
of matches. Users have to painstakingly browse through the long list to judge whether the results
match their requirements and then locate the interesting videos. One question naturally arises: in
addition to a ranked list, is there any more effective way to organize the retrieved videos?
Clustering the returned videos into semantically consistent groups and visualizing them offers an alternative
solution. Clustering the retrieved videos can help users get a quick overview of the retrieved video set
and thus locate interesting videos more easily. YouTube provides several options that allow users to
filter search results by Upload date, Category, Duration and Features (see Figure 1). While these coarse
groups cover generic categories of the videos, they give users little information for understanding
the internal configuration and semantic meaning of the returned video collection. There have also
been research attempts [Liu et al. 2008b; Ramachandran et al. 2009] at employing clustering to assist
video retrieval. The strategy was to build a static clustering of the entire collection and then match the
query to the cluster centroids. This is so-called pre-retrieval clustering. From the perspective of feature
selection, pre-retrieval clustering is based on features that are frequent in the whole collection but
irrelevant to the query, whereas post-retrieval clusters are tailored to the characteristics of the query
and make use of query-specific features. We cannot assume clustering to play a one-size-fits-all
classification role. Therefore, it is more reasonable to apply clustering as a postprocessing step. In this
article we propose a post-retrieval web video clustering method for cluster-based video retrieval (see
Figure 2 for illustration).
Our method is motivated by the following observation. A glance at the example
in Figure 1 shows that almost all the returned videos contain words like 9/11, attack, terrorism,
WTC, etc. This phenomenon implies that although diverse topics are involved in the retrieved video
Fig. 2. Visualization of the user interface for cluster-based video retrieval: (top) When a user submits a query, the underlying topic hierarchy
is exploited and displayed on the left as a complementary view to the conventional flat list. (bottom) When a user chooses one
subtopic (video cluster), the included videos are shown on the right in order of their relevance to this subtopic, as computed
by Equation (2).
collection, they usually share one common topic referred to in the query, and we refer to the shared topic
as the parent topic. We elaborate this idea with the same example in Figure 2 (top). On the left, the
circle illustrates the latent semantic structure of the retrieved video collection. Each color along the
circle denotes a subtopic, annotated with a tag cloud of its eight most probable terms. The length of
the arc is proportional to the number of videos belonging to this subtopic. The parent topic at the
root node of the hierarchy is located in the center. Each retrieved video can be viewed as a combination
of the parent topic and one child topic (subtopic). In this example, the subtopics can be enumerated as Live attack and rescue
video, Domestic and international response afterwards, Investigation, and The Else: long-term effect and
memorial.
Inspired by this, we extend the hierarchical topic model [Blei et al. 2010] to exploit a two-level
topic tree in the retrieved video collection and cluster the collected videos into the leaf-level subtopics.
Compared with flat-structure clustering methods (e.g., k-means, LDA), the hierarchical
topic model prevents the shared topic from being mixed into other topics and thus helps ensure
clustering performance. Furthermore, we encode the consistency between the query and the root-level
topic (we denote it as the query-root-topic knowledge in this paper), as the prior information to form a
semi-supervised hierarchical topic model.
Since there are no ready metrics for evaluating the performance of cluster-based video retrieval, we
refer to text search result clustering and employ objective metrics as well as user study tasks to assess
the performance of the proposed method.
2. RELATED WORK
In this section, we review previous research on Web video mining and search result clustering.
The relations are also discussed.
with interpretations of computer, ipod, logo and fruit [Cai et al. 2004]; the query sting with interpretations
of musician, wrestler and film [Hindle et al. 2010]. In this article we focus on more complex queries
concerning political and social events or issues. The semantic clusters inside the returned videos are
diverse aspects of the events corresponding to the query (e.g., the query 9/11 attack, see Figure 2) or different
viewpoints on controversial issues (e.g., the query abortion with the opposing viewpoints ‘pro-life’ and ‘pro-choice’).
In this case, a limited set of general terms is insufficient for users to understand the subtopics; a subtopic is
best described by a set of representative keywords. In this paper, we introduce a topic model to describe
each subtopic with a probability distribution over terms in a large vocabulary.
In addition, motivated by the observation that the returned results share one common topic, we
explicitly incorporate this basic characteristic into the clustering process and exploit the inherent hierarchical
topic structure.
3. FRAMEWORK
In this article, we propose a hierarchical topic model based framework for clustering-based video retrieval.
The framework contains two steps: query expansion and topic hierarchy discovery based on a hierarchical
topic model. The input of our algorithm is web videos collected from video sharing websites,
and the output is the generated video clusters as well as the topic hierarchy. This is shown in Figure 3.
When a video sharing website (e.g., YouTube, Metacafe, Vimeo) receives a query from
a user, its search engine returns a raw ranked list of videos. Metadata around each video are
collected and represented as a document-term matrix.
Hierarchical Latent Dirichlet Allocation (hLDA) [Blei et al. 2010; Blei et al. 2004] is a generalization
of the (flat) Latent Dirichlet Allocation (LDA) model [Blei et al. 2003]. We employ hLDA for unsupervised
discovery of the topic hierarchy in the retrieved video collection. To effectively incorporate
query-relevant terms into the root topic, we employ association mining as well as WordNet conceptual
relations between words to expand the query words, resulting in a seed word set. The seed word set
is viewed as supervision information (query-root-topic knowledge), and an extension to the standard
hLDA, semi-supervised hLDA (SShLDA), is proposed to guide the inference of the topic hierarchy.
After probabilistic inference of topic modeling, each video is assigned a single path from the root
node to a leaf node. The videos assigned to the same path will be grouped together to form a cluster
and the subtopics in the leaf node constitute the description for the corresponding video clusters.
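The grouping step described above can be sketched in a few lines. This is only an illustration, not the authors' implementation: the function name `group_by_path`, the node ids, and the example assignments are hypothetical, and the path assignments are assumed to come from an already fitted model.

```python
from collections import defaultdict


def group_by_path(path_assignments):
    """Group video ids by their assigned root-to-leaf path.

    path_assignments maps a video id to a tuple of node ids (root, ..., leaf).
    Videos sharing the same path end up in the same cluster.
    """
    clusters = defaultdict(list)
    for video_id, path in path_assignments.items():
        clusters[tuple(path)].append(video_id)
    return {path: sorted(ids) for path, ids in clusters.items()}


# Hypothetical two-level assignments: node 0 is the root, nodes 1..3 are leaves.
example_paths = {"v1": (0, 1), "v2": (0, 2), "v3": (0, 1), "v4": (0, 3)}
clusters = group_by_path(example_paths)
```

Each key of `clusters` identifies one leaf subtopic, whose term distribution would then serve as the cluster description.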
The contributions of this article are summarized as follows: 1) We propose a novel solution framework
for clustering-based video retrieval. A hierarchical topic model is introduced to explore the inherent
hierarchical topic structure in the retrieved video collection. 2) Query-root-topic knowledge is incorporated
to guide the topic hierarchy discovery, and a semi-supervised extension to the standard hierarchical
topic model is presented. 3) For cluster representation, topics characterized by term distributions
are utilized to deal with complex queries of political and social events or issues.
4. QUERY EXPANSION
Query expansion (QE) is the process of reformulating a seed query to improve retrieval performance
in information retrieval operations. For Web search engines, query expansion involves evaluating a
user’s input and expanding the search query to match additional documents. In our case, we employ
query expansion, combining WordNet and association mining, to extend the query terms into a seed
word set $S = \{s_1, \ldots, s_C\}$, which composes the root topic of the derived topic hierarchy.
WordNet [Miller et al. 1990] is an online lexical dictionary that describes word relationships along
three dimensions: hypernym, hyponym and synonym. It is organized conceptually. As in Figure 4,
fight is a hypernym of the verb attack and bombing is a hyponym of the noun attack. Gong et al.
[2005] utilized WordNet noun hypernym/hyponym and synonym relations between words to expand
queries. To avoid bringing in noisy terms, they supplemented their method with a term semantic
network to filter out low-frequency and unusual words. Under our mechanism for incorporating
the supervision information (detailed in Section 5), adding noisy words that are not in the vocabulary
does not detract from the topic modeling process. This means we can extend the query as
much as we like, on condition that no words concerned with subtopics are mixed in. Therefore, we exclude
words having hyponym or troponym relations to the query in WordNet. In addition, instead of removing
unusual words, we employ association mining and add high-frequency words to the seed word set.
We utilize WordNet as the basic rule to extend the query along two dimensions: hypernym
and synonym relations. The original query 9/11 attack, for instance, may be expanded to include 911
attack assault aggress assail fight struggle contend onslaught onset attempt operation approach event.
Since WordNet has narrow coverage for domain-specific queries [Chandramouli et al. 2008], we use
association rules to exploit collection-dependent word relationships. We examine the vocabulary and
add to the query expansion the words that rank in the top 10 for both confidence and support with
respect to the original query words. The final seed word set of query 9/11 attack may be S = {911 attack Assault aggress
Fig. 5. (a) LDA graphical model. (b) Hierarchical LDA graphical model. (c) Semi-supervised Hierarchical LDA graphical model.
μ controls the strength of the constraint derived from the seed set. The proposed SShLDA differs from standard hLDA
in the way w is generated.
assail fight struggle contend onslaught onset attempt operation approach event wtc world trade center
terrorist terrorism 9-11}.
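As a rough illustration of how the two expansion sources could be combined, the sketch below merges a lexicon lookup with an association-mining filter. Several things here are assumptions rather than details from the text: the `lexicon` dictionary is a hypothetical stand-in for WordNet hypernym and synonym lookups, the toy `docs` collection is invented, and the rule direction is not stated in the text, so confidence is computed for the rule "word implies query word".

```python
def association_scores(docs, query_word):
    """Return {word: (support, confidence)} for words co-occurring with query_word.

    support    = fraction of documents containing both words
    confidence = fraction of documents containing `word` that also contain
                 query_word (rule direction is an assumption of this sketch)
    """
    doc_sets = [set(d) for d in docs]
    n_docs = len(doc_sets)
    vocabulary = set().union(*doc_sets) if doc_sets else set()
    scores = {}
    for w in vocabulary - {query_word}:
        both = sum(1 for d in doc_sets if w in d and query_word in d)
        df = sum(1 for d in doc_sets if w in d)
        if both:
            scores[w] = (both / n_docs, both / df)
    return scores


def top_k(scores, index, k):
    """Words with the k highest values of scores[word][index]; ties broken alphabetically."""
    ranked = sorted(scores, key=lambda w: (-scores[w][index], w))
    return set(ranked[:k])


def expand_query(query_words, lexicon, docs, k=10):
    """Build a seed word set from lexicon relations plus association mining."""
    seed = set(query_words)
    for q in query_words:
        seed.update(lexicon.get(q, ()))  # hypernyms/synonyms (hypothetical lookup)
        scores = association_scores(docs, q)
        # Keep only words that are in the top k for both support and confidence.
        seed |= top_k(scores, 0, k) & top_k(scores, 1, k)
    return seed


# Toy collection and lexicon, invented for illustration only.
docs = [
    ["911", "attack", "wtc", "terrorism"],
    ["911", "wtc", "tower"],
    ["attack", "terrorism", "news"],
    ["music", "video"],
    ["911", "attack", "wtc"],
]
lexicon = {"attack": ["assault", "fight"]}
seed = expand_query(["911", "attack"], lexicon, docs, k=2)
```

Requiring a word to rank highly on both measures filters out words that are frequent everywhere (low confidence) as well as rare words that co-occur with the query only by chance (low support).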
\[
P(w_i) = \sum_{j=1}^{K} P(w_i \mid z_i = j)\, P(z_i = j), \tag{1}
\]
where $P(w_i \mid z_i = j) = \beta_{ij}$ is the probability of word $w_i$ in topic $j$ and $P(z_i = j) = \theta_j$ is a document-specific
mixing weight indicating the proportion of topic $j$.
LDA treats the multinomial parameters θ and β as latent random variables sampled from Dirichlet
priors with hyperparameters α and η, respectively. The corresponding graphical model is shown in
Figure 5(a).
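A small numerical example may make Equation (1) concrete. The topic-word probabilities and mixing weights below are made-up values, and `word_probability` is a hypothetical helper name.

```python
def word_probability(word, beta, theta):
    """Mixture probability of a word: sum over topics j of theta[j] * beta[j][word].

    beta[j] maps words to their probability in topic j;
    theta[j] is the document-specific weight of topic j.
    """
    return sum(theta[j] * beta[j].get(word, 0.0) for j in range(len(theta)))


# Two illustrative topics (values invented for this sketch).
beta = [
    {"attack": 0.6, "rescue": 0.4},
    {"attack": 0.1, "memorial": 0.9},
]
theta = [0.75, 0.25]

p_attack = word_probability("attack", beta, theta)  # 0.75 * 0.6 + 0.25 * 0.1
```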
Hierarchical LDA. The LDA model described above has a flat topic structure. Each document is
a superposition of all K topics with document-specific mixture weights. The hierarchical LDA model
organizes topics in a tree of fixed depth L. Each node in the tree has an associated topic, and each
document is assumed to be generated by the topics on a single path from the root to a leaf of the
tree. Note that all documents share the topic associated with the root node. This feature of hLDA is
consistent with the characteristics of search result collections mentioned in Section 1.
The merit of the hLDA model is that both the topics and the structure of the tree are learnt from
the training data. This is achieved by placing a nested Chinese restaurant process (nCRP) [Teh et al.
2006] prior on the tree structure. nCRP specifies a distribution on partitions of documents into paths
in a fixed depth L-level tree. To generate a tree structure from nCRP, assignments of documents to
paths are sampled sequentially, where the first document forms an initial L-level path, i.e. a tree with
a single branch. The probability of creating novel branches is controlled by parameter γ , where smaller
values of γ result in a tree with fewer branches.
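The sequential path sampling described above can be sketched as follows. This is a minimal illustration of drawing from a nested Chinese restaurant process prior, assuming the standard rule that an existing child is chosen with probability proportional to the number of documents already passing through it and a new child with probability proportional to γ. The function name and node encoding are hypothetical.

```python
import random


def sample_ncrp_paths(n_docs, depth, gamma, rng):
    """Sequentially assign documents to root-to-leaf paths in a depth-level tree.

    Nodes are encoded as tuples of child indices starting from the root (0,).
    gamma must be positive.
    """
    counts = {}    # node -> number of documents whose path passes through it
    children = {}  # node -> list of child nodes created so far
    paths = []
    root = (0,)
    for _ in range(n_docs):
        node = root
        counts[node] = counts.get(node, 0) + 1
        for _level in range(1, depth):
            kids = children.setdefault(node, [])
            # Existing children are weighted by their counts; the extra
            # last slot, weighted by gamma, stands for a brand-new branch.
            weights = [counts[k] for k in kids] + [gamma]
            choice = rng.choices(range(len(weights)), weights=weights)[0]
            if choice == len(kids):
                child = node + (len(kids),)
                kids.append(child)
            else:
                child = kids[choice]
            counts[child] = counts.get(child, 0) + 1
            node = child
        paths.append(node)
    return paths


paths = sample_ncrp_paths(n_docs=50, depth=2, gamma=1.0, rng=random.Random(0))
```

The first document always opens a new branch at every level, which matches the description of the initial single-branch tree.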
In hLDA, each document is assumed to be drawn from the following process.
i. Pick an L-level path $c_d$ from the nCRP prior: $c_d \sim \mathrm{nCRP}(\gamma)$.
ii. Sample an L-dimensional topic proportion vector $\theta_d \sim \mathrm{GEM}(m, \pi)$.
iii. For each word $w_{d,n} \in w_d$:
(a) Choose a level $z_{d,n} \in \{1, \ldots, L\}$, $z_{d,n} \sim \mathrm{Discrete}(\theta_d)$;
(b) Sample a word $w_{d,n} \sim \mathrm{Discrete}(\beta_{c_d} \mid z_{d,n})$, which is parameterized by the topic at level $z_{d,n}$ on the
path $c_d$.
The corresponding graphical model is shown in Figure 5(b). Further details of hLDA can be found in
Blei et al. [2010].
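Steps ii and iii of the process above can be sketched for a single document whose path has already been chosen. Two assumptions are made that the text does not spell out: the stick-breaking fractions are drawn as Beta(mπ, (1 − m)π), a common parameterization of GEM(m, π), and the infinite stick is truncated so that the last level absorbs the remaining mass. Function names and the toy topics are hypothetical.

```python
import random


def sample_gem_truncated(m, pi, depth, rng):
    """Stick-breaking level proportions, truncated to `depth` levels.

    Assumes break fractions V ~ Beta(m * pi, (1 - m) * pi), with 0 < m < 1.
    """
    theta, remaining = [], 1.0
    for _ in range(depth - 1):
        v = rng.betavariate(m * pi, (1.0 - m) * pi)
        theta.append(remaining * v)
        remaining *= 1.0 - v
    theta.append(remaining)  # last level takes whatever is left of the stick
    return theta


def generate_document(path_topics, n_words, m, pi, rng):
    """Generate words for one document along a fixed path.

    path_topics[l] is the word distribution of the topic at level l + 1
    on the document's path (index 0 is the root).
    Returns (theta, levels, words) with 1-based levels as in the text.
    """
    depth = len(path_topics)
    theta = sample_gem_truncated(m, pi, depth, rng)
    levels, words = [], []
    for _ in range(n_words):
        z = rng.choices(range(depth), weights=theta)[0]
        vocab, probs = zip(*sorted(path_topics[z].items()))
        words.append(rng.choices(vocab, weights=probs)[0])
        levels.append(z + 1)
    return theta, levels, words


# Toy root and leaf topics, invented for illustration.
topics = [{"attack": 0.7, "911": 0.3}, {"rescue": 0.5, "memorial": 0.5}]
theta, levels, words = generate_document(topics, 20, m=0.1, pi=10.0, rng=random.Random(1))
```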
5.2 Semi-Supervised Hierarchical LDA Model
When we utilize hierarchical topic model for the video clustering task, one subtopic corresponds to
one cluster. The cluster membership of each video is decided by its posterior path assignment cd. The
cluster videos are sorted by their proportion on the subtopic as computed by:
\[
\frac{\sum_{w_{d,n} \in w_d} \left| z_{d,n} = 2 \right|}{N_d}, \tag{2}
\]
where $|\cdot|$ is the indicator function, the numerator counts the words allocated to the leaf level, and $N_d$
denotes the number of words in document $d$.
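The ranking rule of Equation (2) amounts to counting how many of a video's words were allocated to the leaf level. A minimal sketch, with hypothetical function names and made-up level allocations for a two-level tree:

```python
def leaf_proportion(levels, leaf_level=2):
    """Fraction of a document's words allocated to the leaf level."""
    return sum(1 for z in levels if z == leaf_level) / len(levels)


def rank_cluster(level_allocations):
    """Sort the videos of one cluster by decreasing leaf-level proportion.

    level_allocations maps a video id to the list of levels assigned to its words.
    Ties are broken by video id to keep the order deterministic.
    """
    return sorted(
        level_allocations,
        key=lambda v: (-leaf_proportion(level_allocations[v]), v),
    )


# Invented allocations: 1 = root level, 2 = leaf level.
allocations = {"v1": [1, 2, 2, 1], "v2": [2, 2, 2, 1], "v3": [1, 1, 1, 2]}
ranking = rank_cluster(allocations)
```

Videos whose words are mostly explained by the subtopic rather than the shared root topic appear first.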
To incorporate the query-root-topic knowledge into the hierarchical topic modeling, we propose an
extension to the standard hLDA, which we call Semi-Supervised Hierarchical LDA model (SShLDA).
The supervised information we add is the seed word set derived from query expansion, $S = \{s_1, \ldots, s_C\}$.
We jointly model the documents and the seed word set, in order to guide the discovery of topic hierarchy
so that the words in the seed word set will have high probability in the root topic and low probability
in subtopics.
We first explain how query-root-topic knowledge can be incorporated into the topic modeling process.
In the standard hLDA, the topic level allocation $z_{d,n}$ for word $n$ in document $d$ is a latent variable and
needs to be inferred through the model learning process. Assume we have supervised information
about $z_{d,n}$, that is, the topic level allocation for a given word in a given document. This can be seen as
similar to semi-supervised learning with labeled features [Druck et al. 2008]. In our case, we call
it a hard constraint when the seed set words are restricted to appear only in the root topic. In
practical applications, each word tends to be generated from every topic with different probabilities.
Therefore, we relax this strong assumption. Instead of providing the topic level allocation $z_{d,n}$ for each seed
word, we modify the generative process of standard hLDA so that sampling seed words from the root topic
and from subtopics will have different probabilities.
Specifically, the proposed SShLDA differs from hLDA in the way wd,n is generated. The generative
process of SShLDA is:
i. Pick an L-level path $c_d$ from the nCRP prior: $c_d \sim \mathrm{nCRP}(\gamma)$.
ii. Sample an L-dimensional topic proportion vector $\theta_d \sim \mathrm{GEM}(m, \pi)$.
iii. For each word $w_{d,n} \in w_d$:
Fig. 6. A state of the Markov chain in the Gibbs sampler for the title and tag of “Mossad follow up - start asking questions
why this isnt being exposed.” The document is associated with a path $c_d$ through the hierarchy, and each node in the hierarchy
is associated with a distribution over words. Finally, each word $w_{d,n}$ in the title and tag is associated with a level $z_{d,n}$ in the
path $c_d$, with 1 being the root level and 2 being the leaf level. Other words without level allocations are removed as stop-words
in preprocessing. As the constrained sampling proceeds, seed words like 911, attack, terrorism, etc. become more and more
likely to be generated from the root topic.
We now incorporate the supervision of the seed word set. We set a soft constraint by modifying the Gibbs
sampling process so that seed words tend to be generated from the root topic ($z_{d,n} = 1$):
where the definition of Constraint($\mu$, $z_{d,n}$) is in Equation (3). Following this sampling process, the words
relevant to the query are guaranteed to have a higher probability of being assigned to the root topic, leaving
the subtopics to focus more on refined terms. We emphasize that SShLDA accommodates the case where the derived
vocabulary V does not include the terms in the seed set.
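To give a feel for what a soft constraint of this kind does during level sampling, the sketch below reweights a word's level distribution when the word is in the seed set. The weighting used here (μ for the root level, 1 − μ for every other level) is a hypothetical stand-in chosen for illustration only. It is not the Constraint function of Equation (3), which is not reproduced in this text.

```python
def constrained_level_probs(level_probs, is_seed_word, mu):
    """Reweight and renormalize level probabilities for one word.

    level_probs[0] is the unconstrained probability of the root level.
    For seed words the root level is multiplied by mu and all other levels
    by (1 - mu); this weighting is an assumption of this sketch and requires
    0 < mu < 1. Non-seed words are left unchanged.
    """
    if not is_seed_word:
        return list(level_probs)
    weighted = [
        p * (mu if level == 0 else 1.0 - mu)
        for level, p in enumerate(level_probs)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]


unconstrained = [0.5, 0.5]
pushed_to_root = constrained_level_probs(unconstrained, is_seed_word=True, mu=0.8)
untouched = constrained_level_probs(unconstrained, is_seed_word=False, mu=0.8)
```

Under this particular stand-in, a larger μ shifts more probability mass of seed words to the root level, while words outside the seed set are sampled exactly as in standard hLDA.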
The second term in Equation (4) is a distribution over levels which is concerned with the GEM
distribution of the stick breaking process. We keep it unchanged:
Sampling Path Assignments. Keeping the level allocation variables z fixed, we re-sample the path
assignment $c_d$ associated with each document, which may result in the deletion or creation of a branch in
the tree. This is the same as in the standard hLDA [Blei et al. 2010].
where the first term is the prior on paths implied by nCRP, and the second one is the probability of the
data given a particular choice of path.
With these conditional distributions, the full Gibbs sampling process is specified. Given the current
state of the sampler, $\{c^{(t)}_{1:D}, z^{(t)}_{1:D}\}$, we iteratively sample each variable conditioned on the rest. After
running for sufficiently many iterations, the chain approaches its stationary distribution, which is the conditional
distribution of the latent variables in the SShLDA model given the corpus and seed word set.
6. EXPERIMENTS
Among the different metadata around a video, title, tag and description are more likely to be informative
in revealing the semantic meaning. Other metadata (e.g., comments) could also be mined,
but we leave this for future research. We present two experiments to demonstrate the performance
of the proposed clustering-based video retrieval framework. First, following text search result
clustering, we compare subtopic reach time against state-of-the-art algorithms on a benchmark dataset.
Then we assess the retrieval effectiveness on a web-scale video dataset collected from video
sharing websites.
6.1 Dataset
Text subtopic retrieval dataset. We utilized a benchmark text search result clustering evaluation
dataset, AMBIENT.2 AMBIENT consists of 44 topics, each with a set of subtopics and a list of 100
search results with corresponding subtopic relevance judgments. The topics were selected from the list
of ambiguous Wikipedia entries. The 100 search results associated with each topic were collected from
Yahoo, and their relevance to each of the Wikipedia subtopics was manually assessed.
Video sharing Web site dataset. Since the goal of this paper is to present a clustering-based browsing
algorithm for Web video retrieval, it is important to devise methods for evaluating its performance
on real video sharing websites. After careful examination of the hottest topics on YouTube, Google
Zeitgeist, and Twitter, we selected seven social and political topics as queries. We issued these queries
to YouTube, Metacafe, and Vimeo, and crawled the top 500, 150 and 150 returned videos (where available) for
experiments, respectively. We focused on the topmost search results to avoid bringing in too many unrelated
videos. Videos with no tags were filtered out. The videos collected for each query form a video
set. The queries and information about the corresponding video sets are listed in Table I.
6.2 Parameter Settings
The work in Hindle et al. [2010] (we refer to it as BCS) has a similar motivation, but our work differs from
theirs in several aspects: 1) BCS employs a flat-structure clustering algorithm; 2) BCS uses the cluster
centroid to represent the cluster and provides no mechanism for deriving the cluster labels. Since
this is the work most relevant to ours, we ran their method on our dataset as a comparison.
The most important parameters for BCS are the weights for the adopted features: visual, tag, title, and
description. Affinity propagation (AP) and normalized cut (NC) are utilized as the clustering algorithms,
and they demonstrated that AP generally outperforms NC. Therefore, we fixed the set of feature weights
showing the best performance with AP clustering: visual-0.3, tag-0.49, title-0.07, description-0.14.
To further evaluate the advantage of exploiting a hierarchical topic structure, we also implemented
LDA and compared it with hLDA and SShLDA. Topic models make assumptions about the topic structure
through the settings of hyperparameters. We empirically fixed the hyperparameters according to the
2 http://credo.fub.it/ambient.
prior expectation about the data. The hyperparameter η controls the smoothing/sparsity of the topic-word
distribution. Small η encourages more words to have high probability in each topic. (For LDA, it requires
fewer topics to explain the data. For hLDA and SShLDA, it leads to a small tree with compact
topics.) Motivated by this, we empirically chose a relatively small value and set η = 0.5. Both
hLDA and SShLDA have an additional hyperparameter, the CRP parameter γ, which decides the size of
the inferred tree. As in Blei et al. [2004], we set γ = 1 to reduce the likelihood of choosing new paths
when traversing the nCRP.
The Dirichlet prior hyperparameter α for LDA and the GEM parameters m, π for hLDA and SShLDA
control the mixing of the document-topic distribution. For LDA, our goal is to group documents
into topic-specific clusters according to the dominant topic proportions. Therefore, α is fixed to a value
much larger than 1 (α = 50) to encourage high mixing of topics. For hLDA and SShLDA, the GEM parameters
m, π define the stick-breaking distribution. We set m to a small value (m = 0.1), so that the
posterior is more likely to assign more words to the leaf level of the inferred tree. Setting the variance parameter π
to a small value (π = 10) means that the word allocation adheres to the parameter settings, which
accelerates convergence.
To choose the supervision strength parameter μ, we divided the AMBIENT dataset into two subsets:
one consisting of 10 topics for determining μ and one consisting of 34 topics for evaluating
the clustering performance. We assume that an appropriate μ brings no perturbation to the hierarchical
topic discovery process, so the derived topic tree should be consistent with the latent hierarchical
structure. Therefore, we analyzed the error between the ground-truth subtopic number and the derived
subtopic number over different values of μ (see Figure 7). μ = 0.5 achieves the least error, so
we fixed μ = 0.5 in the following experiments. We also compared the retrieval performance
with respect to various μ in Section 6.4, and found that the performances for the different
queries share a similar variation pattern: the results deteriorate as μ approaches 0 or 1, and there is
little difference when μ ∈ [0.3, 0.6]. Therefore, for practical implementations where a training set is not
available, we suggest setting μ = 0.5.
Table II. Comparison of subtopic reach time (SRT) of state-of-the-art text search result
clustering algorithms with LDA, hLDA and SShLDA on the AMBIENT text collection.

Algorithm   CREDO   Lingo   Lingo3G   STC     TRSC    LDA     hLDA   SShLDA
SRT         14.96   15.05   13.11     15.82   17.46   15.73   12.7   10.92
When no appropriate cluster fits the subtopic at hand, or the selected cluster does not contain any
relevant result, SRT is given by the number of clusters plus the position of the first result relevant to
the subtopic in the ranked list.
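The fallback rule above can be written down directly. The sketch below also covers the case where the chosen cluster does contain a relevant result; for that case it assumes the usual definition of SRT from the text search result clustering literature (position of the cluster in the cluster list plus position of the first relevant result inside it), which is not spelled out in this text. All names and example data are hypothetical.

```python
def subtopic_reach_time(clusters, ranked_list, is_relevant, chosen_cluster=None):
    """Subtopic reach time for one subtopic.

    clusters       : list of clusters in display order, each a list of result ids
    ranked_list    : the original flat ranked list of result ids
    is_relevant    : function telling whether a result is relevant to the subtopic
    chosen_cluster : 0-based index of the cluster the user opens, or None when
                     no cluster label fits the subtopic
    """
    if chosen_cluster is not None:
        for pos, result in enumerate(clusters[chosen_cluster], start=1):
            if is_relevant(result):
                # Assumed in-cluster case: cluster position + position in cluster.
                return (chosen_cluster + 1) + pos
    # Fallback from the text: number of clusters + position in the ranked list.
    for pos, result in enumerate(ranked_list, start=1):
        if is_relevant(result):
            return len(clusters) + pos
    return None  # the subtopic has no relevant result at all


# Invented example: only result "d" is relevant to the subtopic.
clusters = [["a", "b"], ["c", "d", "e"]]
ranked = ["a", "c", "b", "d", "e"]
relevant = lambda r: r == "d"

srt_hit = subtopic_reach_time(clusters, ranked, relevant, chosen_cluster=1)
srt_wrong_cluster = subtopic_reach_time(clusters, ranked, relevant, chosen_cluster=0)
srt_no_cluster = subtopic_reach_time(clusters, ranked, relevant)
```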
We noticed that Hindle et al. [2010] adopted visual features in the clustering process, so it is not
fair to examine their method on the text-based AMBIENT dataset. Therefore, we only compared the results (which
are averaged over the test set of 34 queries) of LDA, hLDA and the proposed SShLDA with state-of-the-art
text search clustering algorithms in Table II (the results of the text search clustering algorithms are taken
from [Carpineto et al. 2009]). Four graduate students participated in the user study task as subjects.
The best performance is achieved by SShLDA, followed by hLDA, which is due to the separation of the
shared common topic from the subtopics. It is interesting to note that the SRT for LDA is relatively high.
The topics included in AMBIENT are mostly general terms, e.g., Eos, Cube, B-52, so the descriptive power of
topic models for complex queries cannot be exerted.
Fig. 8. (left) The ground-truth subtopic number and automatically derived cluster number for test queries. (right) Subtopic
reach time (SRT) for test queries.
BCS noticeably outperforms the other algorithms. High purity is easy to achieve when the number
of clusters is large. Therefore, we cannot use purity to trade off the quality of the clustering against
the number of clusters. A measure that makes this trade-off is the F measure [Steinbach et al. 2000]. We
penalize false negatives and false positives equally, i.e., we use the F1 measure (Figure 9 (right)). It is shown
that BCS performs poorly on the F1 measure, even much worse than LDA. The reason is that BCS focuses
on clustering duplicate or near-duplicate videos, which limits the cluster size and forces a considerable
number of semantically similar videos to be assigned to different clusters.
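The contrast between purity and F1 can be reproduced on a toy example. The sketch below uses a pair-counting reading of F1, in which a false positive is a pair of videos from different subtopics placed in one cluster and a false negative is a pair from the same subtopic split across clusters. This reading is an assumption: the exact variant used with [Steinbach et al. 2000] is not given in the text. Function names and labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def purity(labels_true, labels_pred):
    """Fraction of items that belong to the majority class of their cluster."""
    by_cluster = {}
    for t, p in zip(labels_true, labels_pred):
        by_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(labels_true)


def pairwise_f1(labels_true, labels_pred):
    """F1 over item pairs, weighting false positives and false negatives equally."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


truth = ["a", "a", "a", "b", "b", "b"]
singletons = [0, 1, 2, 3, 4, 5]  # one cluster per video
coarse = [0, 0, 0, 1, 1, 0]      # two clusters with one misplaced video

purity_singletons = purity(truth, singletons)
f1_singletons = pairwise_f1(truth, singletons)
purity_coarse = purity(truth, coarse)
f1_coarse = pairwise_f1(truth, coarse)
```

Putting every video in its own cluster yields perfect purity but an F1 of zero, which is the over-fragmentation effect described above.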
The quality of the cluster description is crucial to the usability of clustering-based video retrieval.
If a cluster cannot be described, it is presumably of no value to the user. BCS employs the cluster
centroid as the cluster representation, which lacks a real description and is of little use for helping
the user understand the cluster content and locate interesting videos. The cluster description
readability is evaluated as follows. The subtopic corresponding to each cluster, characterized by its 5 most
probable words, was shown to the participants together with the top 3 ranked videos in this subtopic. The
participants were asked to evaluate the cluster description readability in two aspects: “whether the topic
description itself is sensible, comprehensive and compact” (question 1) and “whether the topic description
is consistent with the representative videos” (question 2). For each question, participants rated
from 1 to 5, where 5 is best. The average ratings are shown in Figure 10. The proposed SShLDA shows
superiority in generating meaningful cluster descriptions, especially in generating sensible, comprehensive
and compact representations (question 1). We note that the ratings for query 5, Beijing Olympics,
are relatively low. In the retrieved video set of Beijing Olympics, diverse events or subtopics are involved,
for instance opening ceremony, game video, athlete interview, torch relay, etc. The discovered
ACM Transactions on Multimedia Computing, Communications and Applications, Vol. 7S, No. 1, Article 30, Publication date: October 2011.
Browse by Chunks: Topic Mining and Organizing on Web-Scale Social Media • 30:15
Fig. 10. Mean ratings of cluster description readability for (left:) Question 1 (right:) Question 2.
topic structure is sparse and less meaningful. Besides, some unrelated videos regarding issue of Tibet
are also included.
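The cluster descriptions shown to participants consist of the top 5 most probable words of each subtopic.
As a minimal sketch of how such a description could be read off a topic's word distribution, assuming the
distribution is available as a word-to-probability mapping (the function name, the vocabulary and the
probabilities below are hypothetical and are not taken from the experiments):

```python
def top_words(word_probs, k=5):
    """Return the k most probable words of a topic, most probable first."""
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]


# Hypothetical topic-word distribution, for illustration only.
toy_topic = {
    "ceremony": 0.30,
    "opening": 0.25,
    "torch": 0.04,
    "relay": 0.03,
    "athlete": 0.18,
    "interview": 0.12,
    "medal": 0.08,
}

print(top_words(toy_topic))
# ['ceremony', 'opening', 'athlete', 'interview', 'medal']
```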
For clustering-based video retrieval, the clustering is performed online, which requires a short
response time. We focus on the efficiency of the clustering algorithms and do not consider
the video acquisition time cost. We assume that the visual features used in BCS are extracted offline,
and we do not count the text preprocessing time. Table III illustrates the time cost of the clustering
algorithms (SShLDA1 denotes the clustering time cost only; SShLDA2 also includes the query
expansion time from a locally stored WordNet). Since BCS uses AP for clustering, it achieves a lower time
cost than the generative topic models. SShLDA is faster than hLDA because the incorporated
prior guides the seed words to be generated gradually from the root set and thus speeds up the
convergence process. We noticed that the computational cost increases dramatically when dealing with
large-scale web videos, and we will address this in future work.
30:16 • J. Sang and C. Xu
Fig. 13. Discovered subtopics from the video collection of seven queries from YouTube. (a) 9/11 attack, comparison between LDA,
hLDA and SShLDA. For SShLDA, we also present 2 video examples having the largest proportion associated with the topics;
(b) gay rights; (c) abortion; (d) Iraq war invasion; (e) Beijing Olympics; (f) Israeli Palestine conflicts; (g) US president election.
Clustering versus ranked lists. To compare the proposed clustering-based video retrieval with
existing video search engines, for instance YouTube, we design a specific task. The task assumes that the
participant is a news editor who wants to introduce a hot event or topic to users comprehensively and
must search for 10 Web videos. Participants complete the task with both YouTube and the proposed
clustering-based interface, in random order. After the task, participants are required to select one of
four options for each system. The options are very satisfied (4), somewhat satisfied (3), unsatisfied (2)
and very unsatisfied (1). The average ratings are shown in Figure 11. For five out of seven test queries,
participants prefer the proposed clustering-based method to the ranked-list-based search engine.
Clustering performance with respect to strength parameter μ. To analyze the influence of the strength
parameter μ on the clustering performance, we performed an experiment that evaluates the SRT while
varying μ ∈ [0, 1] in steps of 0.1. From the results illustrated in Figure 12, we make three observations:
1) As μ changes from 0 to 1, the retrieval performance of the different queries varies similarly, with query
1 varying slightly differently. A rough conclusion is that different datasets share a common pattern for
choosing μ. 2) The results deteriorate dramatically when μ = 1, which verifies our assumption that a
hard constraint is not practical. 3) While the results deteriorate as μ approaches 0 or 1, there is little
difference when μ ∈ [0.3, 0.6]. This means that the incorporation of prior knowledge is effective and
our algorithm does not heavily depend on the choice of the strength parameter μ. A chart of subtopics
is given in Figure 13.
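The sensitivity analysis above amounts to a one-dimensional grid sweep. The following is a minimal sketch
of such a sweep, assuming a lower SRT is better and that the clustering pipeline is wrapped in a callable
that maps μ to an SRT value. The names sweep_mu, stable_range and evaluate_srt are hypothetical, and the
curve used in the usage lines is a synthetic stand-in, not the measurements of Figure 12.

```python
def sweep_mu(evaluate_srt, step=0.1):
    """Evaluate evaluate_srt(mu) for mu = 0, step, ..., 1; return (mu, srt) pairs."""
    n_steps = round(1 / step)
    results = []
    for i in range(n_steps + 1):
        mu = round(i * step, 10)  # rounding avoids values such as 0.30000000000000004
        results.append((mu, evaluate_srt(mu)))
    return results


def stable_range(results, tolerance):
    """Smallest and largest mu whose SRT is within `tolerance` of the lowest SRT.

    Assumes lower SRT is better. The values in between are not checked, so the
    returned interval is only meaningful if the curve has a single basin.
    """
    best = min(srt for _, srt in results)
    good = [mu for mu, srt in results if srt - best <= tolerance]
    return min(good), max(good)


# Synthetic stand-in for the real pipeline (illustration only).
def toy_srt(mu):
    return (mu - 0.5) ** 2


results = sweep_mu(toy_srt)
print(stable_range(results, tolerance=0.011))
```

A wide interval returned by stable_range would correspond to the observation that the method does not
depend heavily on the exact choice of μ.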
7. CONCLUSIONS
In this article, we have presented a hierarchical topic model based framework for clustering-based web
video retrieval. Instead of showing a long ranked list of videos, we explore the hierarchical topic
structure in the retrieved video collection and present users with videos organized into semantic clusters.
Experiments demonstrate the effectiveness of the proposed method.
In the future, we will improve our current work along three directions. 1) Unrelated videos in
retrieved video collections affect the clustering performance. We will develop a noisy-subtopic-aware
hierarchical topic model to reduce the influence of noise as well as remove unrelated videos. 2) Some
summary videos cover various aspects of a query-related topic; for instance, an introductory video
describes 3 main viewpoints on the issue of abortion: pro-life, pro-choice and neutral. In this case,
the video cannot be assigned to any single subtopic. SShLDA needs to be extended to a multipath
assignment version: each document exhibits multiple paths through the tree, and the topic depth L can
vary from document to document. 3) So far our experiments have been based on textual analysis and
consider no visual information. Web videos carry rich visual content, and visual information provides
important clues for video clustering that should not be ignored. We are now working towards
incorporating visual information into the hierarchical topic modeling framework.
REFERENCES
BISHOP, C. M. 2006. Pattern Recognition and Machine Learning. Springer.
BLEI, D., NG, A., AND JORDAN, M. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022.
BLEI, D. M., GRIFFITHS, T. L., AND JORDAN, M. I. 2010. The nested Chinese restaurant process and Bayesian nonparametric
inference of topic hierarchies. J. ACM 57, 2, 1–30.
BLEI, D. M., GRIFFITHS, T. L., JORDAN, M. I., AND TENENBAUM, J. 2004. Hierarchical topic models and the nested Chinese restaurant
process. In Advances in Neural Information Processing Systems. MIT Press, 17–24.
CAI, D., HE, X., LI, Z., MA, W. Y., AND WEN, J. R. 2004. Hierarchical clustering of WWW image search results using visual, textual
and link information. In Proceedings of the ACM Multimedia Conference (MM). 952–959.
CAO, J., NGO, C.-W., ZHANG, Y.-D., ZHANG, D.-M., AND MA, L. 2010. Trajectory-based visualization of web video topics. In Proceed-
ings of the ACM Multimedia Conference (MM). 1639–1642.
CARPINETO, C., OSINSKI, S., ROMANO, G., AND WEISS, D. 2009. A survey of web clustering engines. ACM Comput. Surv. 41, 3, 1–38.
CHANDRAMOULI, K., KLIEGR, T., NEMRAVA, J., SVATEK, V., AND IZQUIERDO, E. 2008. Query refinement and user relevance feedback
for contextualized image retrieval. In Visual Information Engineering, Xian, China, 452–458.
CHEUNG, S. S. AND ZAKHOR, A. 2004. Fast similarity search and clustering of video sequences on the world-wide-web. IEEE
Trans. Multimedia 7, 3, 524–537.
CUTTING, D. R., PEDERSEN, J. O., KARGER, D. R., AND TUKEY, J. W. 1992. Scatter/gather: a cluster-based approach to browsing
large document collections. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development
in Information Retrieval (SIGIR). 318–329.
DRUCK, G., MANN, G., AND MCCALLUM, A. 2008. Learning from labeled features using generalized expectation criteria. In Pro-
ceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR).
595–602.
GONG, Z., CHEANG, C. W., AND U, L. H. 2005. Web query expansion by wordnet. In Proceedings of the International Conference on
Database and Expert Systems Applications (DEXA). Springer-Verlag, 166–175.
HINDLE, A., SHAO, J., LIN, D., LU, J., AND ZHANG, R. 2010. Clustering web video search results based on integration of multiple
features. In Proceedings of the International World Wide Web Conference (WWW), 1–21.
JING, F., WANG, C., YAO, Y., DENG, K., ZHANG, L., AND MA, W. Y. 2006. IGroup: web image search results clustering. In Proceedings
of the ACM Multimedia Conference (MM). 377–384.
KUMMAMURU, K., LOTIKAR, R., AND ETZIONI, O. 1998. Web document clustering: A feasibility demonstration. In Proceedings of the
21st International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 46–54.
LIU, J. 1994. The collapsed Gibbs sampler in Bayesian computations with application to a gene regulation problem. J. Amer.
Stat. Assoc. 89, 958–966.
LIU, L., RUI, Y., SUN, L.-F., YANG, B., ZHANG, J., AND YANG, S.-Q. 2008b. Topic mining on web-shared videos. In Proceedings of the
International Conference on Acoustics, Speech, and Signal Processing (ICASSP). 2145–2148.
LIU, L., SUN, L.-F., RUI, Y., SHI, Y., AND YANG, S.-Q. 2008a. Web video topic discovery and tracking via bipartite graph reinforce-
ment model. In Proceedings of the International World Wide Web Conference (WWW). 1009–1018.
MILLER, G. A., BECKWITH, R., FELLBAUM, C., GROSS, D., AND MILLER, K. 1990. Introduction to WordNet: An On-line Lexical Database.
Vol. 3. Oxford University Press.
RAMACHANDRAN, C., MALIK, R., JIN, X., GAO, J., AND HAN, J. 2009. VideoMule: a consensus learning approach to multi-label
classification from noisy user-generated videos. In Proceedings of the ACM Multimedia Conference (MM).
STEINBACH, M., KARYPIS, G., AND KUMAR, V. 2000. A comparison of document clustering techniques. In Proceedings of the ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining. 35–42.
TAN, P., STEINBACH, M., AND KUMAR, V. 2005. Introduction to Data Mining. Vol. 19. Addison Wesley.
TEH, Y. W., JORDAN, M. I., BEAL, M. J., AND BLEI, D. M. 2006. Hierarchical Dirichlet processes. J. Amer. Stat. Assoc. 101, 476,
1566–1581.
WU, X., HAUPTMANN, A. G., AND NGO, C.-W. 2007. Practical elimination of near-duplicates from web video search. In Proceedings
of the ACM Multimedia Conference (MM). 218–227.
YUAN, J., LUO, J., AND WU, Y. 2010. Mining compositional features from GPS and visual cues for event recognition in photo
collections. IEEE Trans. Multimedia 12, 7, 705–716.
YUAN, J., MENG, J., WU, Y., AND LUO, J. 2008. Mining recurring events through forest growing. IEEE Trans. Circuits Syst. Video
Techn. 18, 11, 1597–1607.
ZAMIR, O. AND ETZIONI, O. 1998. Web document clustering: A feasibility demonstration. In Proceedings of the 21st International
ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 46–54.