ARTICLE INFO

Article history:
Received 17 July 2015
Received in revised form 10 May 2016
Accepted 5 June 2016
Available online xxx

Keywords:
Big data
Social media
Opinion mining
Collective classification

ABSTRACT

Social media is a major platform for opinion sharing. In order to better understand and exploit opinions on social media, we aim to classify users with opposite opinions on a topic for decision support. Rather than mining text content, we introduce a link-based classification model, named global consistency maximization (GCM), that partitions a social network into two classes of users with opposite opinions. Experiments on a Twitter data set show that: (1) our global approach achieves higher accuracy than two baseline approaches and (2) link-based classifiers are more robust to small training samples if selected properly.

© 2016 Elsevier B.V. All rights reserved.

http://dx.doi.org/10.1016/j.im.2016.06.004
0378-7206/© 2016 Elsevier B.V. All rights reserved.

Please cite this article in press as: J. Li, et al., User opinion classification in social media: A global consistency maximization approach, Inf. Manage. (2016), http://dx.doi.org/10.1016/j.im.2016.06.004
G Model
INFMAN 2917 No. of Pages 10
online postings and informal writing styles. Furthermore, with a limited number of seed users as training data, content-based opinion classifiers often fail to achieve high classification accuracy.

Motivated by the variety dimension of big data, our study takes a different perspective to address the opinion classification problem. As we know, the network structure of users interacting with each other in social media contains rich information suggesting opinion groups. How opinions are formed, divided, and spread is often guided by the principles of homophily and social influence. In microblogging, as participants receive messages only from those whom they choose to follow, it is highly unlikely that they will follow a person whom they do not like or care about. Hence, the following relationships among users may suggest a layer of homophily that can be exploited to identify opinion camps. Another valuable information source for opinion mining is the retweeting relationships among users. According to studies investigating the motivation behind online activities, microblog participants usually retweet a message because they think it could help others make decisions [11]. They may also retweet messages to stay connected with others who receive the message or to show support to the sender of the message [12]. As a result, in the context of microblogging, people are less likely to retweet a message if they disagree with the content of the message or dislike the sender of the message.

On the basis of social influence theories, link structures among users could also suggest users' opinions. Most existing studies take a local view and classify each user's opinions separately. In such approaches, early errors may propagate to later classifications and thus lead to lower accuracy. We propose a novel method that takes advantage of the global structure of social interactions to address the opinion classification problem in a collective manner. In particular, we model the individuals involved in a topic discussion as a graph according to their communication linkages. Our conjecture is that this graph reflects stabilized social relations where people with common opinions tend to interact more and people with opposite opinions tend to communicate less. On the basis of this conjecture, we propose a global consistency optimization algorithm to collectively classify the opinions of users involved in debates in social media. Our research shows that our proposed opinion classifier is not only accurate but also robust to the size of seed users as long as the seeds are chosen appropriately.

We consider our study a novel and significant contribution to the design of analytical solutions in electronic commerce. Our work focuses on analyzing the large volume of ever-changing opinion data generated by users in social media. In particular, our proposed approach takes advantage of the variety of social media data by leveraging the link relationships among users to achieve highly accurate classification of opinions. Furthermore, the global maximization algorithm shows significant robustness to the size of the training set. This strength of our approach can help mitigate a common problem in big-data analytics, that is, the limited amount of labeled data. Our research has significant implications for both researchers and practitioners of big-data analytics in electronic commerce.

2. Literature review

Opinion mining is an important problem in text mining [10], which examines the subjectivity and polarity of text. In the context of this research, we are more interested in the classification of opinion polarity, that is, positive or negative. In existing literature, related work on opinion classification can be divided into content-based and link-based approaches.

2.1. Content-based approaches

Opinion classification is often based on textual contents, which can take two approaches: lexicon-based and learning-based methods. Lexicon-based approaches use lexicons and predefined rules to annotate sentiments of text [13]. For example, Demers and Vega [14] used a lexicon approach to measure the tone of news. Learning-based methods apply machine-learning techniques to linguistic features, including the lexicon features if available, to build opinion classification models. For example, Yang et al. [15] applied association rules and a Naive Bayes (NB) classifier to classify the sentiments of online consumer reviews on e-commerce websites.

With the rise of social media, opinion mining is widely applied to social media applications such as Twitter and Facebook. Most existing studies used the textual contents to resolve this problem. Mostafa [16] used a lexicon-based approach to classify brand sentiments. Li and Xu [17] constructed a rule-based system to detect the events in microblog posts that cause emotional effects. In the machine-learning approach camp, Sayeedunnissa et al. [18] took a classic NB approach with the bag-of-words model and information gain-based feature selection to classify Twitter sentiments. Akaichi et al. [19] and Hamouda and Akaichi [20] used support vector machine (SVM) and NB approaches to classify Facebook status sentiments. Myslin et al. [21] compared multiple classic machine-learning approaches to classify Twitter sentiments on tobacco products. It is worth noting that Hassan et al. [22] proposed a bootstrapping ensemble approach to address the Twitter sentiment classification problem. The approach provided more accurate and balanced predictions and built sentiment time series that better reflect events eliciting strong sentiments from users. There were also studies combining machine-learning with lexicon-based approaches to classify sentiments [23,24]. In the machine-learning approach, clustering methods were also used to analyze social media opinions. For example, Paltoglou and Thelwall [25] proposed an unsupervised lexicon-based approach that estimates the level of emotional intensity contained in text. Feng et al. [26] used PLSA to cluster blogs with sentiments as a latent variable, which was able to find sentiment-coherent groups.

2.2. Link-based approaches

Social media provides a new platform for users to communicate and interact with each other. When a user posts a message online to express his/her opinion, it can incur a series of responses such as compliments, praises, disagreements, and even attacks from other users. Each of these responses can lead to even more responses in a spreading manner. Such response relationships form a social network of opinionated users. The linkage (i.e., relationship) information has become new evidence that can help distinguish users' opinions. Because users are connected to each other in a social network and their class labels are intercorrelated, opinion mining has become a collective classification problem [27].

Many studies have tried to use the linkage information embedded in social networks for classifying users' opinions in social media. However, they view the semantics of linkages in social networks differently. In an early effort of link-based opinion classification, Agrawal et al. [28] analyzed a social network formed by "respond-to" relationships in newsgroups. By assuming that a "respond-to" relationship represents disagreement, they use a max-cut graph-partitioning algorithm to break highly weighted disagreeing edges and separate users into two groups. Their assumption of "respond-to" indicating disagreement may not hold in other popular social media platforms. On Twitter, following or retweeting someone is more likely to mean that the users agree
with each other or share a similar opinion. Therefore, more studies on link-based opinion mining are based on the homophily assumption [29], that is, the phenomenon of "birds of a feather flock together" [30]. Users who are connected by a mutual relationship are more likely to share common opinions. In the context of opinion classification on a social media platform like Twitter, researchers have investigated users' mutual relationships in a variety of forms, including "follow" [31–34], "mention" [32,35], or "retweet" [35–37]. In particular, using @ mentions, as a way of creating connections on Twitter, may also indicate a desire to pay attention (e.g., to information of interest). Among the homophily connections, several studies suggested that the follow links had little positive impact on classification accuracy [31]. Conover et al. [35] found that combining attention links with follow links is superior to using follow links alone. Wong et al. [36] also argue that following (a tweeter) is not a robust indicator of approval or agreement on political opinions. A user may follow two sources on Twitter with opposite political stances to obtain a more unbiased, comprehensive view. A follow link may exist as a stale edge in the Twitter following network, simply because a user forgets to unfollow a prominent tweeter whom he/she is no longer interested in. By contrast, retweeting is often an explicit act of approval and therefore stronger evidence for agreement in opinions.

On the basis of the homophily assumption, a variety of analytical techniques have been developed for opinion classification. One of the most popular techniques used in several studies is label propagation (LP), which analyzes the labels in a node's neighborhood and tries to assign a label to each node in an iterative manner. Ren et al. [38] used this method for determining class labels of customer reviews based on a graph consisting of links representing similarity between nodes. In order to predict the political alignment of Twitter users, Conover et al. [35] studied two types of communication networks, based on mention edges and retweet edges, and proposed a solution of community detection using an LP algorithm [39]. Speriosu et al. [31] developed an opinion polarization approach by combining both lexical links (e.g., text, hash tags, and emotions) and following links in a graph. A semi-supervised LP algorithm was used for classification with different seeding methods. In addition, for tweet-level opinion classification, Rajadesingan and Liu [37] made an assumption that two tweets being retweeted by the same users within a short time period are likely similar in terms of opinion. They introduced a graph-based algorithm, namely ReLP (retweet label propagation), that starts with a set of seeds and iteratively propagates labels to similar tweets. Tan et al. [32] proposed an opinion polarization approach by incorporating social network information such as follow and mention linkages. Their graph-based classification models use loopy belief propagation to infer user-level sentiment labels. Rabelo et al. [33,34] applied a relational classifier combined with a relaxation labeling algorithm on a Twitter follower network to collectively predict political polarity of users. All the aforementioned studies showed that classification models based on links in a graph outperformed traditional content-based opinion classifiers.

Most of these existing collective opinion classifiers, such as LP [35] or relaxation labeling [33,34], take a local view and label each node based on the labels of its direct neighbors. Such a labeling process iterates in the graph until the solution converges, which may be highly computationally expensive. More importantly, propagating labels based on dependencies in local neighborhoods may not necessarily lead to a globally optimal solution.

In optimization theory and graph theory, minimum cut (min-cut) is a combinatorial optimization problem that partitions the vertices of a graph into two disjoint subsets such that the total capacity of the removed edges is minimum [40]. The min-cut method and its dual, max-flow, have been widely used in a number of real-world applications such as project scheduling [41], gene function prediction [42], and image segmentation [43]. Unlike LP, the min-cut method takes a global view and finds an optimal partition of the entire graph to reach maximum consistency in the two subgraphs. To the best of our knowledge, no prior study has used the min-cut approach to classify users' opinions on Twitter.

2.3. Research questions

Among the state-of-the-art techniques for opinion classification, content-based approaches (e.g., NB classifiers) [18–20] ignore the rich information of social linkages among users, while link-based approaches (e.g., LP) [31,35,39] exploit linkage information but take a local view. Our study, based on the homophily assumption and taking a global view, is aimed at modeling users' opinions collectively in the entire structure of the network. In this study, we investigate whether such a global optimization approach can outperform the state-of-the-art opinion classifiers in terms of accuracy.

Furthermore, traditional content-based opinion classification methods generally require a training data set with a reasonably large number of labeled instances. Data annotation is known to be a tedious task that requires a large amount of time, effort, and domain expertise. Given the enormous volume of data on social media platforms, how many data instances (e.g., tweets) do we need to review and label for training to guarantee the potential capability of the classifier? We are interested in determining the robustness of opinion classifiers to the training set size and the method of choosing the most important data instances for training opinion classifiers. In particular, this study is aimed at addressing the following research questions:

Q1. Can a collective classifier based on global optimization outperform existing models for opinion classification?
Q2. How robust are opinion classifiers to different sizes of training data sets?
Q3. What is the best strategy to choose seed users as training data for opinion classification?

3. Model

Debates in social media are often a dynamic process involving heated discussion between two sides. The involved users, however, may only interact with a portion of users to express their opinion. In circumstances where one feels that his/her opinion is fully expressed, he/she may not generate much content. In such circumstances, predicting each user's opinion by analyzing his/her postings and relationships separately, as in most existing work, may not be the best solution. Rather, we introduce a novel approach that takes into account the interdependencies between users and classifies user opinions in a collective manner. Our approach builds an undirected graph to represent users involved in an online debate. On the basis of the homophily assumption, we model opinion classification as a global consistency maximization (GCM) problem [42]. Our collective opinion classifier can find an optimal solution by partitioning the graph into two components, each representing one side of the debate.

3.1. Problem formulation

We represent a social network as an undirected graph: G(V, E). In such a graph, V = {v_1, v_2, ..., v_n} is a set of n vertices/nodes, in which each node represents an online user. E = {e_ij} is a set of edges/links, in which each link e_ij represents a certain relationship(s) between two users i and j. The users discuss certain topics. For a specific topic, a user i may hold an opinion state x_i. We assume only two possible opinion states in this study and set x_i = 1 if user i is a
3.2. Global consistency maximization

On Twitter, two nodes i and j can have different relationships/interactions, for example:

• Following: user i follows user j, which indicates that i is interested in j.
• Retweeting: user i retweets messages by user j, which often indicates that i is distributing j's opinion, possibly adding his/her own opinion on the topic.
• Commenting/replying: user i comments on messages posted by user j, which indicates that i is expressing an opinion to either agree or disagree with j.
• Liking/favoriting: user i likes messages by user j, which often indicates that i agrees with j.

Each of these relationships between i and j can be regarded as evidence of (dis)agreement between the two users. All the evidence can be aggregated to an agreement score w_ij (associated with each edge e_ij) of their opinions on the topic. It is important to note that w_ij can be either positive (agreement) or negative (disagreement). In this study, we focus only on retweeting relationships. Retweeting is often considered an explicit act of approval and a more robust indicator of agreement than other relationships such as following [36]. Like most related work [35–37], our approach is based on the homophily assumption, that is, when a user i retweets another user j, they tend to share the same opinion on this topic. Therefore, the agreement score w_ij between the two users is proportional to the number of retweets between them (either i retweeting j or j retweeting i).

For each pair of users i and j with opinion states x_i and x_j, we define w_ij x_i x_j as a consistency score of the edge connecting i and j. As w_ij > 0 indicates they are likely to agree with each other, we generally want to set x_i and x_j to be the same (either 1 or −1). Accordingly, if w_ij < 0, we generally expect the two to disagree, and x_i and x_j to have opposite values. Thus, at the social network level, we argue that the optimal opinion state assignment should provide us the highest overall consistency across the network:

\max \sum_{i} \sum_{j} w_{ij} x_i x_j \quad \text{s.t.} \quad x_i, x_j \in \{1, -1, 0\}

In order to solve this optimization problem, we construct a new directed graph H from the undirected graph G (which has users with unknown opinion states). Each node of H corresponds to a node of G. H also contains two new nodes: a source node s ("for") and a sink node t ("against"). Unlike edges in G, each edge in H is directed. Fig. 1 illustrates how to transform an undirected graph G (Fig. 1(a)) into a directed graph H (Fig. 1(b)) by creating six different types of edges based on the following rules:

1. For each node i in G with x_i = 1, we create a directed edge from node s to node i with w_si = ∞.
2. For each node i in G with x_i = −1, we create a directed edge from node i to node t with w_it = ∞.
3. For each edge in G connecting node i and node j such that x_i = 1 and x_j = 0, we create a directed edge from node i to node j with edge weight w_ij copied from G.
4. For each edge in G connecting node i and node j such that x_i = 0 and x_j = −1, we create a directed edge from node i to node j with edge weight w_ij copied from G.
5. For each edge in G connecting node i and node j such that x_i = 0 and x_j = 0, we create a directed edge from node i to node j and a directed edge from j to i, with w_ij = w_ji copied from G.
6. The edges in G that are not incident on a node in state 0 are ignored.

Fig. 1. Transforming an Undirected Graph G into Directed Graph H. [Panel (b): directed graph H with source s and sink t. Node legend: for, against, to be determined.]

With this new directed graph H, classifying opinions of users is converted to a min-cut problem. An s-t cut in graph H is a partition of the nodes of H into two sets S and T, where S contains the source node s and T contains the sink node t. An edge (u, v) crosses the cut if u lies in S and v lies in T. The weight of the cut is the sum of the weights of the edges crossing the cut. Because a high weight represents a high degree of agreement between two users, separating two highly agreeing nodes on two sides is considered a reduction in consistency. Therefore, our objective is to find the s-t cut C with the smallest weight in H. For each node i in S, we set x_i = 1; for each node i in T, we set x_i = −1. Because we have set the weights of edges incident on s and t (edges of types 1 and 2) to be infinity, they will not be selected to participate in C. Hence, the only edges in C are those incident on nodes with state equal to 0. Each edge in C of types 3 or 4 corresponds to one inconsistent edge in G, an edge between two users who retweeted one another but have opposite opinions. In each pair of edges of type 5, at most one edge can belong to C; according to the definition of the s-t cut, the edge in the pair directed from a node in T to a node in S does not belong to the cut. Therefore, each edge in C corresponds to exactly one inconsistent edge in G in a one-to-one manner. Hence, the cut C with the smallest weight gives the optimal state assignment that minimizes the total weight of inconsistent edges in G.

In the example in Fig. 1, suppose the weights of all edges not incident on s or t are 1 and the min-cut in H is the edge connecting the node in state 0 to the one in state 1. Thus, the algorithm should assign a state of 1 to both nodes in state 0.
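The construction of H and the min-cut step described in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes non-negative retweet-count weights, uses a plain Edmonds-Karp max-flow written from scratch, and the names gcm_min_cut, edges, and seeds are hypothetical. Unlabeled users that end up unreachable from the source are placed on the "against" side, which is also what happens by default to users with no path to any seed.

```python
from collections import defaultdict, deque

INF = float("inf")


def gcm_min_cut(edges, seeds):
    """Label unlabeled users by a minimum s-t cut (illustrative sketch).

    edges: {(i, j): w_ij} undirected agreement scores, assumed >= 0.
    seeds: {user: +1 or -1}; every other user is in state 0.
    Returns {user: +1 or -1}.
    """
    s, t = "_source", "_sink"  # hypothetical names for the two extra nodes
    cap = defaultdict(lambda: defaultdict(float))
    nodes = set(seeds)
    for i, j in edges:
        nodes.update((i, j))

    # Edge types 1 and 2: infinite-capacity edges from s to "for" seeds
    # and from "against" seeds to t.
    for v, x in seeds.items():
        if x == 1:
            cap[s][v] = INF
        elif x == -1:
            cap[v][t] = INF

    # Edge types 3 to 6.
    for (i, j), w in edges.items():
        xi, xj = seeds.get(i, 0), seeds.get(j, 0)
        if xi == 0 and xj == 0:          # type 5: both directions
            cap[i][j] += w
            cap[j][i] += w
        elif xi == 1 and xj == 0:        # type 3
            cap[i][j] += w
        elif xi == 0 and xj == 1:        # type 3, endpoints swapped
            cap[j][i] += w
        elif xi == 0 and xj == -1:       # type 4
            cap[i][j] += w
        elif xi == -1 and xj == 0:       # type 4, endpoints swapped
            cap[j][i] += w
        # type 6: edges between two seeds are ignored

    def reachable_from_source():
        """BFS over edges with positive residual capacity."""
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    # Edmonds-Karp: augment along shortest paths until t is unreachable.
    while True:
        parent = reachable_from_source()
        if t not in parent:
            break
        flow, v = INF, t
        while parent[v] is not None:
            flow = min(flow, cap[parent[v]][v])
            v = parent[v]
        if flow == INF:
            raise ValueError("a 'for' seed is directly tied to an 'against' seed")
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= flow
            cap[v][u] += flow
            v = u

    # Nodes still reachable from s form S (state 1); the rest form T (state -1).
    return {v: seeds.get(v, 1 if v in parent else -1) for v in nodes}


# Toy chain a - u - v - b with one seed on each end.
toy_edges = {("a", "u"): 3, ("u", "v"): 1, ("v", "b"): 3}
toy_labels = gcm_min_cut(toy_edges, {"a": 1, "b": -1})
```

In the toy chain the cheapest cut removes the weight-1 edge between u and v, so u joins the side of seed a and v joins the side of seed b. For real data a library max-flow routine would likely be preferable to this hand-written loop.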
Given this data set, we build a graph G that represents the social network of all users. In this graph G, each vertex represents a user and each edge represents a relationship between two users. This study only considers the retweet relationship between users. Thus, an edge e_ij between vertices i and j indicates user i retweeted j and/or j retweeted i. The weight of e_ij, w_ij, is defined as the number of retweets between i and j. It is worth noting that 215,669 out of the 491,860 users are isolated vertices in the graph, because they did not retweet or were not retweeted by others. Table 1 provides an overall description of this Twitter data set.

Table 1
Data description of the Twitter data set.

# users (vertices)                491,860
# isolated users                  215,669
# connected users                 276,191
# visibly opinionated users       262 (84 for; 178 against)
# moderately opinionated users    396 (276 for; 120 against)
# of postings                     916,171
# of original tweets              505,637
# of retweets                     410,534
# edges                           384,331

4.2. Baseline methods and implementation

For comparison, we develop two baseline methods for classifying users of opposing opinions.

Among the link-based opinion classifiers, LP is one of the most popular and effective techniques in several studies [31,35,39]. In our experiments, we implement an LP algorithm for opinion classification as a second baseline. Like GCM, LP is based on the same homophily assumption, that is, a user tends to share the same opinion as his/her neighbors. Fig. 2 shows the pseudo-code for the LP algorithm. The algorithm starts with graph G, including a subset of seeds V_v, that is, visibly opinionated users, whose class labels are known. In each iteration, the algorithm traverses all vertices in G in a random sequence. Each vertex i is assigned the majority label of its neighbors. In order to determine the majority label, we take into account the weight w_ij of the edge between vertices i and j, which can be calculated in the same manner as in GCM. In this study, we only consider the number of retweets between two users. The

Input: G including labeled vertices V_v
Output: G with all vertices labeled
Procedure:
  new ← TRUE
  while (new)
    new ← FALSE
    S ← random sequence of all vertices in G
    for each vertex i in S:
      l ← the majority label of i's neighbors
      if i's label x_i != l
        x_i ← l
        new ← TRUE

Fig. 2. Pseudo-code of a Label Propagation Algorithm.
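The label propagation loop in Fig. 2 could be written roughly as below. This is a hedged sketch rather than the authors' code. It makes several assumptions that Fig. 2 does not state: seed labels are held fixed, unlabeled neighbors (state 0) do not vote, ties leave the current label unchanged, and a max_iters cap guards against oscillation. The names label_propagation, edges, seeds, and max_iters are hypothetical.

```python
import random
from collections import defaultdict


def label_propagation(edges, seeds, max_iters=100, rng=None):
    """Weighted majority-vote label propagation (illustrative sketch).

    edges: {(i, j): w_ij} undirected retweet counts.
    seeds: {user: +1 or -1} labels of the visibly opinionated users.
    Returns {user: +1, -1, or 0 if never reached by any label}.
    """
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable

    # Symmetric weighted adjacency lists.
    adj = defaultdict(dict)
    for (i, j), w in edges.items():
        adj[i][j] = adj[i].get(j, 0) + w
        adj[j][i] = adj[j].get(i, 0) + w

    labels = {v: seeds.get(v, 0) for v in set(adj) | set(seeds)}

    changed, iters = True, 0
    while changed and iters < max_iters:
        changed = False
        iters += 1
        order = sorted(labels)   # assumes user ids are mutually comparable
        rng.shuffle(order)       # "random sequence of all vertices in G"
        for i in order:
            if i in seeds:       # assumption: seed labels are never overwritten
                continue
            # Weighted vote: positive sum means "for", negative means "against".
            score = sum(w * labels[j] for j, w in adj[i].items())
            new_label = 1 if score > 0 else -1 if score < 0 else labels[i]
            if new_label != labels[i]:
                labels[i] = new_label
                changed = True
    return labels


# Toy chain a - u - v - b with one seed on each end.
lp_edges = {("a", "u"): 3, ("u", "v"): 1, ("v", "b"): 3}
lp_labels = label_propagation(lp_edges, {"a": 1, "b": -1})
```

On this toy chain each unlabeled user is dominated by its heavier edge to a seed, so the result does not depend on the random visiting order.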
Table 3
Prediction performance of the three opinion classification methods.

Method  Accuracy (%)  Precision (%)  Recall (%)  F-measure (%)  Precision (%)  Recall (%)  F-measure (%)
CC      72.47         91.54          66.67       77.15          52.82          85.83       65.40
LP      93.18         94.01          96.74       95.36          92.73          85.00       88.70
GCM     93.43         94.35          96.74       95.53          92.79          85.83       89.18
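As a quick sanity check on how the columns of Table 3 relate, the sketch below recomputes an F-measure from a precision and recall pair. It assumes the standard harmonic-mean definition F = 2PR / (P + R); the helper name f_measure is hypothetical. The inputs are the CC row's first precision and recall values (91.54 and 66.67).

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# CC row of Table 3: precision 91.54, recall 66.67, reported F-measure 77.15.
cc_f = f_measure(91.54, 66.67)
```

Under this assumption the recomputed value agrees with the reported 77.15 to two decimal places, which suggests each precision, recall, and F-measure triple in the table belongs to one class.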
one with strong opinions and high influence on a particular topic.
4. Rank by the number of times being retweeted and retweeting others (degree): In a directed graph that shows the retweeting network among users, this number is the degree (in-degree + out-degree) centrality score of each node.

For each ranking method, we compare the performance of the three opinion classifiers (CC, LP, and GCM) by varying the percentage of top-ranked users included in the training set. In particular, when the percentage of top users is 100%, all visibly opinionated users were included in the training set. Then, we gradually decrease the percentage until only the topmost user from each class is included. Figs. 3–5 show the average precision, recall, and F-measure scores of robustness tests for the three classifiers.

As shown in Fig. 3, as the size of the training set decreases, the performance of CC is initially stable and starts decreasing when the percentage reaches about 25%. The irregular shapes of the curves near the left end (percentage < 10%) indicate that the classifiers fail due to insufficient training data and therefore assign all or most instances in the test set to only one of the two classes. Among the four ranking methods, the "rank by # tweets" method seems to be more robust and can maintain performance until the training set size is below approximately 10%. This is because the selected users are the most active ones who posted the most tweets, which provides sufficient data instances to achieve the best performance possible for this classifier.

Figs. 4 and 5 show the results for the two link-based methods, LP and GCM. For all four ranking methods, with only minor fluctuations, LP shows consistently high performance for classification, even when only a very low percentage of data instances are included in the training set. In particular, for "rank by # retweeted" and "rank by degree," even if only one positive and one negative instance are available in the training set, due to the high connectivity of the graph, LP can still successfully propagate the correct class labels to nodes and achieve almost the same performance. For "rank by # tweets" and "rank by # followers," LP also maintains good classification performance until the training set is reduced below 1.5% (i.e., two instances for each class) and 1% (one instance for each class), respectively.

GCM achieves slightly better results than LP when 100% of the opinionated users are used as training data. However, the robustness of GCM against the training data size varies for different ranking methods used to select the seed users. In particular, for "rank by # tweets" and "rank by # followers," GCM's
classification performance starts dropping drastically when the training set's size is below 75%. It is highly likely that some key nodes that are critical to graph partition in GCM classifiers were excluded from the training set due to a low number of tweets and/or followers. Therefore, GCM fails to work for the remaining nodes in the graph. Nevertheless, for the other two ranking methods, "rank by # retweeted" and "rank by degree," GCM's performance is a lot more robust. When the training set percentage is reduced from 100% to 2%, GCM constantly gives exactly the same classification results (best among all: 96.96% precision, 95.63% recall, and 96.29% F-measure). Only when the training set is reduced below 2% (i.e., less than two positive and two negative instances) does the GCM classifier fail to work.

6. Discussion

Our experimental study helps us answer the three research questions. The results also provide us with several interesting insights into the interaction dynamics of online debates on social media such as Twitter.

The main challenges for opinion mining on Twitter data include the short length (i.e., 140-character limit) and informal style. Given limited word features and high variations in style, traditional CC showed poor predictive capability for differentiating users with opposite opinions. People may mention different entities to express their opinions. However, their precise words may not clearly reveal which side they take in a debate. It is important to consider that a debate must involve people from two sides arguing against each other. On each side, remarks of opinionated leaders can be widely spread by supporters via actions such as "retweet" and "like" on a social media platform. Such interaction and communication results in a high degree of connectivity in social networks. These linkages between users become an additional and evidently more reliable source for opinion classification. Link-based classifiers, such as LP and GCM, consider that users' opinions are not independent, but interrelated. Rather than predicting the opinion of each individual user separately, LP and GCM try to collectively classify users based on their linkage structures. In a social network involving two sides debating, linkages may carry a variety of meanings regarding the relationship between the two connected users. In our study, we design the classification algorithm based on a highly simplified assumption that a retweeter tends to share the same opinion as the retweetee. Even under such a strong assumption, the classifiers can give surprisingly high accuracy.

Moreover, the results of robustness tests further strengthen our conclusion on the superiority of link-based opinion classifiers over CC, particularly when only a small training set is available. As shown in Fig. 3, CC shows fine robustness against reduced training set size. Particularly for the "rank by # tweets" method, the classification accuracy (72%) did not drop much until the training set was reduced to approximately 10%. This might be partly attributed to the high degree of information redundancy exhibited in tweets and retweets, especially during a short time period of debate on one specific topic. Even visibly opinionated users do not always post original tweets. They retweet, too. As long as we have enough content (tweets and retweets) from the most active users, CC can still show appreciable performance. By contrast, for link-based classifiers, if chosen wisely (e.g., based on "rank by # retweeted" and "rank by degree"), only a handful (e.g., one or two instances from each class) of visibly opinionated users could be sufficient as training data instances to achieve the highest classification accuracy (93%). Nevertheless, the link-based classifier, GCM, could become unpredictable if some critical annotated nodes are left out (e.g., based on "rank by # tweets" and "rank by # followers"). Therefore, the connectivity to other nodes in the graph is key to robust performance of link-based
Fig. 5. Robustness Test for GCM (performance vs. % of training data used).
classifiers. During an online debate, opinion leaders may not necessarily be the most active tweeters or well-known celebrities with the most followers. Rather, given a reasonable size of followers, someone who posts tweets with sharp opinions or brilliant remarks, as long as they get more retweets, can stand out as real leaders of public opinion. Successfully identifying these critical opinionated users could guarantee a high accuracy for opinion classification with only minimum effort to create a training data set. On the contrary, if inappropriate ranking methods were chosen, one may fail to identify the real critical opinionated users, which can lead to low accuracy of opinion classification. Furthermore, identifying the critical users and further analyzing their content and behaviors can provide better insights into the key arguments and political appeal.
7. Conclusions and future directions
In this study, we propose a GCM algorithm to address the collective opinion classification problem. Our algorithm collectively assigns users' states in Twitter discussions that match their retweeting behaviors to others' postings. In experiments on a real-world data set, the proposed approach is significantly more accurate than the state-of-the-art approach. Further analysis shows that our proposed algorithm performs exceptionally well even with only a handful of training users.
Classifying user opinions is a critical challenge for e-commerce companies. This study provides a novel opinion classification solution that addresses the volume, velocity, and variety issues of big social media data. By analyzing a large amount of opinion-related data in microblogging sites, our solution based on GCM exploits the social interactions among users for classification. It shows high accuracy and robustness to the limited size of labeled data for training.
This study has significant practical and theoretical implications for big data analytics in e-commerce. For practitioners, a more accurate classifier that requires a smaller training data set can significantly reduce the effort needed to plan and conduct marketing campaigns. Facing a social network of competitive opinions, companies need to develop different marketing strategies for populations that are for or against their products. By conducting opinion classification on a topic at multiple time points, companies can obtain a dynamic view of how opinions evolve over time so that they can react and adjust their strategies accordingly. In the big-data era, such data-driven analytics will become increasingly important to commerce. Moreover, our results provide support for further theoretical studies on the global consistency of social networks in social media analytics. As we have argued, global consistency may be accounted for by the joint force of network connectivity and social influence. Unlike most previous research focused on individual-level influences, our study indicates the existence of a global-level self-organization phenomenon, which is worth further investigation.
In the future, we will explore the following directions to extend this study. (1) We will investigate the combination of content data and linkage data for collective opinion classification. (2) We will investigate social network structures other than two-sided debates on social media and develop new opinion classification algorithms. (3) We will continue studying factors (e.g., topics and stages of events) that can affect the performance of opinion classification in more social media data sets and contexts. Our ultimate goal is to build an effective approach that is scalable to incorporate insights from social relationships for opinion mining.
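The seed-selection strategies compared in the robustness tests ("rank by # retweeted" and "rank by degree") can be illustrated with a minimal sketch. This is not the implementation used in the experiments. It assumes a hypothetical edge-list format of (retweeter, original author) pairs, and the function names rank_users and pick_seeds are illustrative only.

```python
from collections import Counter, defaultdict


def rank_users(retweets, method="retweeted"):
    """Rank users found in a retweet edge list (hypothetical format).

    retweets: iterable of (retweeter, original_author) pairs.
    method:   "retweeted" scores a user by how often they were retweeted;
              "degree" scores a user by the number of distinct neighbours
              in the undirected retweet graph.
    Users with no score under the chosen method are omitted.
    """
    if method == "retweeted":
        scores = Counter(author for _, author in retweets)
    elif method == "degree":
        neighbours = defaultdict(set)
        for retweeter, author in retweets:
            neighbours[retweeter].add(author)
            neighbours[author].add(retweeter)
        scores = Counter({user: len(adj) for user, adj in neighbours.items()})
    else:
        raise ValueError(f"unknown ranking method: {method}")
    # Highest score first; ties broken by user name for a deterministic order.
    return sorted(scores, key=lambda user: (-scores[user], user))


def pick_seeds(ranking, labels, per_class=1):
    """Take the top-ranked annotated users of each class as training seeds.

    labels: dict mapping an annotated user to a class label; users that
            are not annotated are skipped.
    """
    seeds = {}
    taken = Counter()
    for user in ranking:
        label = labels.get(user)
        if label is not None and taken[label] < per_class:
            seeds[user] = label
            taken[label] += 1
    return seeds


# Toy data (invented for illustration): "a" and "b" are widely retweeted.
toy_retweets = [("u1", "a"), ("u2", "a"), ("u3", "a"),
                ("u4", "b"), ("u5", "b"), ("u1", "u2")]
toy_labels = {"a": "pro", "b": "con", "u1": "pro", "u4": "con"}
toy_seeds = pick_seeds(rank_users(toy_retweets, "retweeted"), toy_labels)
```

On the toy data, ranking by times retweeted places the two widely retweeted users first, so one seed per class is enough to cover both sides of the debate. A ranking that ignores the link structure could instead select annotated users who are weakly connected to the rest of the graph.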
Jiexun Li is an Assistant Professor in the Department of Decision Sciences, College of Business & Economics, at Western Washington University. He earned his Ph.D. in MIS from the Eller College of Management at the University of Arizona, M.S. and B.S. in MIS at Tsinghua University in China. His research interests include data mining, business analytics, social media analytics, and health informatics. His research has appeared in journals including JMIS, DSS, IEEE Transactions, JASIST, JAIS, Bioinformatics, CACM, ESA, ISF, and so on.
Xin Li is an Assistant Professor in the Department of Information Systems at the City University of Hong Kong. He received his Ph.D. in Management Information Systems from the University of Arizona. He received his Bachelor's and Master's degrees from the Department of Automation at Tsinghua University, China. His work has appeared in the MISQ, JMIS, INFORMS JOC, DSS, JASIST, ACM and IEEE Transactions, among others.
Bin Zhu is an Associate Professor of Business Information Systems at Oregon State University. She earned her Ph.D. in Management Information Systems from University of Arizona. Her current research interests include business intelligence, information analysis, social network, human-computer interaction, information visualization, computer-mediated communication, and knowledge management systems. Her work has appeared in ISR, DSS, JASIST, IEEE Transactions, D-Lib Magazine, and so on.