
Title: Parallelized Analysis of Opinions and their Diffusion in Online Sources

Master Thesis at the Chair for Computer Science

Supervisor: Prof. Dr. Gottfried Vossen
Tutor: Dr. rer. nat. Jens Lechtenbörger
Presented by: Dominic Steffen, Münster

Date of Submission: 2013/01/08


Content
Figures
Tables
Abbreviations
1 Introduction
1.1 Concepts and Common Terms
1.2 Discovery Process
1.3 Aim of this thesis
1.4 Structure of this thesis
2 Sentiment Analysis
2.1 Overview
2.2 Sentiment Classification
2.3 Supervised Approaches
2.4 Unsupervised Lexicon Induction
2.5 Evaluation of classification techniques
3 Opinion Extraction
3.1 Opinion-Oriented Information Extraction
3.2 Context Extraction
3.3 Sources
3.4 Data Collection
3.5 Preprocessing
3.6 Opinion Mining Framework
4 Tracking, Diffusion, and Summarization
4.1 Overview
4.2 Diffusion
4.3 Summarization
4.4 Visualization
5 Towards parallel execution
5.1 Overview
5.2 The MapReduce Framework
5.3 Hadoop-based Parallel Processing
5.4 Parallelized Opinion Mining
6 System Design
6.1 Overview
6.2 Process Execution
6.3 Collection
6.4 Analysis
6.5 Presentation
7 Evaluation
7.1 Quality Evaluation
7.2 Efficiency Evaluation


8 Concluding Remarks
References
Appendix


Figures
Figure 1: Separating Hyperplane (Prabowo & Thelwall, 2009, p. 151)
Figure 2: Bipolar structure in WordNet for the antonyms "fast" and "slow" (M. Hu & Liu, 2004b, p. 173)
Figure 3: Simplified extraction process
Figure 4: Subject and Subject-Feature Extraction Workflow (Binali et al., 2009)
Figure 5: POS Annotation
Figure 6: Classifier-based filter
Figure 7: Sentiment classification with subjectivity filtering
Figure 8: Opinion Mining Framework
Figure 9: Common generic process steps
Figure 10: Mood-Map (Yerva et al., 2012, p. 2499)
Figure 11: Rose Plot (Gregory et al., 2006, p. 26)
Figure 12: Graphs illustrating communication patterns and sentiment orientation
Figure 13: Comparative Relation Map (Xu et al., 2011, p. 753)
Figure 14: Trend plots (Moodgrapher) (Mishne & Rijke, 2006a, p. 153)
Figure 15: Visualization technique of Wanner et al. (Wanner et al., 2009, p. 4)
Figure 16: MapReduce Flow
Figure 17: Infrastructure of a Hadoop cluster
Figure 18: Voting in Ensemble Methods
Figure 19: Using reducers to train ensembles
Figure 20: Central Process
Figure 21: Control classes
Figure 22: Thread classes
Figure 23: PipelineStage and ConfigurableStage
Figure 24: ExecutableStage and TrainableStage
Figure 25: MultiStagePipeline
Figure 26: Pipeline assembly
Figure 27: Generic Types
Figure 28: Aggregation Hierarchy
Figure 29: Document Summarizer
Figure 30: Pipeline Editor
Figure 31: Dashboard
Figure 32: Scalability

Tables
Table 1: Accuracy measures
Table 2: Generalized interfaces
Table 3: Sanitizing Twitter status updates
Table 4: Stored Twitter metadata
Table 5: Basic properties of PipelineStage
Table 6: Common properties of TrainableStages
Table 7: Accuracy results of the evaluation runs
Table 8: Accuracy of the OpinionDetector
Table 9: Runtime of pipeline
Table 10: Overview of Classification Approaches
Table 11: WordCount Example
Table 12: Classification and Clustering Algorithms of Mahout
Table 13: MPQA Translation file


Abbreviations
API: Application Program(ming) Interface
CNB: Complementary Naïve Bayes
CRF: Conditional Random Field
CV: Cross Validation
HDFS: Hadoop Distributed File System
IE: Information Extraction
IR: Information Retrieval
LG: Logistic Regression
MAP: Maximum a posteriori
ML: Machine Learning
MR: MapReduce
NB: Naïve Bayes
NLP: Natural Language Processing
OE: Opinion Extraction
OM: Opinion Mining
POS: Part-of-Speech
RS: Random Sampling
SA: Sentiment Analysis
SGD: Stochastic Gradient Descent
SVM: Support Vector Machine
URL: Uniform Resource Locator

1 Introduction

"Public sentiment is everything. With public sentiment, nothing can fail. Without it, nothing can succeed." - Abraham Lincoln[1]

"What do you think about [X]?" is a very common question pattern. As (Pang & Lee, 2008) note, we are very interested in getting to know the opinions other people hold on various subjects. When it comes to doctors, mechanics, restaurants, or products such as cell phones or digital cameras, it is very typical to ask friends and others for recommendations. As the above quote demonstrates, the prevailing opinion mattered long before the Internet was even imagined. Now, with the rise of the Internet and its services, expressing and sharing one's own opinion is easy, and so is finding the publicly expressed opinions of others. Special websites allow publishing reviews and recommendations for businesses (e.g. Yelp[2]), movies (IMDb[3]), and virtually any kind of product (Amazon[4]). In addition, opinions can be published on websites, (we-)blogs, and a myriad of social networks like Facebook[5], Twitter[6], or Tumblr[7]. Published opinions can be embedded in various types of data such as text, images, audio, or video data (Binali, Potdar, & Wu, 2009). In this thesis, the focus will be on the computational extraction and evaluation of opinions contained in textual data found in various online sources.

The central problems can be expressed simply: Given a piece of text, does it contain an opinion? If so, is it a positive or negative opinion? And to what entity does it relate (the subject, or target, of the opinion)? For example, the sentence "The Leicon X-12 takes horrible pictures." contains an opinion, as the phrase "horrible pictures" is not a fact, but rather a subjective judgment. The subject of this judgment is the Leicon X-12, a fictional camera. And a human reader can surely agree that, as opinions toward cameras go, this one is not flattering.

Applications for this kind of technology are many. Consumers could benefit in their purchasing decisions from summarizations and aggregations of published opinions (Pang & Lee, 2008), while review sites (such as Yelp, IMDb, etc.) can present aggregated content of better quality if they are able to interpret free-form natural language text (Pang & Lee, 2008). In the field of Business Intelligence, the ability to track and evaluate public opinion on one's own products and those of competitors to support business decisions is actively researched (Binali et al., 2009; Pang & Lee, 2008). Imagine the manufacturer of the camera from before: If the product manager wants to understand less-than-satisfying sales, he could consult evaluations of opinions toward his product. One example is the evaluation and summarization of opinions toward a product and its features, such as lenses or batteries in the case of the camera, as developed in (M. Hu & Liu, 2004a, 2004b). Information like this could directly benefit R&D and marketing departments (Binali et al., 2009). Comparative opinions, opinions in which the author compares a product with a competing one, can be mined for Competitive Intelligence (Xu, Liao, Li, & Song, 2011). Sentiment analysis can also be used to predict future sales (Y. Liu, Huang, An, & Yu, 2007; Mishne & Glance, 2006; Mishne & Rijke, 2006a) or as a subcomponent to improve customer relations (Gamon, 2004; D. Lee, Jeong, & Lee, 2008; Oelke et al., 2009).

Applications also abound in the field of politics (Binali et al., 2009; Pang & Lee, 2008). Opinion mining could be beneficial to election monitoring, prediction, and the improvement of opinion polls (Gloor, Krauss, Nann, Fischbach, & Schoder, 2009; O'Connor, Balasubramanyan, Routledge, & Smith, 2010; Wanner, Rohrdantz, & Mansmann, 2009). It could also improve the policy design process by providing insight into public sentiments toward policy ideas (Binali et al., 2009; Cardie, Farina, & Bruce, 2006; Kwon, Shulman, & Hovy, 2006). In (Pang & Lee, 2008) a number of studies are cited that investigate how sentiment analysis can contribute toward understanding the political positions of voters and politicians. Another potential application field is government intelligence and national security, for example the automatic surveillance and evaluation of web chatter (Abbasi & Chen, 2007; Bermingham, Conway, McInerney, O'Hare, & Smeaton, 2009) or the analysis of the public's reaction to a catastrophic event (Cheong & Lee, 2010). Further examples of applications can be found in (Binali et al., 2009; Pang & Lee, 2008). The research field which investigates this topic is known in the academic discourse either as Opinion Mining or Sentiment Analysis (Binali et al., 2009; Pang & Lee, 2008).

Opinion mining applications in real-world use often encounter a practical problem: the sheer volume of published opinions on the web (Binali et al., 2009; Pang & Lee, 2008). Although the availability of data generally benefits analysis approaches (Banko & Brill, 1998; Bishop, 2006; Brants, Popat, & Och, 2007; Halevy, Norvig, & Pereira, 2009), this benefit becomes a drawback if it becomes too costly or infeasible to process the data efficiently. To mine this mountain of data, commonly referred to as "Big Data", Hadoop-based parallel processing can be used (Manyika et al., 2011; Russom, 2011; White, 2012). Over the course of this thesis, a sentiment tracking engine will be developed which takes advantage of Hadoop-based parallelization technology.

In the remainder of this chapter, important concepts and definitions for common terms will be introduced. Section 1.3 describes the aim of this thesis. The end of this chapter gives an outline of the following chapters.

[1] As in (Lincoln, Douglas, & Angle, 1991, p. 128)
[2] http://www.yelp.com/
[3] http://www.imdb.com
[4] http://www.amazon.com
[5] http://www.facebook.com
[6] http://www.twitter.com
[7] http://www.tumblr.com

1.1 Concepts and Common Terms

Opinion Mining is closely related to a number of fields (Binali et al., 2009), namely Information Extraction (IE), Information Retrieval (IR), Natural Language Processing (NLP), Machine Learning (ML), and Web (Data) Mining. As noted by (Pang & Lee, 2008), the terms Opinion Mining and Sentiment Analysis often describe the same field and have been used interchangeably. In what appears to be the earliest mention (Pang & Lee, 2008), in 2003, (Dave & Lawrence, 2003) define an opinion-mining tool as one which would "process a set of search results for a given item, generating a list of product attributes (quality, features, etc.) and aggregating opinions about each of them (poor, mixed, good)" (Dave & Lawrence, 2003, p. 519). The term Sentiment Analysis stems from the use of the term sentiment in the context of the "automatic analysis of evaluative text and tracking of the predictive judgments therein" (Pang & Lee, 2008, p. 10). To date, no uniform terminology seems to be established (Binali et al., 2009; Andrea Esuli, 2008; Pang & Lee, 2008). A simple, overarching definition of this field can be stated as "the computational treatment of (in alphabetical order) opinion, sentiment, and subjectivity in text" (Pang & Lee, 2008, p. 8). However, for better access to the concepts and to allow for a clear structure, a distinction will be introduced in the context of this thesis: Opinion Mining will refer to the overall process of extracting, preprocessing, and evaluating opinionated text, including the presentation of the evaluation results; Sentiment Analysis will be the term for the narrower task of evaluating the sentiment expressed in a given document.

The terms opinion, sentiment, and subjectivity require some distinction as well. In fact, as (Pang & Lee, 2008) point out, opinion and sentiment are in one synonym group as defined by the widely accepted Merriam-Webster[8] online dictionary (Merriam-Webster, 2012a). They are both defined as "a judgment one holds as true" (Merriam-Webster, 2012a). A judgment is held by one entity in regard to another. In the research literature concerned with opinion mining and sentiment analysis, the opinion holder (B. Liu, 2010; Pang & Lee, 2008) is the entity which holds the judgment, while the referred entity is the subject (Yi, Nasukawa, Bunescu, & Niblack, 2003). Other terms for subject in common use are target (Yi et al., 2003), topic (Tang, Tan, & Cheng, 2009), object (Binali et al., 2009), entity (Binali et al., 2009), or item (Binali et al., 2009). Although both opinion and sentiment share a common definition, there are subtle differences in the contexts in which one would use them: An opinion is defined as "a conclusion thought out yet open to dispute" while a sentiment suggests "a settled opinion reflective of one's feelings" (Merriam-Webster, 2012b).

[8] http://www.merriam-webster.com

In several literature reviews (Pang & Lee, 2008; Tang et al., 2009), Wiebe (JM Wiebe, 1994) is credited with the introduction of the term subjectivity to the field. Influenced by Banfield's theories (Banfield, 1982) and based on the definitions by Quirk et al. (Quirk, Greenbaum, Leech, Svartvik, & Crystal, 1985)[9], subjectivity is defined by the concept of private states (Quirk et al., 1985). Private states are states which are not open to objective observation and verification, and therefore represent unobservable internal states of a person like the holding of opinions, beliefs, thoughts, feelings, emotions, goals, evaluations, and judgments (Janyce Wiebe, Wilson, & Cardie, 2005). An example[10]: a person may be observed to assert that God exists, but not to believe that God exists. Belief is in this sense private. In that vein, subjectivity is the antonym of objectivity, and a sentence which contains an opinion is judged as subjective. A sentence, on the other hand, which contains only factual statements (facts), "objective expressions about entities, events and their properties" (B. Liu, 2010, p. 1), would not be (B. Liu, 2010). Another commonly used label for a non-subjective piece of textual information is neutral (Baccianella, Esuli, & Sebastiani, 2008; Pang & Lee, 2005; Janyce Wiebe et al., 2005; Wilson, Wiebe, & Hoffmann, 2005). In conclusion, an opinion is a subjective judgment which is thought out and may be disputed by others. In the context of this thesis, opinions are subjective judgments expressed in human-understandable textual form. The evidence of the expression is called opinionated text.

[9] As stated in (Pang & Lee, 2008)
[10] From (Quirk et al., 1985, p. 1198) as used in (Janyce Wiebe et al., 2005)

The term sentiment refers to the orientation of an opinion that deviates from the neutral state (Yi et al., 2003). The orientation of a subjective statement is represented by a sentiment label assigned from a fixed-size set. The simplest set is binary and contains the opposing labels {positive, negative} (Pang & Lee, 2008; Tang et al., 2009). In this case, the term polarity is commonly used instead of orientation (Pang & Lee, 2008). Larger sets are referred to as multi-class sets (Tang et al., 2009). One exemplary set would combine polarity with intensity to allow for finer differentiation (Tang et al., 2009; Janyce Wiebe et al., 2005). Intensity is the strength of the expressed sentiment (Janyce Wiebe et al., 2005); other terms are force (A Esuli, 2008) and strength (Pang & Lee, 2008). A multi-class set combining polarity with intensity may look like this: {very positive, moderately positive, low positive, low negative, moderately negative, very negative}. A radically different example is the multidimensional model used by (Akcora, Bayir, Demirbas, & Ferhatosmanoglu, 2010), which is a set of class labels based on a model of the eight basic emotions (Parrot, 2000): {Anger, Sadness, Love, Fear, Disgust, Shame, Joy, Surprise}. While the former multi-class set is ordinal, the latter is nominal. Metric schemes have also been proposed (Andrea Esuli & Sebastiani, 2006a).
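In an implementation, such a label set maps naturally onto an enumerated type. A minimal illustration in Java (the type and constant names are illustrative, not taken from any cited system):

```java
/**
 * An ordinal multi-class sentiment label set combining polarity with
 * intensity, as described above. Declaration order encodes the ordering
 * from most positive to most negative.
 */
public enum SentimentLabel {
    VERY_POSITIVE,
    MODERATELY_POSITIVE,
    LOW_POSITIVE,
    LOW_NEGATIVE,
    MODERATELY_NEGATIVE,
    VERY_NEGATIVE
}
```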

1.2 Discovery Process

As it is considered best practice in literature reviews to document the source discovery process (Vom Brocke, Simons, & Niehaves, 2009; Webster & Watson, 2002), the process which was employed to find relevant research will be documented at this point. The search engines used were the meta search engine Google Scholar[11] and the search engine of the ACM Digital Library[12]; the latter, however, was only used in some cases to verify that Google Scholar would not omit relevant research (it did not). Papers were queried with the keywords "Opinion Mining", "Sentiment Mining", "Opinion Analysis", and "Sentiment Analysis" to gather research on opinion mining and related information. For the topic area concerned with parallelization and Hadoop, the keywords "Parallel", "Parallel Processing", and "Hadoop" were used both alone and concatenated via an AND-operator with the opinion-mining search terms (themselves joined by OR-operators). For the last topic area, concerned with opinion identification, summarization, diffusion, and visualization, the keywords "diffusion", "summarization", and "visualization OR visualisation" were used with the opinion-mining search terms in the same manner. The first 80 hits[13] for each query were examined, and those deemed relevant were added to the research corpus that formed the basis of this thesis. For every paper cited in this thesis, a backward search[14] has been executed and relevant references have been added to the corpus.

[11] http://scholar.google.com
[12] http://dl.acm.org/
[13] The cutoff point was determined by judgement.
[14] A backward search is a review of the references of a paper (Vom Brocke et al., 2009).

1.3 Aim of this thesis

After a review of the related research, some opportunities for further research have been recognized: Current research in this field has gone to great lengths to investigate specific problems of the field. Previous reviewers of this literature structured the available knowledge of this nascent research field. In those reviews, the authors took great care to discuss and compare the different approaches and the necessary preparatory steps. However, at this point, an explicit framework which describes the architecture of a generic opinion mining application does not seem to have been established. A framework like this would be immensely useful, as it would give implementers of a system a reference and would formalize design questions. In addition, it would allow researchers to describe where their contribution fits in the context of the other parts of a greater system.

With the increasing availability of parallel computing technology and the rise of available opinionated content, it seems like a reasonable conclusion that the application of the former to analyze the latter could be immensely beneficial. Especially the Hadoop framework, with its particular strength in computation on large, unstructured text data and active research into parallelized classification techniques, seems to be a technology worth investigating. However, little research into this topic of great practical use seems to have been published. Another research area of academic interest, which has only been marginally investigated until now, is the application of opinion mining technology to track the spread, or diffusion, of opinions within online sources. While a lot of work has been put into research of summarization and visualization techniques, research into the identification and differentiation of particular opinions is quite sparse. Some groundbreaking work in tracking the emergence, transforming events, and the subsiding of opinions over the course of time has been published. These works, however, limit the tracking to opinions emerging within one source (e.g. Twitter), but do not investigate how diffusion across different sources can be tracked. Therefore, the three particular research problems this thesis investigates are:

- Development of a framework for opinion mining applications;
- Review of parallelized processing approaches;
- Investigation of technological solutions to track opinions.

The reference architecture will be formulated based on derivation of essential component parts of systems described in the reviewed literature. It will be generic and not assume a particular technology for implementation, thus allowing greater applicability.

Based on the literature discovered, the benefits, drawbacks, and unique design problems of parallelizing opinion mining will be investigated on the basis of the Hadoop framework. In light of established research, a clustering-based approach to tracking opinions will be formulated. For demonstration and evaluation purposes, DOX, a prototypical implementation of an opinion-mining tracking engine, will be part of the work on this thesis. The application will be written in Java and use a number of available open-source software modules. It will be designed to run on a Hadoop cluster of multiple networked nodes. The application of the developed framework will be shown by using it as the basis of the prototype. Using the framework as the basis, an implementation view will be created.

1.4 Structure of this thesis

Chapter 2 will familiarize the reader with the identification of sentiment orientation using machine-learning classification techniques. The following chapter, chapter 3, will illustrate the overall process of extracting opinions from online sources, from data collection through preprocessing to extraction. This chapter will conclude with a description of the process framework. Chapter 4 will review the diffusion dynamics of opinions, and methods to summarize and visualize the opinion analysis results. The issue of constructing an opinion mining system which can be executed in a parallel and distributed manner is discussed in chapter 5. Chapters 6 and 7 present the system design of DOX and its evaluation, respectively. The final chapter of this thesis, chapter 8, contains concluding remarks.

2 Sentiment Analysis

2.1 Overview

Sentiment analysis is a difficult task, as it requires teaching a computer to detect human emotion in natural language written text (Pang & Lee, 2008). It has to recognize the sentence "I totally loved that movie" as positive, as it conveys a positive opinion, and it has to recognize that "This camera is useless" expresses a negative sentiment, all without truly understanding the concept of a sentiment. This chapter will introduce the necessary classification techniques which can be employed to classify a piece of text into a sentiment orientation class like "positive" or "negative". This is the central component of opinion mining programs. The classification techniques discussed in this chapter will inform the decision which classifiers will be included in DOX.

2.1.1 Classification

An essential concept for the purpose of opinion mining is classification (Denecke, 2008; Pang & Lee, 2008; Parikh & Movassate, 2009). Classification is the task of assigning a class label from a pre-defined set of classes to a presented case, given a set of observed attributes or features of that case (Han, Kamber, & Pei, 2006; Taylor, Michie, & Spiegelhalter, 1994). A case in the current context could be a sentence or document to which one of the two labels in the set {positive, negative} has to be assigned. A classification procedure is a formal method which can make classification judgments repeatedly (Taylor et al., 1994); this is commonly referred to as a classifier (Friedman, Hastie, & Tibshirani, 2009; Han et al., 2006). Formally, the problem of constructing a classifier can be defined as follows (Friedman et al., 2009; Han et al., 2006): Given a tuple $X$ drawn from the input space $\mathcal{X}$ and an output space $\mathcal{Y}$ of labels, construct a mapping $f: \mathcal{X} \to \mathcal{Y}$ which can predict the class label $y$ associated with the tuple $X$. $X$ represents a case through an n-dimensional vector of observed features, denoted as $X = (x_1, \dots, x_n)$, where $x_i$ is the realization of the $i$-th feature (Friedman et al., 2009; Han et al., 2006). The output space is the set of class labels, which makes $y$ a qualitative variable (Friedman et al., 2009; Han et al., 2006). Constructing a classifier is obviously only necessary when the mapping is not explicitly known.

Common to (machine learning) classification techniques is the requirement of sample data, which is used to construct the mapping function (Friedman et al., 2009; Han et al., 2006; Mitchell, 1997). As the sample dataset, due to constraints of practicality, is usually only a subset of the possible problem space, the true mapping function $f$ can only be approximated (Friedman et al., 2009; Han et al., 2006; Mitchell, 1997). Based on the dataset, an estimate $\hat{f}$ of the function is induced. The function is chosen from the hypothesis space $H$ of all possible mappings so as to minimize a particular loss function over the data (Friedman et al., 2009; Han et al., 2006; Mitchell, 1997). The resulting function is often called the best fit (Friedman et al., 2009; Mitchell, 1997), as it is the hypothesis which best explains the data (a compact formulation of this criterion is given at the end of this subsection). In that regard, it should be remarked that an overfitted classifier is one that explains the sample data used to construct it very well, but fares poorly on real-world samples (Han et al., 2006).

Classification techniques have been the offspring of three different fields of research, namely statistics, machine learning, and neural network theory (Taylor et al., 1994). Statistical approaches can be characterized by having an explicit underlying probability model, which expresses class membership as a likelihood given the features of the presented case (Taylor et al., 1994). Neural networks can approximate any general function through a network of interconnected nodes, each producing a non-linear function of its input as output, imitating the thought processes of a brain (Friedman et al., 2009; Lippe, 2005). Machine learning defines the field of study which researches automatic computing procedures that have the ability to learn from data without human interaction in the learning process (Mitchell, 1997; Taylor et al., 1994). In the general context, all algorithmic implementations of classifiers that learn from data can be subsumed under the umbrella of machine learning and its paradigms (Mitchell, 1997), and this will also be the context in which classifiers will be discussed in this thesis.
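The induction step described above can be stated compactly. The following is a standard textbook formulation of the "best fit" criterion (not quoted verbatim from the cited sources):

$$\hat{f} = \operatorname*{arg\,min}_{h \in H} \sum_{i=1}^{m} L\bigl(y_i, h(X_i)\bigr)$$

where $(X_1, y_1), \dots, (X_m, y_m)$ are the labeled sample cases and $L$ is the chosen loss function.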

2.1.2 Machine Learning

A machine learning (ML) program is a computer program that can learn a concept based on example data (Friedman et al., 2009; Mitchell, 1997; Taylor et al., 1994). With increased experience, meaning some sort of learning effect has taken place after exposure to example data, it should perform better at a task applying the learned concept (Mitchell, 1997). ML programs have a wide array of uses; among their demonstrated abilities are classifying astronomical structures, recognizing faces (Mitchell, 1997), and classifying emails as spam (Conway & White, 2012). Formally, a definition of learning in ML programs is given by (Mitchell, 1997):

Definition[15]: A computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$, if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.

Two important classes of learning can be distinguished in ML: supervised and unsupervised learning (Friedman et al., 2009; Taylor et al., 1994).
[15] Quote from (Mitchell, 1997, p. 2)


Both types of learning require example data, referred to as training data (Han et al., 2006). They require slightly different types of training data, however: Unsupervised ML programs require only data of the input space and are expected to find distinctions between the example cases themselves to establish classes (Friedman et al., 2009; Taylor et al., 1994), a process also called clustering (Han et al., 2006; Taylor et al., 1994). Supervised ML programs, on the other hand, require a sample of the input space as example cases as well, but also expect class label associations (Friedman et al., 2009; Taylor et al., 1994). The name of this concept refers to the idea that a hypothetical teacher, or supervisor, exists which has the ability to distinguish cases into classes without error and guides the learning process (Friedman et al., 2009; Taylor et al., 1994). An obvious disadvantage of this process is that correctly labeled sample data has to be available.

2.2 Sentiment Classification

The assignment of a sentiment to a given sentence is equal to the assignment of the label positive or negative conditional on the given sentence (in the binary case). The task at hand is therefore essentially a classification task. Surveying the literature, there are essentially two different design approaches, a finding shared by (Gamon, 2004): One approach utilizes a-priori knowledge in the form of lexical resources; the other utilizes machine-learning techniques. When distinguishing classification approaches, it is most common to group existing approaches by the learning behavior of the classifier, which allows one to distinguish between supervised and unsupervised approaches (B. Liu, 2010; Pang & Lee, 2008). This is an important characteristic, as both approaches require different kinds of training data to learn (Section 2.1.2): The former requires pre-labeled opinionated text, while the latter requires training data which holds sufficient relationship information to identify class membership[16] (B. Liu, 2010; Pang & Lee, 2008).

Another characteristic, which is so far not recognized in the literature of this topic, is the strength of the assumptions of the underlying model. The surveyed works are usually based on a model of the data which carries assumptions of varying strengths, ranging from assumption-low (few assumptions must be valid for the model to hold), for example (J. Lin & Kolcz, 2012), to assumption-high (complicated assumptions must hold). In (J. Lin & Kolcz, 2012), two simple assumptions are made: (1) that tweets with smiley emoticons (":-)") contain positive sentiment, while tweets with frowny-face emoticons (":-(") contain negative sentiment (this assumption is only used to generate labeled training data; after that, the emoticons are purged to prevent a target leak, as the emoticons would otherwise be the strongest indicator of class membership and the classifier would basically ignore the other words); and (2) that words which are close together may have an impact on the orientation of the opinion. Otherwise, no assumptions are made and the classifier is left to its own devices to decide which aspects of a tweet are relevant for determining opinion orientation. This approach could also be described as data-driven. On the other end of the range, assumption-high approaches are based on models with a number of sophisticated assumptions; examples are approaches based on language models like (Jin, Ho, & Srihari, 2009; Yi et al., 2003).

[16] E.g. positive or negative classes in the binary case
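A minimal sketch of this emoticon-based generation of labeled training data, assuming only the two emoticon spellings mentioned above (class and method names are illustrative, not taken from the cited paper):

```java
/**
 * Sketch of distant labeling in the spirit of (J. Lin & Kolcz, 2012):
 * the emoticon supplies the class label and is then purged from the
 * text so that it cannot leak into the features.
 */
public class EmoticonLabeler {

    /** Returns {label, cleanedText}, or null if no training example can be derived. */
    public static String[] label(String tweet) {
        // A real implementation would discard ambiguous tweets containing
        // both emoticons; this sketch checks the positive one first.
        if (tweet.contains(":-)")) {
            return new String[] { "positive", tweet.replace(":-)", " ").trim() };
        }
        if (tweet.contains(":-(")) {
            return new String[] { "negative", tweet.replace(":-(", " ").trim() };
        }
        return null;
    }
}
```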

2.3 Supervised Approaches

2.3.1 Features

The documents to analyze have to be properly transformed into suitable input data for the classifier (B. Liu, 2010; Manning, Raghavan, & Schütze, 2009; Pang & Lee, 2008). The document has to be encoded in a data tuple or vector representation which can be processed by the classifier (Han et al., 2006; Manning et al., 2009; Pang & Lee, 2008). The elements in this vector represent the presence, absence, or quality of a particular property of a document. These properties are referred to as features. Feature selection is an important design decision which impacts classifier quality and efficiency (Pang et al., 2002). Pang & Lee's survey (Pang & Lee, 2008) contains a list of features that are commonly used in opinion mining: term presence and frequency for n-grams, parts-of-speech features, syntax, negation, and topic-oriented features. Liu agrees with this list, but adds opinion words and phrases (B. Liu, 2010). In this thesis, the existing lists are extended.

Token-based features

Tokenization

Tokenization is the process in which a character sequence is split into parts, referred to as tokens (Manning et al., 2009). Tokens are "instances of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing" (Manning et al., 2009, p. 22). Token-based features are also referred to as terms, which are (potentially standardized) character sequences of a class of tokens with the same sequence (Manning et al., 2009). Tokenization occurs on a predefined document unit (Manning et al., 2009), e.g. on the whole document or a sentence, or any part in between, like a paragraph.

Character sequences of pre-determined length

In (J. Lin & Kolcz, 2012), Twitter status updates are tokenized using a sliding-window byte tokenizer of length 4. This means they created tokens of 4 characters each (as a byte corresponds to a character) from a message (at most) 140 characters long. Using a sliding-window tokenizer, they formed the first token from the first 4 bytes, the second by moving the window 1 byte to the right, including the second to fifth byte, and so on, until the last token contained the fourth-last to the last byte. This purely data-driven approach has the advantage that it makes few assumptions beyond the assumption that a number of characters in sequence carry enough information to allow classification. In their paper, they were able to show that byte 4-grams are sufficient to create a classifier with 80%-82% accuracy (80% and above is considered good). This, however, was only possible by training on very large datasets (10 million records and larger) and creating ensembles of at least three classifiers.
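A minimal sketch of such a sliding-window tokenizer (the class name and window parameter are illustrative; the cited paper fixes the window at 4 bytes):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sliding-window character n-gram tokenizer in the spirit of
 * (J. Lin & Kolcz, 2012): every window of n consecutive characters
 * of the input becomes one token.
 */
public class ByteNGramTokenizer {

    private final int windowSize;

    public ByteNGramTokenizer(int windowSize) {
        this.windowSize = windowSize;
    }

    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        // Slide the window one character at a time over the input.
        for (int i = 0; i + windowSize <= text.length(); i++) {
            tokens.add(text.substring(i, i + windowSize));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [love, oved, ved , ed i, d it]
        System.out.println(new ByteNGramTokenizer(4).tokenize("loved it"));
    }
}
```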


Word n-grams

Text tokenization into n-grams is a common technique in information retrieval (Manning et al., 2009), and it is frequently used in opinion mining (Sanjiv R Das & Chen, 2001; Gamon, 2004; Pang & Lee, 2008; Wiegand & Klakow, 2008). A word n-gram is a token formed from n words. Special classes are word unigrams, tokens from one word, and word bigrams, tokens formed from two words (Pang & Lee, 2008). There is some debate in regard to whether higher-order n-grams improve classifier quality or if unigrams are sufficient (Pang & Lee, 2008).

Term transformation

Extracted tokens can be used as features. However, sometimes a number of different but related tokens are mapped into one class identified by a class head, called a term. Common word-to-term transformations are stemming (Dave & Lawrence, 2003; Gamon & Aue, 2005; M. Hu & Liu, 2004a; Xu et al., 2011) and lemmatization (Choi & Cardie, 2008; Mullen & Collier, 2004). Spelling errors or frequent uncommon spellings can be mapped to a term class by fuzzy matching (Manning et al., 2009; Parikh & Movassate, 2009). Hashed representations of terms have been used in (J. Lin & Kolcz, 2012).

Term presence vs. frequency

There are two common ways to represent terms that occur in a document unit: either in a binary format, where only feature presence or absence is encoded, or in a way that encodes the number of occurrences of the term in a real value (Manning et al., 2009; Pang & Lee, 2008). The number of occurrences is referred to as term frequency, which may either be an integer value (the actual count) or a normalized value (Manning et al., 2009). In sentiment analysis it has been observed that term presence seems to lead to better classifier quality compared to term frequency (Pang et al., 2002). For short documents, presence seems to be especially sufficient, as frequency and presence vectors will differ only very little from each other (Gamon, 2004; J. Lin & Kolcz, 2012; Manning et al., 2009; Parikh & Movassate, 2009).

The position of a token in the document has also been used as a feature. In (Pang et al., 2002), the position of a token was encoded at a very coarse granularity (the position classes were first quarter, last quarter, or remainder of the document). Position information did not improve classifier quality significantly, but this may be because of the coarse granularity.

Linguistic features

On a higher level, semantic and syntactic information can be incorporated into the classifier. Linguistic features can be derived from processing text properties like parts of speech, syntax, or valence shifters.

Parts of Speech

Including Parts of Speech (POS) features is a fairly basic technique to add linguistic input to the classification process (Pang & Lee, 2008). A part of speech refers to a (linguistic) class of a word (Kroeger, 2005, p. 33ff), and in the context of sentiment analysis it commonly means identifying the membership of a word in one of the following classes: noun, adjective, adverb, verb, preposition, and other such categories (Pang & Lee, 2008). Commonly, POS information will be added to the text by annotation, referred to as tagging (Pang & Lee, 2008). POS tagging is the process which adds a POS class label to a word in the original document, or to a derived term, by a tagger (Turney, 2002). An example is given below:

1: "I love this house"  POS(love) =: verb_love
2: "I am in love"       POS(love) =: noun_love

POS identification can serve two functions: It can aid in word sense disambiguation, and it can guide feature selection. Using POS information for word sense disambiguation is a technique not limited to sentiment classification, but quite common for general text analysis (Pang & Lee, 2008; Wilks & Stevenson, 1998). It can also add context to other NLP approaches (Xia & Xu, 2007). POS identification can also help identify important features (Pang & Lee, 2008). The presence of adjectives is considered a good indicator of subjectivity in text (Hatzivassiloglou & Wiebe, 2000), which has led to a subsequent effort in sentiment analysis to exploit this. However, using only adjectives as features seems to be inferior to the baseline approach of using unigrams (Pang et al., 2002). Turney suggests that the reason for this is the lack of sufficient context (Turney, 2002). Words of other POS classes can also carry sentiment (Pang & Lee, 2008); Riloff et al., for example, investigate sentiment-carrying nouns (Riloff, Wiebe, & Wilson, 2003). There are also a number of works which specifically extract phrases characterized by specific POS patterns. The first consideration of this technique is attributed to (Turney, 2002), who extracted phrases of two words (bigrams) which complied with one of five pre-determined patterns. These patterns were based on the POS classes adjective (JJ)[17], nouns (NN), adverbs (RB), and verbs (VB), as well as certain subcategories (such as singular and plural nouns). Each extracted phrase had to contain at least one adjective or adverb. Subsequent work has employed POS patterns to extract sentiment-bearing words (Gamon, 2004; Jin et al., 2009; Yi et al., 2003).
[17] Turney's symbols for these classes
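A minimal sketch of this kind of POS-pattern phrase extraction. Only one of Turney's five patterns is implemented here (adjective followed by noun); the tagger producing the tags is assumed to exist, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

/** Extracts two-word phrases matching the pattern JJ followed by NN/NNS. */
public class PosPatternExtractor {

    public static List<String> extractPhrases(String[] words, String[] tags) {
        List<String> phrases = new ArrayList<String>();
        for (int i = 0; i + 1 < words.length; i++) {
            // Pattern: adjective (JJ) followed by singular or plural noun.
            if ("JJ".equals(tags[i])
                    && ("NN".equals(tags[i + 1]) || "NNS".equals(tags[i + 1]))) {
                phrases.add(words[i] + " " + words[i + 1]);
            }
        }
        return phrases;
    }

    public static void main(String[] args) {
        String[] words = { "The", "Leicon", "X-12", "takes", "horrible", "pictures" };
        String[] tags  = { "DT",  "NNP",    "NNP",  "VBZ",   "JJ",       "NNS" };
        System.out.println(extractPhrases(words, tags)); // [horrible pictures]
    }
}
```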

Syntax

Incorporating syntax features into the classifier input has been attempted as well (Pang & Lee, 2008). But while it appears that the incorporation of syntax features can improve classifier performance on short documents (Kudo & Matsumoto, 2004) (compared to the unigram baseline), experience is at this point mixed (Pang & Lee, 2008).

Valence shifters

The inclusion of valence shifters into the set of features can also improve classifier quality. Valence shifters are negations, intensifiers, and diminishers (Pang & Lee, 2008). Negation often transforms the polarity of a sentence (Choi & Cardie, 2008; Pang & Lee, 2008). A word which is associated with a negative sentiment, like "doubt" for example, could be present in a sentence which is actually positive, because in the context of negator words (like "no" as in "no doubt") it means something positive, the polar opposite of the connotation of the single term. An example is given in (Choi & Cardie, 2008, p. 793):

1: [I did [not] have any [doubt] about it.]+
2: [The report [eliminated] my [doubt].]+
3: [They could [not] [eliminate] my [doubt].]-

In the first two example sentences, the negative sentiment associated with the word "doubt" in isolation is transformed by the presence of the negators "not" and "eliminated". In the third example, the presence of both negators creates a double negation, restoring and reaffirming the original sentiment associated with "doubt". A common technique to include negation in the feature set of the classifier is to use a negation detector to tag words that fall into the scope of a negator as their opposite (Sanjiv R Das & Chen, 2001; Pang et al., 2002). An example transformation of sentence 1 using unigrams, following the heuristic used in (Pang et al., 2002):

(I, did, NOT_have, NOT_any, NOT_doubt, NOT_about, NOT_it)

Note that this will increase the number of observable tokens to roughly twice the number of observable unigrams, as for every unigram feature another feature will be introduced. The inclusion of negation has been shown to improve classifier quality (Choi & Cardie, 2008; Ohana & Tierney, 2009; Parikh & Movassate, 2009).
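A minimal sketch of such a negation tagger, assuming a small negator word list and a punctuation-delimited scope (both are illustrative simplifications of the cited heuristics):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Tags every token between a negator and the next punctuation mark
 * with the prefix "NOT_", after (Das & Chen, 2001; Pang et al., 2002).
 */
public class NegationTagger {

    private static final Set<String> NEGATORS =
            new HashSet<String>(Arrays.asList("not", "no", "never", "n't"));

    public static List<String> tag(List<String> tokens) {
        List<String> result = new ArrayList<String>();
        boolean inNegationScope = false;
        for (String token : tokens) {
            if (NEGATORS.contains(token.toLowerCase())) {
                inNegationScope = true;          // negator opens the scope
                result.add(token);
            } else if (token.matches("[.,!?;:]")) {
                inNegationScope = false;         // punctuation closes the scope
                result.add(token);
            } else {
                result.add(inNegationScope ? "NOT_" + token : token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [I, did, not, NOT_have, NOT_any, NOT_doubt, NOT_about, NOT_it, .]
        System.out.println(tag(Arrays.asList(
                "I", "did", "not", "have", "any", "doubt", "about", "it", ".")));
    }
}
```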

Other properties

Topic-Oriented Features

Some works have attempted to account for interactions between the topic of a document and the expressed sentiment, known as topic-sentiment interaction (Pang & Lee, 2008). The majority of these works attempt to account for this interaction by feature generalization (Pang & Lee, 2008), e.g. tagging references to the work of art that is the subject of a review as THIS_WORK (Mullen & Collier, 2004), or replacing references to the subject with a generalized label as in (Soo-min Kim, Hovy, Way, Rey, & Edu, 2007).

2.3.2 Techniques

Naïve Bayes

The Naïve Bayesian[18] (NB) classifier is a frequently used classifier in the field of IR (Manning et al., 2009; Rennie et al., 2003), and it is therefore not surprising to find it applied in early papers on sentiment analysis (S. Das & Chen, 2001; Pang et al., 2002). The classifier is based on Bayes' Theorem (Friedman et al., 2009; Han et al., 2006):

Definition[19]: Given a feature tuple $X$ (evidence) and a hypothesis $H$ that the tuple belongs to a specified class $C$, the a posteriori probability of $H$ conditioned on $X$ is defined as

$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$

[18] Also known as the simple Bayesian classifier (Han et al., 2006)

The a posteriori probability $P(H \mid X)$ is the probability that the hypothesis $H$ holds, given the evidence $X$. The conditional probability of $X$ given the hypothesis is denoted by $P(X \mid H)$. $P(H)$ is the overall probability of occurrence of the hypothesis, while $P(X)$ denotes the same for the evidence (also known as the a priori probabilities for $H$ and $X$). Based on this theorem, a Bayesian classifier consists of a technique for estimating the probabilities on the right-hand side and a decision rule (Manning et al., 2009; Mitchell, 1997). Given a number of candidate hypotheses (potential class label assignments), it usually suffices to find the most probable one (Han et al., 2006; Manning et al., 2009; Mitchell, 1997); this decision rule is known as finding the maximum a posteriori (MAP) hypothesis (Manning et al., 2009; Mitchell, 1997). This hypothesis is denoted as $h_{MAP}$ and is defined as follows:

Definition[20]:

$$h_{MAP} = \operatorname*{arg\,max}_{h \in \mathcal{H}} P(h \mid X)$$

As it is difficult and computationally expensive to compute the conditional probabilities $P(X \mid H)$ for input data with many features (Han et al., 2006), the underlying model of the naïve Bayes classifier assumes that features are conditionally independent (Han et al., 2006; Manning et al., 2009; Mitchell, 1997). This allows formulating the conditional probability $P(H \mid X)$ in the following way:

Definition: Given a feature tuple $X = (x_1, \dots, x_n)$ (evidence) with dimension $n$ and a hypothesis $H$ that the tuple belongs to a specified class $C$, the a posteriori probability of $H$ conditioned on $X$ under the independence assumption is defined as

$$P(H \mid X) = \frac{P(H) \prod_{i=1}^{n} P(x_i \mid H)}{P(X)}$$

As the probability $P(X)$ is not relevant when determining the MAP hypothesis, the decision rule can be expressed as

$$h_{MAP} = \operatorname*{arg\,max}_{h \in \mathcal{H}} P(h) \prod_{i=1}^{n} P(x_i \mid h),$$

requiring only the estimation[21] of $P(h)$ and $P(x_i \mid h)$ (Han et al., 2006; Manning et al., 2009; Mitchell, 1997).

[19] From (Han et al., 2006, pp. 310-311)
[20] Based on (Mitchell, 1997, p. 157)

The assumption that features occur independently of one another allows implementations of this classifier to exhibit a high learning speed (Han et al., 2006; Rennie et al., 2003), while the implementation itself is quite easy (Rennie et al., 2003). However, one has to keep in mind that the independence assumption will often not hold true, especially when working on text (Manning et al., 2009; Pang et al., 2002; Rennie et al., 2003). In general, violation of the independence assumption will not result in a catastrophic failure of the classifier, as the classifier is very robust (Manning et al., 2009), and the advantages in speed and easy implementation can make the trade-off in accuracy a viable option (Han et al., 2006). In fact, NB classifiers are a good choice when the amount of training data is large (Manning et al., 2009). It is important to be aware of the specific systemic problems the classifier exhibits and of the underlying assumptions. For example, common implementations of the NB classifier exhibit a bias toward certain classes if the training data are skewed (one class is more frequent in the sample than others) (Rennie et al., 2003). Typically, one pre-empts this problem by using balanced data sets (Parikh & Movassate, 2009). A better approach would be to use an algorithm that is less biased, such as Complementary Naïve Bayes (CNB) (Rennie et al., 2003).

Another important decision when designing the classifier is the choice of the underlying probability model. Common models assume features are distributed according to a multinomial distribution or a multivariate Bernoulli distribution, especially when modeling language on the basis of word tokens (Manning et al., 2009). The difference between the two is that under a multinomial model feature frequency will be considered, while under the Bernoulli model feature presence is relevant (Manning et al., 2009). Under the multinomial model, a document representation as a vector would be drawn from a positive, integer-valued n-dimensional space, $X = [x_1, \dots, x_n] \in \mathbb{N}_0^n$, where $x_i$ is the number of occurrences of the feature $i$; alternatively it would be drawn from $[0,1]^n$, where the feature would be represented through a normalized frequency from the interval zero to one (Manning et al., 2009). The vector representation of a document under the Bernoulli model would be drawn from an n-dimensional unity space, e.g. $\{0,1\}^n$, where a value of 0 for feature $i$ would indicate its absence and a value of 1 its presence (Manning et al., 2009).
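As a concrete illustration (an example constructed here, not taken from the cited sources): over the vocabulary (bad, camera, good), the document "good good camera" is represented as $X = [0, 1, 2]$ under the multinomial model (counts), but as $X = [0, 1, 1]$ under the Bernoulli model (presence).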

[21] $P(h)$ may be omitted if classes have equal probability (homogeneous sample) (Han et al., 2006)

Important assumptions in both models are conditional independence (the occurrence of a feature is not dependent on the occurrence of another) and positional independence (feature order is not relevant) (Manning et al., 2009), and violations of these specific assumptions can have an adverse impact on the quality of the classifier (Manning et al., 2009; Rennie et al., 2003). A simple example of a violation of the independence assumption is stated in (Rennie et al., 2003): When classifying a document into the categories "San Francisco" and "Boston" using an NB classifier on unigram features, the classifier will be biased toward the "San Francisco" class, as both terms frequently co-occur while being treated as individual tokens. The term "San Francisco" will have double the weight compared to the term "Boston", skewing the accuracy. The authors suggest that implementations should employ weight normalization to correct for this bias (Rennie et al., 2003).

Due to the assumption of positional independence, feature position (the order of features in the vector) will not be considered by Naïve Bayes. But depending on the source, feature position may contain relevant information: In (Pang et al., 2002), the authors suggested that sentiments found at different positions in a movie review should be weighted differently, as these documents may have an underlying structure (overall sentiment, plot discussion, summary of the review). They tagged features with positional information (feature appears in the first quarter, last quarter, or middle of the document). Although this is a fairly easy way to encode positional information, it increases the dimensionality of the feature vector, and it did not yield any improvements in accuracy in the described case, because either the assumption was wrong (position is not relevant) or the model was too simple.

Deciding whether to use a multinomial or a Bernoulli model depends on the assumed underlying distribution of features (Manning et al., 2009; Rennie et al., 2003). When modeling language and using n-grams as features, it should be noted that for very short pieces of text (such as sentences and Twitter status updates) multinomial and Bernoulli models will be very similar (Manning et al., 2009; Parikh & Movassate, 2009). And although for most document classification tasks the multinomial model will perform better when classifying longer documents (Manning et al., 2009), for sentiment classification it has been noted that working only with feature presence will achieve better results (Pang et al., 2002).
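To make the estimation and the decision rule concrete, the following is a minimal sketch of a multinomial NB classifier with Laplace smoothing, computed in log space to avoid numerical underflow (class and method names are illustrative, not taken from any cited system):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Mirrors h_MAP = argmax_h P(h) * prod_i P(x_i | h) from this section. */
public class MultinomialNaiveBayes {

    private final Map<String, Integer> docsPerClass = new HashMap<String, Integer>();
    private final Map<String, Integer> tokensPerClass = new HashMap<String, Integer>();
    private final Map<String, Map<String, Integer>> termCounts =
            new HashMap<String, Map<String, Integer>>();
    private final Set<String> vocabulary = new HashSet<String>();
    private int totalDocs = 0;

    /** Count one labeled, tokenized training document. */
    public void train(String label, List<String> tokens) {
        docsPerClass.merge(label, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts =
                termCounts.computeIfAbsent(label, k -> new HashMap<String, Integer>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            tokensPerClass.merge(label, 1, Integer::sum);
            vocabulary.add(t);
        }
    }

    /** Return the MAP class for a tokenized document. */
    public String classify(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            // Log prior: log P(h).
            double score = Math.log(docsPerClass.get(label) / (double) totalDocs);
            Map<String, Integer> counts = termCounts.get(label);
            int classTokens = tokensPerClass.getOrDefault(label, 0);
            for (String t : tokens) {
                // Laplace-smoothed log likelihood: log P(x_i | h).
                int c = counts.getOrDefault(t, 0);
                score += Math.log((c + 1.0) / (classTokens + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```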


Support Vector Machines (SVM)

Support Vector Machines attempt to find the largest decision boundary between two classes (Friedman et al., 2009; Han et al., 2006). In essence, given a set of two class labels and labeled training data, the algorithm will attempt to find a linearly separating margin between the two classes in the feature space (Friedman et al., 2009; Han et al., 2006). The separating margin is called the maximum marginal hyperplane (MMH) (Han et al., 2006). Two cases are distinguished: If a linear separation can be drawn in the input space, the problem is considered linearly separable (Friedman et al., 2009; Han et al., 2006); if not, the problem is linearly inseparable and the data will be transformed into a higher dimension using a nonlinear mapping where the data can be separated linearly (Friedman et al., 2009; Han et al., 2006).

Figure 1: Separating Hyperplane (Prabowo & Thelwall, 2009, p. 151)

SVMs have been used in sentiment classification (Pang et al., 2002; Parikh & Movassate, 2009; Prabowo & Thelwall, 2009; Wang, Wei, Liu, Zhou, & Zhang, 2011). Class labels are drawn from the set {+1, -1}. Labeled training data consist of records (x_i, y_i), which contain vector representations x_i of documents and their assigned class labels y_i (Han et al., 2006; Prabowo & Thelwall, 2009). Feature vectors of dimension n can be drawn from a space R^n, therefore allowing a wide range of numerical representations (Prabowo & Thelwall, 2009).


A separating hyperplane is defined by w · x + b = 0, where w is a weight vector, b is the bias (a scalar), and x a feature vector (Han et al., 2006). In the case that the data from the input space are linearly separable, x is used directly; otherwise x is mapped into a higher-dimensional space via Φ(x), where Φ denotes a nonlinear mapping (Han et al., 2006), commonly realized through a kernel function. The MMH can be computed by solving a convex quadratic optimization problem (Han et al., 2006). The decision rule based on the MMH is stated as [22]:

f(Φ(d)) = +1 if w · Φ(d) + b >= 0; -1 otherwise

where Φ(d) is the appropriate vector representation of a document d and f a label function. This formula can be interpreted as the definition of the two sides of the margin. SVMs tend to perform better with less noise in the data, as they penalize features for crossing over the boundary (Friedman et al., 2009; Prabowo & Thelwall, 2009). This means that features whose values tend to be distributed across their feature space without regard to their class label will be weighted less (Prabowo & Thelwall, 2009). For example, the word "the" will be distributed in documents without regard to opinion orientation. For the domain of sentiment classification, SVMs may fail on training data containing vectors with features that frequently occur in positive and negative documents alike (Prabowo & Thelwall, 2009). Feature selection can optimize the results (Prabowo & Thelwall, 2009). SVMs also require a sufficient number of records (> 400) in the training set to achieve better accuracy (around 2,000 seems to be a sufficient lower bound) (Prabowo & Thelwall, 2009). It has been noted that, as with the Naïve Bayes classifier, using feature presence when working on n-grams creates a much more accurate classifier (Pang et al., 2002; Prabowo & Thelwall, 2009). All papers reviewed, except two (Kobayashi, Inui, & Matsumoto, 2007; Mullen & Collier, 2004), either explicitly stated that linear kernels were used or did not mention the kernel function at all; the latter case leads to the default assumption that linear kernels have been used. Mullen & Collier explicitly experimented with polynomial kernels, but did not find any consistent benefits over the use of the default linear kernel (Mullen & Collier, 2004). Therefore, at this point there is no evidence to suggest that anything other than the basic linear kernel function with default values leads to improvements in classifier quality (Gamon, 2004; Mullen & Collier, 2004).

[22] Modified from (Han et al., 2006, p. 340; Prabowo & Thelwall, 2009)
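Once a linear SVM has been trained, applying the decision rule is a simple dot product. The sketch below (names hypothetical; the weights are assumed to come from an already-solved optimization) illustrates this.

```java
/** Decision rule of a trained linear SVM (illustrative sketch; w and b are assumed given). */
public class LinearSvmDecision {
    private final double[] w; // weight vector, one entry per feature
    private final double b;   // bias

    public LinearSvmDecision(double[] w, double b) { this.w = w; this.b = b; }

    /** Returns +1 or -1 depending on which side of the MMH the vector x falls. */
    public int classify(double[] x) {
        double sum = b;
        for (int i = 0; i < w.length; i++) sum += w[i] * x[i];
        return sum >= 0 ? +1 : -1;
    }

    public static void main(String[] args) {
        LinearSvmDecision svm = new LinearSvmDecision(new double[] {0.8, -1.2}, 0.1);
        System.out.println(svm.classify(new double[] {1.0, 0.0})); // +1: positive side
        System.out.println(svm.classify(new double[] {0.0, 1.0})); // -1: negative side
    }
}
```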


Logistic Regression

Logistic Regression (LR) is a classifier which uses linear functions to model the a posteriori probabilities of class membership under given evidence (Friedman et al., 2009; Zhang, Jin, Yang, & Hauptmann, 2003). The model is restricted to ensure that the cumulative sum of a posteriori probabilities always sums to 1 for any presented evidence (Friedman et al., 2009). The probabilities are restricted to the interval [0,1] (Friedman et al., 2009). The definition of the a posteriori probability under the LR model is given below.

Definition [23]: Given a feature tuple (evidence) x and the hypothesis that the tuple belongs to a specified class k out of K classes, the a posteriori probability of class k conditioned on x is defined as

P(G = k | X = x) = exp(β_k0 + β_k^T x) / ( 1 + Σ_{l=1..K-1} exp(β_l0 + β_l^T x) ), for k = 1, ..., K-1

The parameter set θ = { β_10, β_1^T, ..., β_(K-1)0, β_(K-1)^T } has to be determined by fitting, i.e. training the model on data (Friedman et al., 2009).

[23] Adapted from (Friedman et al., 2009, p. 119)

For the binary classification case, the conditional probability can be expressed in the operationalized version (Zhang et al., 2003):

P(y | x) = 1 / ( 1 + exp( -y (w · x + b) ) )

where the class label is denoted as y ∈ {-1, +1} and the evidence vector as x. The intercept b and the weight vector w form the decision hyperplane. The vectors w and x must be identical in size. The model is typically fitted using maximum likelihood (Friedman et al., 2009) or Stochastic Gradient Descent (SGD) (Bottou, 2010). LR is closely related to SVMs, and it has been shown that LR can approximate the results of SVMs in large-scale text applications (Zhang et al., 2003). LR has been successfully used as a classifier in large-scale sentiment mining applications (Khuc, Shivade, Ramnath, & Ramanathan, 2012; J. Lin & Kolcz, 2012).

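The operationalized form lends itself directly to training by SGD: each step moves (w, b) along the gradient of the log-likelihood of a single example. The sketch below is a minimal illustration under assumed simplifications (hypothetical names, a fixed learning rate, no regularization).

```java
/** Binary logistic regression trained by plain SGD (illustrative sketch). */
public class LogisticRegressionSketch {
    private final double[] w;
    private double b;
    private final double learningRate;

    public LogisticRegressionSketch(int dimensions, double learningRate) {
        this.w = new double[dimensions];
        this.learningRate = learningRate;
    }

    /** P(y = +1 | x) = 1 / (1 + exp(-(w.x + b))). */
    public double probabilityPositive(double[] x) {
        double z = b;
        for (int i = 0; i < w.length; i++) z += w[i] * x[i];
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** One SGD step on a single example with label y in {-1, +1}. */
    public void update(double[] x, int y) {
        double z = b;
        for (int i = 0; i < w.length; i++) z += w[i] * x[i];
        // Negative gradient of log(1 + exp(-y (w.x + b))) w.r.t. w and b:
        double factor = y / (1.0 + Math.exp(y * z));
        for (int i = 0; i < w.length; i++) w[i] += learningRate * factor * x[i];
        b += learningRate * factor;
    }
}
```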
2.3.3 Training Data

Labeled data can be created by manual annotation. Pre-labeled data can be obtained by using reviews or feedback with an attached signal, such as a rating (Gamon, 2004; Mullen & Collier, 2004; Pang & Lee, 2008). Usually the rating needs to be transformed into the appropriate domain, e.g. into the two binary classes positive and negative for polarity classification (Gamon, 2004; Mullen & Collier, 2004; Pang et al., 2002; Pang & Lee, 2008). In (Gamon, 2004), the authors used customer feedback with an attached rating from 1 (not satisfied) to 4 (satisfied) and translated those into two sentiment classes. In (Mullen & Collier, 2004; Pang et al., 2002) movie and record reviews were used. Transformation into a binary domain can be performed by labeling the reviews of the highest opposing categories: negative if they have the lowest rating and positive if they have the highest rating (Gamon, 2004). This will create a training set in which the opinions should have the highest contrast. Depending on need, more records can be added by labeling the class with the second highest [lowest] rating as positive [negative], and so on (if there are more rating classes) until all records have been labeled (Mullen & Collier, 2004; Pang et al., 2002). Of course, each record should only occur in one class. Commonly, classes which represent the average rating (e.g. 3 in a 5-class rating system) are not included in the labeled training set, as they are considered as not containing a sentiment in either orientation (Mullen & Collier, 2004; Pang et al., 2002). Another source is user-supplied annotations which convey a subjective state. One example is the emoticon trick (Bifet & Frank, 2010; J. Lin & Kolcz, 2012; Pak & Paroubek, 2010): As users sometimes annotate their status updates with smileys or emoticons, symbols which are considered to encode and convey a certain emotional state, they create a sentiment indicator which is exploitable (Bifet & Frank, 2010). By compiling two lists, one of positive emoticons and one of negative emoticons, one can generate a set of labeled data by querying for documents with keywords supplied from those lists. Documents retrieved in this fashion are labeled with the class label of the keyword emoticon used to retrieve them. This emoticon trick is commonly used to create labeled training sets of short status updates from Twitter (Bifet & Frank, 2010; J. Lin & Kolcz, 2012; Pak & Paroubek, 2010).
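A minimal sketch of the emoticon trick follows; the emoticon lists are deliberately short, and all class and method names are hypothetical (not taken from any cited system).

```java
import java.util.*;

/** Emoticon-trick labeling of status updates (illustrative sketch with shortened lists). */
public class EmoticonLabeler {
    private static final List<String> POSITIVE = Arrays.asList(":)", ":-)", ":D");
    private static final List<String> NEGATIVE = Arrays.asList(":(", ":-(", ":'(");

    /** Returns "positive", "negative", or null if no (or conflicting) emoticons are found. */
    public static String label(String status) {
        boolean pos = POSITIVE.stream().anyMatch(status::contains);
        boolean neg = NEGATIVE.stream().anyMatch(status::contains);
        if (pos == neg) return null; // none found, or both: unusable as training data
        return pos ? "positive" : "negative";
    }

    public static void main(String[] args) {
        System.out.println(label("Loving the new phone :)")); // positive
        System.out.println(label("Battery died again :("));   // negative
        System.out.println(label("Mixed feelings :) :("));    // null
    }
}
```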

2.4 Unsupervised Lexicon Induction

Unsupervised lexicon induction is based on the paradigm that small text units have an intrinsic sentiment orientation which can be quantified or labeled (Ding, Street, Liu, & Yu, 2008; Andrea Esuli & Sebastiani, 2006a; B. Liu, 2010; Pang & Lee, 2008; Tang et al., 2009). These text units are referred to as opinion words and phrases (or sentiment words) (Ding et al., 2008). Words which encode a desirable state are attributed a positive orientation, while words encoding an undesirable state are attributed a negative orientation (Ding et al., 2008). If the sentiment orientation or the sentiment class label is known for a set of opinion words, a function applied to a document can take the presence of those terms as evidence for the orientation of the document (Ding et al., 2008; Pang & Lee, 2008). It is then a matter of finding an appropriate aggregation function, often called a score function (Denecke, 2008; Ding et al., 2008). A simple example of this concept is given in (Pang et al., 2002): Two humans were tasked to create a small list of indicator words for positive and negative sentiments in movie reviews. They independently created a set of terms (with less than twenty entries), where the terms were either labeled positive or negative (example: positive {dazzling, brilliant, phenomenal, ...}; negative {suck, terrible, awful, ...}). The classification function then simply counted the number of occurrences of positive and negative words, and classified the document as belonging to the class with the highest count. (On a set of 1,400 movie reviews with uniform class distribution, this simple approach achieved an accuracy of 58% and 64% (depending on the list used), while tying in 75% and 39% of the cases respectively (Pang et al., 2002).) Of course, this example only serves to illustrate the basic concept of this approach. The manual creation of such lexical resources is a tiresome process, especially as they require a large number of entries (B. Liu, 2010; Tang et al., 2009). Lexical resources that contain sentiment-indicating words or phrases with an associated sentiment class label or a quantification of the intrinsic sentiment are called a sentiment lexicon (Pang & Lee, 2008), opinion lexicon (B. Liu, 2010), or semantic orientation word lexicon (Tang et al., 2009). Terms in a sentiment lexicon may be reduced to a canonical representation through stemming (M. Hu & Liu, 2004a) or the use of lexemes (Andreevskaia & Bergler, 2006). Automatic methods of unsupervised lexicon induction can be divided into two categories: dictionary-based approaches and corpus-based approaches (Andreevskaia & Bergler, 2006; B. Liu, 2010; Tang et al., 2009).
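The counting score function from the example above can be stated in a few lines; the sketch below (word lists abbreviated from the example, names hypothetical) returns the class with the higher indicator count and reports a tie otherwise.

```java
import java.util.*;

/** Count-based lexicon classification as in the manual-list example (illustrative sketch). */
public class LexiconCountClassifier {
    private static final Set<String> POSITIVE = new HashSet<>(Arrays.asList("dazzling", "brilliant", "phenomenal"));
    private static final Set<String> NEGATIVE = new HashSet<>(Arrays.asList("suck", "terrible", "awful"));

    public static String classify(List<String> tokens) {
        int pos = 0, neg = 0;
        for (String t : tokens) {
            if (POSITIVE.contains(t)) pos++;
            if (NEGATIVE.contains(t)) neg++;
        }
        if (pos == neg) return "tie";
        return pos > neg ? "positive" : "negative";
    }

    public static void main(String[] args) {
        System.out.println(classify(Arrays.asList("a", "dazzling", "but", "terrible", "awful", "film"))); // negative
    }
}
```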

2.4.1 Dictionary-based approaches

Dictionary-based approaches are characterized by their use of seed words and an (online) dictionary (B. Liu, 2010; Tang et al., 2009). Based on a small set of seed words with known opinions, these approaches utilize the dictionary to bootstrap the lexicon. Most commonly, the dictionary used is Princeton's WordNet [24] (Andreevskaia & Bergler, 2006; Ding et al., 2008; Andrea Esuli & Sebastiani, 2005, 2006a, 2007; Fellbaum, 1998; Kamps, Marx, Mokken, & Rijke, 2004; Miller, 1995). Other considered dictionaries are the General Inquirer (GI) dictionary and Merriam-Webster (Andrea Esuli & Sebastiani, 2005; Kamps et al., 2004; Wilson et al., 2005). A simple technique is to exploit lexical relations (synonym/antonym groups) in WordNet using an iterative process: Given a seed list of words with a known orientation, the process labels synonyms of a known word with the same orientation as their seed and antonyms with the polar opposite orientation; then it adds the newly discovered words to the list of known words, and the process is repeated until no new words are found (Andreevskaia & Bergler, 2006; Andrea Esuli & Sebastiani, 2005; M. Hu & Liu, 2004b; B. Liu, 2010). An example is the work of Hu and Liu (M. Hu & Liu, 2004a, 2004b), where in a first step a list of opinion words with unknown orientation is compiled by association mining; then the orientation of these words is determined by the previously described iterative process on WordNet synonym/antonym groups (Figure 2) by discovering a relationship to a known seed word (words not in the list generated in the first step are dropped, as are opinion words for which no orientation can be determined).

[24] http://wordnet.princeton.edu/
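The iterative bootstrapping over synonym/antonym groups can be sketched as a fixed-point computation. The toy relations below stand in for WordNet lookups, which would normally be issued through a WordNet API; all names are hypothetical.

```java
import java.util.*;

/** Seed-based orientation propagation over synonym/antonym relations (illustrative sketch). */
public class LexiconBootstrap {
    public static Map<String, Integer> propagate(Map<String, Set<String>> synonyms,
                                                 Map<String, Set<String>> antonyms,
                                                 Map<String, Integer> seeds) { // word -> +1/-1
        Map<String, Integer> known = new HashMap<>(seeds);
        boolean changed = true;
        while (changed) { // repeat until no new words are labeled
            changed = false;
            for (String word : new ArrayList<>(known.keySet())) {
                int value = known.get(word);
                for (String syn : synonyms.getOrDefault(word, Set.of()))
                    changed |= known.putIfAbsent(syn, value) == null;   // same orientation
                for (String ant : antonyms.getOrDefault(word, Set.of()))
                    changed |= known.putIfAbsent(ant, -value) == null;  // opposite orientation
            }
        }
        return known;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> syn = Map.of("fast", Set.of("quick", "rapid"), "slow", Set.of("sluggish"));
        Map<String, Set<String>> ant = Map.of("fast", Set.of("slow"));
        System.out.println(propagate(syn, ant, Map.of("fast", 1)));
        // e.g. {fast=1, quick=1, rapid=1, slow=-1, sluggish=-1} (order may vary)
    }
}
```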
[Figure 2: Bipolar structure in WordNet for the antonyms "fast" and "slow" (M. Hu & Liu, 2004b, p. 173). The figure shows two synonym clusters (swift, prompt, alacritous, quick, rapid around "fast"; dilatory, sluggish, leisurely, tardy, laggard around "slow") connected by an antonym link.]

As words may occur in multiple synonym sets, or synsets, Kim and Hovy attempted to consider this in their process by using statistical classification models (S.-M. Kim & Hovy, 2004), while Andreevskaia and Bergler introduced fuzzy sets of sentiment categories (Andreevskaia & Bergler, 2006). Distance measures have also been used to determine orientation (Kamps et al., 2004; Turney, 2002): Given two polar opposite words (representatives of their classes), a distance metric is employed to determine the relative closeness to each class. In (Kamps et al., 2004) a metric EVA based on Osgood's semantic differential (Osgood, Succi, & Tannenbaum, 1957) has been employed, using the words "bad" and "good" as class representatives and measuring the distance along a path in WordNet's relationship graph. Turney (Turney, 2002) based his measure on Church & Hanks' Pointwise Mutual Information (PMI) (Church & Hanks, 1989) and determined class membership by comparing the number of hits a search engine returned for queries searching for the co-occurrence of the unlabeled word with the representative words ("excellent" and "poor") in indexed documents.


EVA(w) = ( d(w, "bad") - d(w, "good") ) / d("good", "bad")

Formula 1: Kamps et al.'s EVA metric [25] (w is the unlabeled word)

SO(w) = log2( ( hits(w NEAR "excellent") · hits("poor") ) / ( hits(w NEAR "poor") · hits("excellent") ) )

Formula 2: Turney's semantic orientation measure [26]

[25] See (Kamps et al., 2004, p. 1116); w is the unlabelled word; d is a distance measure over a WordNet relationship graph.

[26] See (Turney, 2002, p. 422); w is the unlabelled word; hits denotes the number of documents returned in response to Altavista queries (with the NEAR operator). Positive values indicate positive orientation (closer to "excellent"), while negative values indicate negative sentiment orientation.

Centrality measures have also been used to determine how strong a membership in a category is (Andreevskaia & Bergler, 2006; Andrea Esuli & Sebastiani, 2007). Supervised ML techniques can also be employed for the final classification: In (Andrea Esuli & Sebastiani, 2005), supervised learners (NB, linear SVM, Rocchio) were trained on the seed lists and subsequently used to determine the final sentiment polarity of the discovered terms; while this work concentrated on polarity classification, a later work expanded the scope to include subjectivity classification (Andrea Esuli & Sebastiani, 2006b). Beyond lexical relations, which also include hyponyms and hypernyms (Andreevskaia & Bergler, 2006; Andrea Esuli & Sebastiani, 2005), WordNet glosses can also be used as a feature in determining sentiment orientation (Andreevskaia & Bergler, 2006; Andrea Esuli & Sebastiani, 2005). Glosses can be used in conjunction with a POS-tagger to disambiguate the meaning of words and thereby improve the accuracy of the classification of opinion words (Andreevskaia & Bergler, 2006).
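Turney's measure can be computed directly from the four hit counts; the sketch below hard-codes hypothetical hit counts where a search-engine query would be issued, and applies the usual guard of adding a small constant to avoid division by zero (names hypothetical).

```java
/** Turney-style semantic orientation from search-engine hit counts (illustrative sketch). */
public class SemanticOrientation {
    private static final double SMOOTHING = 0.01; // avoids division by zero for rare words

    /** SO(w) = log2( (hitsNearExcellent * hitsPoor) / (hitsNearPoor * hitsExcellent) ). */
    public static double so(double hitsNearExcellent, double hitsNearPoor,
                            double hitsExcellent, double hitsPoor) {
        double ratio = ((hitsNearExcellent + SMOOTHING) * hitsPoor)
                     / ((hitsNearPoor + SMOOTHING) * hitsExcellent);
        return Math.log(ratio) / Math.log(2);
    }

    public static void main(String[] args) {
        // Hypothetical hit counts for an unlabeled word such as "superb":
        double so = so(1200, 80, 500000, 450000);
        System.out.println(so > 0 ? "positive (" + so + ")" : "negative (" + so + ")");
    }
}
```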

2.4.2 Corpus-based approaches

Corpus-based approaches learn from syntactic and co-occurrence patterns of words in a text corpus (B. Liu, 2010; Tang et al., 2009), but may also use seed words (B. Liu, 2010), and often rely heavily on linguistic patterns to extract opinion words from the corpus (Hatzivassiloglou & McKeown, 1997; Tang et al., 2009). In essence, given a corpus of training documents, like a set of blog posts, corpus-based approaches attempt to determine class membership by applying rules or patterns to discovered relationships between the text elements. Central concepts that can be exploited are sentiment consistency and opinion context (interactions between features and subjects) (B. Liu, 2010).


Sentiment consistency is the immediately intuitive concept that two opinion words connected by a connective should exhibit a sentiment orientation in accordance with the type of the connection: Conjoint opinion words, illustrated by the phrase "happy and excited", should have the same orientation, while disjoint opinion words, as in "beautiful but difficult", should not (Hatzivassiloglou & McKeown, 1997; B. Liu, 2010). Examples of connectives in the English language other than "and" and "but" are "or", "either-or", and "neither-nor" (B. Liu, 2010). The concept of sentiment consistency can be applied on an intra-sentential level (within a sentence, as described above) or on an inter-sentential level (consistency between neighbouring sentences) (Kanayama & Nasukawa, 2006; B. Liu, 2010). In the inter-sentential context, a sentiment orientation usually does not change across a few consecutive sentences, while a change can be indicated by adversative expressions like "but" and "however" (Kanayama & Nasukawa, 2006; B. Liu, 2010). This expanded concept of inter- and intra-sentential consistency was introduced in (Kanayama & Nasukawa, 2006) and is referred to as context coherency. Extraction of opinion words and phrases relies on rules based on linguistic constraints and conventions on connectives (Hatzivassiloglou & McKeown, 1997; Kanayama & Nasukawa, 2006; B. Liu, 2010). Approaches that employ this technique should take care to account for violations of the base assumption (e.g. "beautiful but happy", clearly an uncommon but still possible co-occurrence with the potential to distort the attributed orientation) (B. Liu, 2010).

The approach can be applied in the following manner: In (Hatzivassiloglou & McKeown, 1997), conjunctions of adjectives were extracted from the corpus based on patterns. A log-linear regression model was then used to determine whether two conjoint adjectives are of the same or different orientation (to compensate for violations of the assumption). The output of this step is a graph with labelled edges between the adjectives, the labels being the hypothesized orientation relation. Adjectives are then clustered into two subsets based on orientation, finally labelling the group with the higher average frequency as positive.

Opinion context refers to the interaction of opinion sentiment with other properties of the opinion, such as sentiment-subject interactions (Qiu et al., 2009) or sentiment-subject-feature interactions (Ding et al., 2008). Approaches which exploit opinion context extract opinion words based on exploitable linguistic patterns which result from these interactions (Ding et al., 2008; B. Liu, 2010; Qiu et al., 2009). In (Qiu et al., 2009), opinion words and subjects were simultaneously extracted based on syntactic relationship patterns between them, using a set of eight extraction rules and a seed list comprising known opinion words and subjects. Because both opinion words and subjects were extracted, they termed their method double propagation (Qiu et al., 2009). This approach has been executed on a corpus of product reviews, following the observation that opinion words are almost always associated with [product] features (Qiu et al., 2009, p. 1199) (features = subjects). Polarities are assigned based on two rules which assume intra-review consistency (subjects, here product features, mentioned multiple times in a review are assumed to be associated with the same sentiment) and intra-corpus opinion-word-orientation consistency (opinion words are used to express the same opinion orientation across all reviews) (Qiu et al., 2009). The concept of double propagation has been applied in combination with the concept of coherence in (Ding et al., 2008), but the latter applied globally: If the orientation of an unknown opinion word cannot be discovered from the context of the current document, the orientation will be assigned based on the orientation learned from sentences in other documents in which the unknown opinion word is present along with the same feature and a connected known opinion word [27].

The use of classifiers based on sentiment lexicons comes with some caveats: The orientation of an opinion word in a particular sentence may differ from the orientation assigned to it in the lexicon, depending on context (Ding et al., 2008; B. Liu, 2010). The opinion word "good", for example, is neutral in the sentence "I am looking for a good health insurance" (B. Liu, 2010). In addition, many works only focus on the extraction of certain types of opinion words, such as adjectives and adverbs, while verbs and nouns may also carry sentiment (Ding et al., 2008). Corpus-based and dictionary-based lexicons also exhibit different qualities as a result of their particular creation techniques: Corpus-based approaches generally fail to identify all possible opinion words of a language, as it is unlikely that all words will be present in the selected training corpus (Ding et al., 2008; B. Liu, 2010). They are, however, good at discovering domain-specific opinion words and their orientations if the documents of the training corpus are all from a single specific domain (B. Liu, 2010). As a consequence, dictionary-based approaches may be too general. Features based on sentiment lexicons have been successfully used in a number of supervised classification approaches (Choi & Cardie, 2008; Mullen & Collier, 2004; Ohana & Tierney, 2009; Prabowo & Thelwall, 2009; Whitelaw et al., 2005).

[27] See (Ding et al., 2008, p. 236)
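The sentiment-consistency idea can be operationalized as label propagation over a graph of conjunction links: "and" edges copy an orientation, "but" edges flip it. The sketch below (hypothetical names, toy input, and without the statistical correction step of (Hatzivassiloglou & McKeown, 1997)) illustrates the basic mechanism.

```java
import java.util.*;

/** Orientation propagation over and/but conjunction links (illustrative sketch). */
public class ConsistencyPropagation {
    /** An edge between two adjectives; sameOrientation=false for adversative links ("but"). */
    record Link(String a, String b, boolean sameOrientation) {}

    public static Map<String, Integer> propagate(List<Link> links, Map<String, Integer> seeds) {
        Map<String, Integer> orientation = new HashMap<>(seeds); // word -> +1/-1
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Link l : links) {
                Integer oa = orientation.get(l.a()), ob = orientation.get(l.b());
                int sign = l.sameOrientation() ? 1 : -1;
                if (oa != null && ob == null) { orientation.put(l.b(), sign * oa); changed = true; }
                if (ob != null && oa == null) { orientation.put(l.a(), sign * ob); changed = true; }
            }
        }
        return orientation;
    }

    public static void main(String[] args) {
        List<Link> links = List.of(
            new Link("happy", "excited", true),         // "happy and excited"
            new Link("beautiful", "difficult", false)); // "beautiful but difficult"
        System.out.println(propagate(links, Map.of("happy", 1, "beautiful", 1)));
        // e.g. {happy=1, excited=1, beautiful=1, difficult=-1} (order may vary)
    }
}
```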

2.5 Evaluation of classification techniques

2.5.1 Evaluation Dimensions

Classification techniques and their implementations can be judged and compared along five dimensions (Han et al., 2006): accuracy, speed, robustness, scalability, and interpretability. (Taylor et al. (Taylor et al., 1994) list four dimensions which by and large agree in definition with (Han et al., 2006); differences are noted.) Accuracy refers to the ability of the classifier to correctly and reliably predict the class label of previously unseen and unlabeled data (Han et al., 2006; Taylor et al., 1994). Accuracy can be estimated on the basis of one or more sets of data which are independent of the dataset on which the classifier has been trained (Han et al., 2006). These datasets are the test data of the classifier (Han et al., 2006). Accuracy can be quantified with accuracy measures; typical measures used to assess the accuracy of a classifier are the accuracy measure, precision, recall, and the F1 measure (Han et al., 2006; Tang et al., 2009), which are listed in Table 1. It has been suggested that the accuracy measure is sufficient to evaluate the quality of a sentiment classifier (Tang et al., 2009).

Measure     | Formula                                         | Description
Accuracy    | (TP + TN) / N                                   | Sum of correctly classified instances divided by the number N of all instances.
Precision   | TP_c / (TP_c + FP_c)                            | Ratio of the number of correctly classified instances and the number of all instances classified into a category c.
Recall      | TP_c / (TP_c + FN_c)                            | Ratio of correctly classified instances and all instances of the category in the set.
F1 measure  | 2 · Precision · Recall / (Precision + Recall)   | Equally weighted (harmonic) mean of precision and recall.

Table 1: Accuracy measures
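The measures in Table 1 follow directly from the confusion counts of a binary classifier; a minimal sketch (names hypothetical):

```java
/** Accuracy, precision, recall, and F1 from binary confusion counts (illustrative sketch). */
public class AccuracyMeasures {
    public static double accuracy(int tp, int fp, int fn, int tn) {
        return (tp + tn) / (double) (tp + fp + fn + tn);
    }
    public static double precision(int tp, int fp) { return tp / (double) (tp + fp); }
    public static double recall(int tp, int fn)    { return tp / (double) (tp + fn); }
    public static double f1(int tp, int fp, int fn) {
        double p = precision(tp, fp), r = recall(tp, fn);
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // 80 true positives, 20 false positives, 10 false negatives, 90 true negatives:
        System.out.printf("accuracy=%.3f precision=%.3f recall=%.3f F1=%.3f%n",
                accuracy(80, 20, 10, 90), precision(80, 20), recall(80, 10), f1(80, 20, 10));
    }
}
```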


The next dimension, speed, expresses the computational cost of (a) training the classifier, as well as (b) the cost of it making a judgment (Han et al., 2006; Taylor et al., 1994) [28]. Speed can be assessed by measuring the runtime of the classifier in a reproducible and representative environment (system configuration and data sets) (Khuc et al., 2012). Robustness refers to the classifier's ability to make correct predictions when presented with noisy or incomplete data (Han et al., 2006). Noisy data, for example variations in spelling, may violate the assumptions of linguistic approaches and may cause them to fail (Dey & Haque, 2009). Thus, classifiers should also be tested against noisy data sets (if noisy data is likely to be encountered in the application context). Scalability denotes the classifier's ability to train efficiently given large amounts of data (Han et al., 2006). This can be assessed by evaluating how the other measures change if the amount of training samples changes, ceteris paribus (Khuc et al., 2012; J. Lin & Kolcz, 2012). Interpretability, or comprehensibility (Taylor et al., 1994), denotes the extent to which a human is able to extract (even by technical means) understanding and insight from the model of the classifier (Han et al., 2006; Taylor et al., 1994). For example, SVMs using linear mappings are easier to interpret than SVMs using non-linear mappings.

[28] The definition of speed in (Taylor et al., 1994) is limited to (a), while (b) is covered by the separate definition of "time to learn".

2.5.2 Comparative Evaluation

To preselect classification techniques for implementation in DOX, the collected literature has been assessed. It has been found that the surveyed works typically evaluate the quality of classifiers in the accuracy dimension, but rarely in the other dimensions. Thus, the selection of classification techniques is based on the assessment of the reported results in the accuracy dimension. This is considered sufficient, as the speed and scalability of a classifier depend on the implementation: the evaluation results along those two dimensions for a particular classifier type may vary depending on algorithm design and program implementation, and therefore cannot indicate the speed of a differing implementation of the same classifier (although the results can obviously be useful as benchmarks to compare implementations). To compare evaluation results in the accuracy dimension, experiments need to satisfy three conditions (Tang et al., 2009):

1. Evaluation occurred on the same collection (i.e. documents and categories are identical);
2. The split between training and test set is identical;
3. The same evaluation measure has been used, and if the evaluation was parameterized, the same parameters have been used.

Unfortunately, these three conditions are seldom satisfied. In the surveyed works, experiments vary in evaluation methodology, data sets, and the sentiment analysis process. In order to still be able to draw some conclusions, the three conditions have been loosened, and all works which satisfy the following conditions have been included in the survey:

- All works which use either accuracy, precision, recall, or the F1 measure to measure accuracy, with either random sampling (RS; i.e. a split into training and test set) or cross-validation (CV);
- All works which evaluate at least two classifier configurations;
- All works which use recognized data sets (accepted in the community, such as Pang et al.'s (Pang et al., 2002) IMDB data set) or a generally accepted methodology to generate data sets (like the emoticon trick). Data sets must consist of regular documents (and not only opinion words).

All works which satisfy these conditions have been compiled in Table 10. The methodologies of the evaluation experiments vary widely, and the data is therefore too heterogeneous to draw meaningful specific conclusions. But the data allows drawing more general conclusions, by using more robust measures for comparison and by transparently describing the circumstances which could potentially narrow the validity of the conclusions. The collected data set contains 118 records, collected from nine studies. The data set includes supervised as well as unsupervised classifiers; configuration parameters (if reported) have been included in the data set. The data set also contains a description of the training data sets (type, size; if reported) and of the evaluation methodology (RS, CV). The reported accuracy across all data sets and classification techniques spans an interval from 0.514 (barely better than chance) to 0.907, with a median of 0.81. This is based on the 87 records which recorded the accuracy measure. The largest subset (78 records) used reviews (product, movie, or other) as training and test sets. The accuracy values of this subset range from 0.564 to 0.902, with a median of 0.804. The other subset (9 records) used news articles as training/test data, but because 8 records of this subset were reported in one paper, it has limited analytic use standing alone. In a similar comparative survey, Tang et al. (Tang et al., 2009) report a median accuracy of 0.829 across a dataset of fifteen records from 11 studies across two data sets, with the best performer achieving accuracies of 0.883 and 0.937 on the two sets. Overall, it can be concluded that an accuracy of around 0.8 should be considered an achievable standard, while an accuracy of around 0.9 can be considered exceptional. In an analysis of the classifiers judged best in their respective papers, six classifiers relied on an SVM. In a direct comparison of classifiers (Pang et al., 2002), SVMs performed better than Naïve Bayes and Maximum Entropy. This is also validated by the analysis of Tang et al. (Tang et al., 2009). Following this analysis, the initial idea was to include a Naïve Bayes classifier and an SVM classifier in DOX: the former to provide a baseline for comparison, and the latter because, from the data at hand, it seems to currently be the best classification technique. However, due to the choice of Mahout as the machine learning subcomponent in the prototype, SVMs were not an option, as they have not been implemented in the current version (v0.7). A classifier has instead been implemented on the basis of Mahout's implementation of Logistic Regression, trained by Stochastic Gradient Descent. Logistic Regression is considered to approximate SVMs fairly well (Zhang et al., 2003) while being more efficiently trainable (Bottou, 2010).


3 Opinion Extraction

3.1 Opinion-Oriented Information Extraction

This chapter will explore the topic of opinion-oriented information extraction (OOIE), or opinion extraction (OE) for convenience. Both terms refer to information extraction problems which distinguish opinion-oriented information extraction as a special field of general information extraction (IE) (Pang & Lee, 2008; Tang et al., 2009). OE tasks encompass all tasks necessary to retrieve opinionated documents and extract opinion information from them (Pang & Lee, 2008; Tang et al., 2009). This chapter will therefore explore from where and how opinionated documents can be gathered, which preparatory steps can and need to be applied to the data in order to perform sentiment classification, and which pertinent information other than sentiment is contained in an opinion. The simplified process is illustrated in Figure 3 below:

[Figure 3: Simplified extraction process. Pipeline stages: Sources → Data Import → Preparation → Extraction → Postprocessing → Storage]

A collection component gathers opinionated documents from pre-determined sources. Raw documents are fed through a preparation stage which transforms the data into a format suitable for extraction. In the extraction stage, opinion information, sentiment orientation, and opinion context are extracted (in the form of data). Postprocessing transforms the generated data into a format suitable for analysis. Over the course of this chapter, the constituent elements of this process are introduced, refining this simple process step by step. At the end of this chapter an overall framework is presented.

3.2 Context Extraction

Before moving on to other matters, it is relevant to know what other items besides sentiment orientation can be extracted from the documents. If one is inclined to view the sentiment orientation of an opinion as the opinion's payload, its central content, then there exist a number of attributes which give the sentiment context (Kamvar & Harris, 2011). Important context attributes of an opinion are the opinion holder and the subject of the opinion (Kamvar & Harris, 2011; B. Liu, 2010; Pang & Lee, 2008; Tang et al., 2009). Other context attributes which may be extractable are the time of expression, the location, or attributes of the opinion holder (such as gender and age) (Kamvar & Harris, 2011; B. Liu, 2010).

3.2.1 Subject and subject-feature extraction

As an opinion is always expressed with respect to a subject (see Section 1.1), it is necessary to identify and extract this subject in order to properly attribute the opinion and the sentiment contained within (Binali et al., 2009; Nasukawa & Yi, 2003; Pang & Lee, 2008; Tang et al., 2009). A subject can also very well consist of a number of attributes or parts, which in this context are called features of the subject (Tang et al., 2009). To avoid confusion with the term "features", which refers to the input of a classifier, the term "subject-features" will be used in this thesis when discussing features of a subject. A subject-feature is a term that must satisfy one of the following relationships (Tang et al., 2009): (1) a part-of relationship with the subject, (2) an attribute-of relationship with the subject, or (3) an attribute-of relationship with a known feature of the subject. To illustrate, consider the following example sentence taken from (Tang et al., 2009, p. 10768) [29]: "I found the Sony Ericsson W810i is a good mobile phone. The keypad is easy to use and texting is very simple as the buttons are small but well defined. The screen is bright and clear with good resolution." In the example sentence the term "Sony Ericsson W810i" is the subject of this opinion, while "keypad" and "screen" are subject-features (Tang et al., 2009).

[29] Emphasis added

Subject extraction is very much necessary for detailed types of sentiment analysis like aspect-oriented sentiment analysis and comparative sentiment analysis. Aspect-oriented sentiment analysis identifies sentiment directed at subject-features and treats it separately, allowing a detailed presentation of how the overall sentiment toward a subject is influenced by the sentiment toward its constituent parts (Pang & Lee, 2008). This type of sentiment analysis is highly applicable in a marketing context, as it allows analyzing how customers perceive and judge features of a product. Comparative sentiment analysis attempts to identify relationships between subjects and analyses sentiment directed at those subjects in a way that allows comparative conclusions (Xu et al., 2011). In a hypothetical discussion between the fans of Apple and Android smartphones, a comparative analysis would have to identify both subjects and an opposing relationship; furthermore, it would present the discovered sentiments in a way which allows drawing comparative conclusions (e.g. "users are 20% more likely to express positive sentiment when talking about Android devices compared to Apple devices" would be an exemplary conclusion). Comparative analysis is therefore also very applicable to a marketing context (competitive intelligence) (Xu et al., 2011), but also to other competitive contexts, such as politics. In (Xu et al., 2011), a system is presented which is able to mine product reviews and extract products, their features, and associated opinions using conditional random fields (CRF (Lafferty, McCallum, & Pereira, 2001)). The system summarizes and visualizes the sentiments on a product/feature basis in a way which allows making comparative conclusions.
[Figure 4: Subject and Subject-Feature Extraction Workflow (Binali et al., 2009). Subject Extraction feeds Subject Sentiment Classification; Subject-Feature Extraction feeds Feature Sentiment Classification.]

Extraction of subjects depends on techniques which can parse text to identify the subject (B. Liu, 2010; Pang & Lee, 2008; Tang et al., 2009). Techniques that can be employed are simple heuristics (M. Hu & Liu, 2004a; Yi et al., 2003), relationship discovery (Popescu & Etzioni, 2005), probabilistic models (Lafferty et al., 2001; B. Liu, 2010; Mei, Ling, Wondra, Su, & Zhai, 2007; Xu et al., 2011), and association rule mining (M. Hu & Liu, 2004a; B. Liu, Hu, & Cheng, 2005). One can therefore divide the techniques into two classes: symbolic approaches, which require syntactic descriptions of terms, and statistical/ML approaches (M. Hu & Liu, 2004b). Attribution of sentiment towards subjects and subject-features can be based on vicinity (S.-M. Kim & Hovy, 2004), a lexicon of rules (or patterns) as in (Morinaga & Yamanishi, 2002; Popescu & Etzioni, 2005), or text parsing using NLP techniques like POS-tagging and syntactic parsing (Tang et al., 2009). While the former approaches allow subject identification to be performed separately from sentiment classification, a necessary condition for using these methods for filtering, subjects and sentiments may also be extracted jointly (Mei et al., 2007).


Full discussion of subject extraction is beyond the scope of this thesis. The interested reader is referred to (B. Liu, 2010; Tang et al., 2009), which contain more detailed discussions of the current state of subject identification.

3.2.2 Opinion holder extraction

For many applications, the attribution of opinions to opinion holders and the extraction of opinion holders is a relevant task (B. Liu, 2010). Opinion holders are often fairly easy to extract, especially if the documents are structured in some way, like newsgroup messages, discussion posts, blog posts, or documents retrieved with the Twitter API (Cheong & Lee, 2010; Dini & Mazzini, 2002; B. Liu, 2010). However, in some source-types, for example news articles, extraction is difficult because opinion holders are embedded in the opinion expression (B. Liu, 2010; Pang & Lee, 2008). News articles, and other such documents, also often contain opinions from more than one opinion holder (Pang & Lee, 2008). A number of approaches towards opinion holder extraction have been developed: Choi et al. (Choi, Cardie, Riloff, & Patwardhan, 2005) employ a hybrid approach combining a sequence-tagging approach using linear-chain CRFs with heuristic extraction patterns. CRFs also form the base of a subsequent paper, which attempts to extract opinion holders while simultaneously matching them to their respective opinion expressions (Choi, Breck, & Cardie, 2006). Kim and Hovy (SM Kim & Hovy, 2006) perform these tasks sequentially, first identifying subjective text and then attempting to attribute it to an opinion holder; their method for the latter task is a Maximum Entropy (ME) model to identify the most likely opinion holder. Identifying opinion holders also allows connecting sentiment to the context of the opinion holder. Context attributes of the opinion holder can be gender, age, location, or nationality (at the time of the expression) (Kamvar & Harris, 2011). Context can be attributed by parsing web pages (Dini & Mazzini, 2002) or by querying meta-data on opinion holders directly through APIs (Cheong & Lee, 2010). Location information can often be extracted or deduced (Kamvar & Harris, 2011). This allows comparing sentiment between opinion holders in different geographical areas (Kamvar & Harris, 2011). As stated before, location information can often be extracted by identifying attributes of the opinion holder.


3.2.3 Other metadata

Time is another extractable attribute; it is relevant to identify the time of expression, especially if one wants to compare sentiment dynamics over the course of time or between time frames (Dini & Mazzini, 2002; Kamvar & Harris, 2011).

3.3 Sources

Analysis of opinionated documents requires one or multiple sources for those documents. For the purpose of this thesis, a source is defined as a particular repository of (at least partially) opinionated documents, like a website or a micro-blogging service. Sources are not time-invariant and may add, remove, or edit documents over the course of time. Sources may also have certain additional properties, like an identifying label or name (such as a URL or a recognized name like Twitter), and are associated with one or multiple creators of content. Source-types in this context refer to sets of sources which exhibit certain shared properties and can be thought of as categories. Such shared properties may be writing style, degree of subjective content, or domain-orientation. Common source-types of opinionated documents in the surveyed works are:

Blogs: Websites which display postings by one or more individuals and which may also offer the option for other individuals to add comments; e.g. Wordpress [30]. Used in (Attardi & Simi, 2006; Godbole & Skiena, 2007; Harb, Planti, & Dray, 2008; Kobayashi et al., 2007; Ku, Liang, & Chen, 2006; Y. Liu et al., 2007; Melville, Ox, & Lawrence, 2009; Mishne & Glance, 2006; Mishne & Rijke, 2006a; Missen, Boughanem, & Cabanac, 2010; Tsai, Shih, Peng, & Chou, 2009).

Customer Feedback: Information from customers regarding satisfaction/dissatisfaction with a particular product/service. Can come in the form of oral surveys, emails, letters, or other web-based means. Used in (Dini & Mazzini, 2002; Gamon & Aue, 2005; Gamon, 2004; D. Lee et al., 2008; Oelke et al., 2009).

Message Boards/Forums/Newsgroups: Online discussion sites which allow individuals to exchange messages; e.g. Yahoo! Message Boards [31].

[30] http://wordpress.com/
[31] http://messages.yahoo.com/


Used in (Abbasi & Chen, 2007; Chauchat, Eric, Lumire, & Mends-france, n.d.; Sanjiv R Das & Chen, 2001; Dini & Mazzini, 2002; Kaiser & Bodendorf, 2009; N. Li & Wu, 2010).

Micro-blogging services: Similar to blogs, but postings are limited to a smaller size; e.g. Twitter [32]. Used in (Akcora et al., 2010; Bae & Lee, 2011; Bifet & Frank, 2010; Cheong & Lee, 2010; Diakopoulos & Shamma, 2010; Laniado & Mika, 2010; J. Lin & Kolcz, 2012; O'Connor et al., 2010; Pak & Paroubek, 2010; Parikh & Movassate, 2009; Thelwall, Buckley, & Paltoglou, 2011; Wang et al., 2011).

News items: News items are documents created by news organizations containing information about recent events; e.g. Reuters [33], New York Times [34]. News documents have been used in (Balahur & Steinberger, n.d.; Godbole & Skiena, 2007; Ku et al., 2006; Wanner et al., 2009; Yi et al., 2003).

Reviews: A review is a document which contains an opinion toward a certain entity, like a product or work of art; e.g. IMDB [35], Yelp [36]. Reviews have been used in (Chaovalit & Zhou, 2005; Dave & Lawrence, 2003; Denecke, 2008; Gamon & Aue, 2005; M. Hu & Liu, 2004a, 2004b; Kobayashi et al., 2007; C. Lin, Road, & Ex, n.d.; Mullen & Collier, 2004; Ohana & Tierney, 2009; Pang et al., 2002; Popescu & Etzioni, 2005; Prabowo & Thelwall, 2009; Turney, 2002; Xu et al., 2011; Yi et al., 2003; Zhuang, Jing, & Zhu, 2006).

Social Networks: Web services which allow the formation of social communities, often characterized by the existence of relationship graphs between users, profile pages, and methods of interaction. (Micro-blogging services and blogs can also be considered social networks if they fit the definition.) E.g. Myspace [37] or Youtube [38]. Used in (Bermingham et al., 2009; Cheong & Lee, 2010; Jamali & Rangwala,

[32] http://twitter.com/
[33] http://www.reuters.com/
[34] http://www.nytimes.com/
[35] http://www.imdb.com/
[36] http://www.yelp.com/
[37] http://www.myspace.com/
[38] http://www.youtube.com/


2009; Kamvar & Harris, 2011; N. Li & Wu, 2010; Prabowo & Thelwall, 2009; Zafarani, Cole, & Liu, 2010).

The selection of sources of opinionated documents is an important issue in opinion extraction. Sources have certain properties which influence their documents, such as stylistic differences and domain-orientation (Pang & Lee, 2008; Tang et al., 2009). Blogs often contain opinionated text from at least one opinion holder, but may contain opinions from multiple authors. A single blog is often devoted to a specific subject-domain (for example fine food, politics, or special issues like gun control) (Abbasi, Chen, & Salem, 2008).

Customer feedback collected from web surveys, on the other hand, is regarded as noisy, fragmentary, incoherent, and often quite short (Dey & Haque, 2009; Gamon, 2004). Content in which words tend to deviate from their accepted and recognized form makes classification harder (Dey & Haque, 2009; Gamon, 2004). Examples of deviations from the accepted common term "nice" are "niiice" (with i's added for emphasis), "n1ce" (colloquial), or "naice" (spelling error). Data is considered noisy if a large part of it contains non-subjective content or non-classifiable content (e.g. incoherent text) (Gamon, 2004). Noisy text data often contains spelling errors, uncommon abbreviations, uncommon use of capitalization, and grammatical errors (malformed sentences, incorrect punctuation, etc.) (Dey & Haque, 2009). The class distribution in customer feedback data also tends to be skewed toward instances of the negative class (Gamon, 2004), which, as discussed in Section 2.3, can have a negative impact on classifier training.

News articles can also contain subjective content (Balahur et al., 2009; Pang & Lee, 2008). But a news article may discuss multiple subjects at once and may contain opinions expressed by individuals other than the article's author (Pang & Lee, 2008). Typically, subjective text in news articles consists of quotes from individuals involved in the event (like witnesses or politicians) (Balahur et al., 2009; Pang & Lee, 2008). This elevates the importance of identifying opinion holders and attributing an opinion correctly to them (Pang & Lee, 2008).

Message boards, forums, and newsgroups are, like blogs, also often dedicated to a specific domain (such as finance (S. R. Das & Chen, 2007)) (Abbasi et al., 2008). They contain opinions of multiple authors, and it has been observed that sentiments in messages sent in reply to a preceding message are often antagonistic (Agrawal & Rajagopalan, 2003).


Reviews are often regarded as well-formed and coherent documents which contain at least one paragraph of text (Gamon, 2004). That is, however, not always given, and reviews may very well also be noisy, depending on the source (Dey & Haque, 2009). An advantage of reviews is that opinions can usually be easily attributed to one opinion holder (Pang & Lee, 2008). Reviews will usually focus on one particular subject (Pang & Lee, 2008), but identification of subject-features may be important (Pang & Lee, 2008). It has been noted that sentiment analysis of movie reviews is challenging (Abbasi et al., 2008). Suggested reasons for this phenomenon are the inclusion of lengthy plot summaries in reviews as well as the use of difficult-to-parse linguistic devices such as rhetoric and sarcasm (Abbasi et al., 2008). Subject-specific / domain-specific sentiment classification performs better than general approaches (Balahur et al., 2009; Dave & Lawrence, 2003; Pang & Lee, 2008). Cross-domain application of a classifier trained in one domain may fail in another: In (Balahur et al., 2009), training an SVM on the Emotiblog corpus and testing it on newspaper quotations did not produce satisfactory results (however, it should be noted that the authors did not compare the results to an SVM trained and tested on data from the same topic-domain, and used a large number of features, possibly leading to noise pollution).

3.4 Data Collection

An automated system requires a method to gather the necessary documents (Balahur et al., 2009). Two types of collection methods are frequently used: crawlers and Application Programming Interfaces (APIs) from service providers.

Crawlers

Crawlers are automated systems whose main objective is to "quickly and efficiently gather as many useful web pages as possible" (Manning et al., 2009, p. 443) in a process referred to as web crawling (Manning et al., 2009). Web crawlers parse web sites and their link structure, which enables them to discover new websites iteratively by following the links on a fetched page (Manning et al., 2009). A seed list of known URLs is required to start that process (Manning et al., 2009). Crawlers also commonly extract meta-data from fetched websites and parse their content (primarily for hyperlink extraction) (Manning et al., 2009). This makes it possible to push some necessary data cleaning into the crawler, for example stripping HTML tags and storing plain text in an accessible format (Dini & Mazzini, 2002). However, it must be ensured that no data necessary to extract relevant information gets purged before it has been extracted (Dini & Mazzini, 2002). Crawlers are therefore a useful tool for the import of web pages.


Crawlers have to meet a minimum set of requirements (Dini & Mazzini, 2002): If used in conjunction with (an) existing search engine(s), a crawler has to be able to (1) issue queries to the engine(s) and (2) fetch (download) the pages from the result list. If one wants to discover more content using the retrieved pages as a springboard, the crawler has to be able to follow the embedded link structure (Dini & Mazzini, 2002). The latter step is advised, because it has been observed that opinion websites often link to each other (Efron, 2004). But one should note that sites linking heavily to each other often share similar sentiments (Efron, 2004). This fact suggests that one is advised to start the crawling process with a balanced seed list (no prior expectation of a biased selection) to acquire a sample which reflects the real distribution of sentiment. Crawlers have been used in (Dave & Lawrence, 2003; Dey & Haque, 2009; M. Hu & Liu, 2004a) to fetch reviews. The We Feel Fine sentiment search engine (Kamvar & Harris, 2011) uses a crawler to fetch documents from a list of URLs which contains blogs, micro-blog feeds, and pages with public social network messages, applying different crawling methods depending on the source. In (Kobayashi et al., 2007), a search engine was used to automate the collection process for domain-specific training documents based on a seed word list; crawlers were used in (S. R. Das & Chen, 2007; Ding et al., 2009) to retrieve messages from web message boards, and in (Balahur et al., 2009) to automatically gather news items.
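A minimal crawler sketch follows: breadth-first fetching from a seed list with naive regex-based link extraction. A real crawler would use a proper HTML parser, politeness delays, and robots.txt handling; all names here are hypothetical.

```java
import java.net.URI;
import java.net.http.*;
import java.util.*;
import java.util.regex.*;

/** Breadth-first web crawler over a seed list (illustrative sketch, no politeness handling). */
public class CrawlerSketch {
    private static final Pattern LINK = Pattern.compile("href=[\"'](https?://[^\"']+)[\"']");

    public static void crawl(List<String> seeds, int maxPages) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Deque<String> frontier = new ArrayDeque<>(seeds);
        Set<String> visited = new HashSet<>();
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) continue; // already fetched
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            String html = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            // ... hand the raw document to the preparation stage here ...
            Matcher m = LINK.matcher(html); // follow the embedded link structure
            while (m.find()) frontier.add(m.group(1));
        }
    }

    public static void main(String[] args) throws Exception {
        crawl(List.of("https://example.org/"), 10);
    }
}
```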
APIs

Application programming interfaces (APIs) are sets of commands, functions, and protocols through which a service provider can allow third parties to access their services in a standardized way. A number of services which allow individuals to create opinionated content provide such APIs to allow third parties to query and retrieve said content and additional metadata. Examples are MySpace [39] and Twitter [40], both of which allow sending queries for specific data, provide access to the status stream, and provide access to public user metadata. In the surveyed works, Twitter's was the most frequently used API. It was explicitly used in (Bae & Lee, 2011; Bifet & Frank, 2010; Cheong & Lee, 2010; Diakopoulos & Shamma, 2010; O'Connor et al., 2010; Pak & Paroubek, 2010; Parikh & Movassate, 2009). Twitter provides two APIs of interest for sentiment analysis: the Streaming API (Twitter Inc., 2012a) and the REST API (Twitter Inc., 2012b). The Streaming API provides access to the real-time stream of Twitter status updates (Twitter Inc., 2012a). It provides access to three distinct streams: public streams, user streams, and site streams.

[39] http://developer.myspace.com/
[40] https://dev.twitter.com/

The public stream is the most relevant for opinion mining applications, as it provides access to the real-time stream of Twitter status updates (tweets) and provides methods to filter the stream by using filter predicates. One can therefore filter [41] the status stream to only retain status updates which come from specific users, include specific keywords, or originate from specific locations (Cheong & Lee, 2010). The communication protocol is HTTP, and answers are returned either in XML or JSON. The REST API includes Twitter's Search API, which provides an interface to query for specific items, like all tweets including a certain keyword. It should be noted that the Search API indexes (and therefore retrieves) only tweets up to nine days into the past (Twitter Inc., 2012c). It is also not designed to retrieve all tweets, but rather the most relevant ones (Twitter Inc., 2012c). The REST API also provides access to (public) user metadata (Twitter Inc., 2012b). The Twitter API can be employed to compile labeled training data by using the emoticon trick (see Section 2.3.3) (Bifet & Frank, 2010; J. Lin & Kolcz, 2012; Pak & Paroubek, 2010). An example of the operational use of the Twitter APIs is given in (Cheong & Lee, 2010): In an attempt to track civilian sentiment in response to catastrophic events (the application being crisis management in this case), they employ many functions of the APIs. The Trending Topics API (part of the REST API) is used to discover trending topics (most talked about). If one of those topics is identified to refer to a catastrophic event, the Streaming API is used to track the event in real-time, while past tweets regarding the topic are retrieved by using the Search API. The User API is used to provide meta-data. APIs can also be integrated into the crawler, as in (Kamvar & Harris, 2011), where source-specific APIs are used to retrieve content.

[41] For more, see https://dev.twitter.com/docs/api/1.1/post/statuses/filter
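The filter-predicate idea can be illustrated independently of the concrete service API: a predicate over user and keywords decides which status updates are retained. A minimal sketch (all names hypothetical; the actual Twitter endpoint takes such predicates as request parameters, see footnote [41]):

```java
import java.util.*;
import java.util.function.Predicate;

/** Keyword/user filter over a stream of status updates (illustrative sketch). */
public class StreamFilterSketch {
    record Status(String user, String text) {}

    public static Predicate<Status> trackKeywords(Set<String> keywords) {
        return s -> keywords.stream().anyMatch(k -> s.text().toLowerCase().contains(k));
    }

    public static Predicate<Status> followUsers(Set<String> users) {
        return s -> users.contains(s.user());
    }

    public static void main(String[] args) {
        List<Status> stream = List.of(
            new Status("alice", "The earthquake was terrifying"),
            new Status("bob", "Lunch was great"));
        Predicate<Status> filter = trackKeywords(Set.of("earthquake"));
        stream.stream().filter(filter).forEach(s -> System.out.println(s.user() + ": " + s.text()));
    }
}
```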

3.5 Preprocessing

Input data for classifiers can be subjected to a number of preparatory steps in order to improve the classifier's accuracy, scalability, and general efficiency (Han et al., 2006). This is referred to as data preprocessing, which typically encompasses the steps of data cleaning as well as data transformation and reduction. The quality of input data which contains noise or incomplete records can be improved through noise-reduction and missing-value-estimation techniques; these kinds of preprocessing steps are referred to as data cleaning (Han et al., 2006). Data may also be subjected to transformations and reductions (Han et al., 2006). While the above steps have to be executed on the data, relevance analysis refers to the process which reduces the set of features a classifier needs to fulfill its task, by selecting a smaller set of features without reducing the classifier's accuracy (much) as a trade-off for efficiency (Han et al., 2006).

3.5.1 Transformation

Stripping data format tags

Raw documents fetched by the import component usually contain mark-up such as HTML tags (websites) or are formatted in some kind of proprietary format (API responses), the latter often based on XML or JSON (see Section 3.4). The unnecessary mark-up has to be removed (S. R. Das & Chen, 2007; Dini & Mazzini, 2002). However, care should be taken to ensure that no relevant information is removed (Dini & Mazzini, 2002). Some symbols might indicate special items relevant for the context, such as mark-up indicating originators of a message or date/time information (Dini & Mazzini, 2002). In addition, some symbols might carry relevant sentiment information, for example emphasis tags in HTML (<strong> </strong>).

Annotating / Tagging

Some approaches enrich the documents with additional information (M. Hu & Liu, 2004a). This is usually accomplished by adding annotation marks, or tags, to the document (M. Hu & Liu, 2004a). Later steps can then translate the information into features (M. Hu & Liu, 2004a). Examples are the annotation of POS-classes (M. Hu & Liu, 2004a) or the annotation of subjects, subject-features, or opinion holders. POS annotation requires the use of a POS-parser, which classifies the words of sentences into POS categories (M. Hu & Liu, 2004b). An example sentence with annotated POS-tags, as it appeared in (M. Hu & Liu, 2004a, p. 2), is shown in Figure 5. The example sentence "I am absolutely in awe of this camera" is properly annotated. The tags <NG></NG> and <VG></VG> represent syntactic chunks, noun groups and verb groups respectively, while words are annotated with the tags <W></W>, which carry POS information as attributes. POS-tagging has also been used in (Abbasi et al., 2008; Bermingham et al., 2009; S.-M. Kim & Hovy, 2004). Other examples of additional information that can be annotated are named entities, like opinion holders and subjects (S.-M. Kim & Hovy, 2004), and valence shifters, most importantly negation (S. R. Das & Chen, 2007).


<S>
  <NG> <W C='PRP' L='SS' T='w' S='Y'> I </W> </NG>
  <VG> <W C='VBP'> am </W> <W C='RB'> absolutely </W> </VG>
  <W C='IN'> in </W>
  <NG> <W C='NN'> awe </W> </NG>
  <W C='IN'> of </W>
  <NG> <W C='DT'> this </W> <W C='NN'> camera </W> </NG>
  <W C='.'> . </W>
</S>

Figure 5: POS Annotation

Splitting / Chunking / Tokenizing

Plain text often needs to be divided into smaller text units in a process called splitting. If sentiment analysis is to be performed on the sentence level, splitting the document into sentences is a necessity. This can be done by using heuristics (such as splitting on punctuation marks), but it is better done by using a sentence splitter which utilizes a suitable language model. The sentence splitter of the OpenNLP toolkit is an example of a suitable splitter, as it is based on language-dependent NLP models. The process of splitting text into a set of words is called tokenizing (Dini & Mazzini, 2002). A tokenizer should also employ a suitable language model to appropriately split text into words. If words are split into groups with some semantic meaning, then this process is called chunking (Dini & Mazzini, 2002; M. Hu & Liu, 2004a). Syntactic chunking is the process which produces chunks based on syntactic rules, producing, for example, word groups consisting of verbs and nouns (M. Hu & Liu, 2004a).

Canonical Representation Reduction

A concept from IR is the reduction of words to a single canonical representation, a term. This is the process in which either a number of different inflections of a word are mapped to a representative of the group, or in which differing, uncommon spellings and errors are accounted for. Stemming and lemmatization are methods to reduce words to a base form, which maps multiple variants to one representation (Abbasi et al., 2008; Bermingham et al., 2009; M. Hu & Liu, 2004a). Fuzzy matching can be used to compensate for uncommon spellings and errors (M. Hu & Liu, 2004a). Another issue is the mapping of abbreviations to the correct term (S. R. Das & Chen, 2007).
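A heuristic baseline for splitting and canonical reduction is sketched below. A production system would use OpenNLP's model-based splitter and a real stemmer such as Porter's; the suffix rules here are deliberately crude, and all names are hypothetical.

```java
import java.util.*;

/** Heuristic sentence splitting, tokenizing, and crude suffix stemming (illustrative sketch). */
public class TextPreparation {
    public static List<String> splitSentences(String text) {
        return Arrays.asList(text.split("(?<=[.!?])\\s+")); // heuristic: split on punctuation
    }

    public static List<String> tokenize(String sentence) {
        return Arrays.asList(sentence.toLowerCase().split("[^a-z0-9']+"));
    }

    /** Crude stemming: strips a few common English suffixes (not a real stemmer). */
    public static String stem(String token) {
        for (String suffix : new String[] {"ingly", "edly", "ing", "ed", "es", "s"})
            if (token.endsWith(suffix) && token.length() > suffix.length() + 2)
                return token.substring(0, token.length() - suffix.length());
        return token;
    }

    public static void main(String[] args) {
        for (String sentence : splitSentences("The keypad is amazing. Texting works flawlessly!"))
            for (String token : tokenize(sentence))
                System.out.println(token + " -> " + stem(token));
    }
}
```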


Vector encoding

Classifiers require input in a certain format, most commonly in the form of feature vectors. An appropriate encoder translates documents into feature vectors. An encoder can, for example, encode term presence in a binary vector, or term frequencies in a real-valued vector.
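As an illustration, a minimal sketch of such an encoder follows; the class, the vocabulary handling, and the example vocabulary are illustrative assumptions, not part of a specific toolkit.

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of a feature-vector encoder. Given a fixed vocabulary
// mapping each term to a vector index, a document is encoded as a binary
// term-presence vector; a term-frequency variant differs in a single line.
public class TermPresenceEncoder {

    private final Map<String, Integer> vocabulary; // term -> vector index

    public TermPresenceEncoder(Map<String, Integer> vocabulary) {
        this.vocabulary = vocabulary;
    }

    public double[] encode(List<String> tokens) {
        double[] vector = new double[vocabulary.size()];
        for (String token : tokens) {
            Integer index = vocabulary.get(token);
            if (index != null) {
                vector[index] = 1.0; // term presence; use += 1.0 for term frequency
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<String, Integer> vocabulary = new HashMap<>();
        vocabulary.put("awe", 0);
        vocabulary.put("camera", 1);
        vocabulary.put("awful", 2);
        TermPresenceEncoder encoder = new TermPresenceEncoder(vocabulary);
        System.out.println(Arrays.toString(
                encoder.encode(Arrays.asList("awe", "of", "this", "camera")))); // [1.0, 1.0, 0.0]
    }
}

Which encoding is appropriate depends on the classifier used in later steps.

3.5.2 Filters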

Some data cleaning tasks are designed to remove noise from the input before it is fed to the classifier. This noise is irrelevant data which can have an adverse impact on the quality of the classifier or on its training speed. For the purpose of this thesis, the corresponding components are referred to as filters. A filter accepts input data and, based on a binary decision function, decides whether or not the data is passed on to the next step. Therefore, the output of a filter will have at most the size of the input. Filters can be applied on a document level, where unwanted documents are removed from the input; on a part or sentence level, where irrelevant parts of the documents are removed; or on a chunk or word level. Some filters will require a classifier as input for the decision function (Figure 6).
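A minimal sketch of this filter notion, here instantiated with a word-level decision function; all names are illustrative, and a classifier-based filter would simply wrap the classifier in the decision predicate.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Sketch of the generic filter: a binary decision function determines which
// records pass, so the output is at most as large as the input.
public class Filter<T> {

    private final Predicate<T> decision; // may wrap a full classifier

    public Filter(Predicate<T> decision) {
        this.decision = decision;
    }

    public List<T> apply(List<T> records) {
        return records.stream().filter(decision).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Word-level instance: drop a small illustrative set of stopwords.
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "a", "to"));
        Filter<String> stopwordFilter = new Filter<>(w -> !stopwords.contains(w));
        System.out.println(stopwordFilter.apply(
                Arrays.asList("awe", "of", "the", "camera"))); // [awe, of, camera]
    }
}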


Figure 6: Classifier-based filter

Stopwords

On a word level, the removal of irrelevant words is an option (M. Hu & Liu, 2004a). Words that frequently occur in both classes, due to their frequent use in language, are called stopwords. Their removal can be considered not to degrade the sentiment information in a document. Examples of stopwords are "the", "a", or "to". Stopword removal has been used in (M. Hu & Liu, 2004a).

Subjectivity

Subjectivity filtering in pre-processing is a reduction step. It is based on the assumption that not all documents fed to the classifier contain opinions, and that not all of the content of a document which contains opinions is opinionated text (Pang & Lee, 2008). The purpose of this pre-processing step is to increase the signal-to-noise ratio by identifying opinionated text and removing text that does not contain any sentiments from the classifier's input data (Tang et al., 2009). Pang and Lee have shown that the extracts retain sentiment information to a degree that is comparable to the unfiltered document (Pang & Lee, 2004). In the surveyed works, it has been shown that subjectivity filtering generally improves results for unsupervised and supervised approaches when employed (Balahur et al., 2009; Xia & Xu, 2007), which is also confirmed by (Mihalcea & Banea, 2007).
Figure 7: Sentiment classification with subjectivity filtering

Subjectivity filtering depends on subjectivity detection or, more accurately, subjectivity classification (Balahur et al., 2009; B. Liu, 2010; Mihalcea & Banea, 2007; Pang & Lee, 2008). It is therefore also a classification task, a problem to which a classifier can be applied (B. Liu, 2010). The categories a piece of text can assume are either subjective or objective (Balahur et al., 2009; Mihalcea & Banea, 2007), which allows the classification function to be defined as a mapping from a representation of the text in the form of a feature vector to the label space, which is the binary set {subjective, objective}. One distinguishes document-level, sentence-level, and phrase-level (sub-sentence level) subjectivity classification (Wilson et al., 2005). As it is a classification problem similar to sentiment classification, subjectivity classification can employ techniques similar to those described in section 2.2ff (Balahur et al., 2009; Tang et al., 2009; Xia & Xu, 2007). Early works found that the presence of adjectives highly correlates with subjectivity in text (Hatzivassiloglou & Wiebe, 2000). Wilson and Wiebe studied subjectivity extensively and compiled a lexicon of subjectivity clues which now contains over 8000 subjectivity clues (Wilson et al., 2005; Wilson, Wiebe, & Hoffmann, 2009). A rule-based approach to subjectivity detection based on this lexicon of subjectivity clues has been successfully employed in (Balahur et al., 2009). Subjectivity classification with supervised ML classifiers has been successfully demonstrated in (JM Wiebe et al., 1999; Xia & Xu, 2007). Subjectivity classification has been characterized as "more difficult" than subsequent polarity classification (Mihalcea & Banea, 2007, p. 977), which can be seen as a reason why sentiment classification generally benefits from preceding subjectivity filtering (Mihalcea & Banea, 2007; Pang & Lee, 2008). As a full review of subjectivity classification is out of the scope of this thesis, the interested reader is encouraged to consult (Tang et al., 2009).
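To make this step concrete, a minimal sketch of a rule-based subjectivity filter in the spirit of the lexicon-based approach above; the clue-counting rule and the threshold are illustrative simplifications, not the exact rules of the cited works.

import java.util.List;
import java.util.Set;

// Illustrative sketch: a sentence is labelled subjective if it contains at
// least `threshold` clues from a subjectivity lexicon; objective sentences
// (false) would be removed from the classifier's input.
public class SubjectivityFilter {

    private final Set<String> subjectivityClues; // e.g. loaded from a clue lexicon
    private final int threshold;

    public SubjectivityFilter(Set<String> subjectivityClues, int threshold) {
        this.subjectivityClues = subjectivityClues;
        this.threshold = threshold;
    }

    public boolean isSubjective(List<String> sentenceTokens) {
        long clues = sentenceTokens.stream()
                .filter(t -> subjectivityClues.contains(t.toLowerCase()))
                .count();
        return clues >= threshold;
    }
}

3.5.3 Subject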

The identification of a subject (topic) can also aid in filtering the input: The base assumption that all documents fed to the classifier are relevant to the subject of interest, possibly because a search engine was used to compile documents mentioning the topic, may in some cases not hold true (Pang & Lee, 2008). Therefore, filtering on a document level and excluding documents which do not contain the subject can reduce the workload of the classifier and remove unwanted documents from consideration (Pang & Lee, 2008). Analogous to subjectivity filtering, this process can be referred to as subject filtering42. Documents may also contain opinions on multiple subjects (Pang & Lee, 2008). One can easily imagine a discussion between lovers of Apple and Android smartphones and the resulting exchange of sentiments regarding those devices. An investigator only interested in the sentiment toward Android smartphones needs to take care to distinguish between sentiments based on their subject to avoid false attribution (Pang & Lee, 2008; Tang et al., 2009). As a matter of fact, passages only discussing Apple phones may be irrelevant. In this case, such off-subject passages (or sentences) can also be removed from the input to the classifier by subject filtering on a paragraph or sentence level (Pang & Lee, 2008).
42 Or topic filtering (Pang & Lee, 2008)


3.6 Opinion Mining Framework

Taken together, a more specialized framework for opinion mining applications emerges (Figure 8). The typical system is divided into three modules: collection, analysis, and presentation. These modules are connected through a process flowing downward from the collection module to the presentation module. The process flowing through the analysis module (the analysis process) is divided into three subordinate processes: preprocessing, extraction, and postprocessing.


Figure 8: Opinion Mining Framework

The process is executed in the following way: The data collection module is tasked with retrieving documents from a set of sources. These sources are either predefined, or discovered through a crawling process. A data import module will most likely include a crawling engine to fetch webpages and/or a mechanism which utilizes provided APIs to gather opinionated content.


Fetched documents will be passed on to the analysis process. The analysis process will consist of a number of processing steps which are preprocessing, extraction, or postprocessing steps. The hierarchy of generic processing steps is shown in Figure 9. Preprocessing and postprocessing steps are defined by their relative position to an extraction step: Preprocessing steps are transformative steps preceding an extraction step, while postprocessing steps are transformative steps following an extraction step.


Figure 9: Common generic process steps

Preprocessing steps will clean, reduce, and transform the gathered data into a suitable input format for one or many classifiers. Depending on their task, preprocessing steps can be divided into filters and transformations. Filters reduce the input data by removing unwanted parts of the input (e.g. stopwords or objective content). Transformations perform transformative actions on the input and can be divided into tokenizers, vectorizers, and converters. Tokenizers split input records into one or more output records based on a pattern; for example, a word tokenizer will split input records into output records each containing a single word of the input record. Vectorizers create a vector representation of the input by encoding it appropriately for the classifier. Vectorizers therefore perform a mapping from text space to a multidimensional quantitative space. Converters transform input records into another form, for example by adding annotations like POS tags.

Preprocessed data is then consumed by extraction steps. These are steps which identify certain characteristics in the input data, like sentiment orientation or named entities like opinion holders, and extract them. They produce the extracted data as output. Extraction steps common in opinion mining applications are opinion holder extraction, subject and subject-feature extraction, metadata extraction, and of course sentiment orientation extraction (the sentiment classifier). After all necessary data items are extracted, a postprocessing component combines the data into a format suitable for storage, analysis, and presentation. Diversions from this linear process flow are possible: Some preprocessing steps may be dependent on certain extracted data or classifier outputs, like the subjectivity filter or the subject filter, creating an optional feedback into the preprocessing process.

Please note that all classification and extraction components are independent from one another in the framework presented in Figure 8. In this case, they can be executed logically separate from each other. However, some works have discovered advantages in performing some extraction tasks jointly (C. Lin et al., n.d.; Qiu et al., 2009). In this case, a joint classifier/extractor will replace the separate ones.

Concluding, the presented framework formulates a design of an opinion mining system consisting of three components: collection, analysis, and presentation. It formulates a strict flow of data through these three major modules. While these can be assumed to be fairly static elements of design, the framework allows the formulation and design of configurable analysis processes by imposing a general hierarchy on the analysis steps and by defining generic roles for these steps, while allowing these steps to be freely configured according to their roles. This allows analysts to define many valid analytic workflows, which is believed necessary as the field is evolving and no optimal technique is known. An advantage of this framework is the abstraction from any implementation. The framework can be used to define logical processes without regard to implementation concerns, while an implementer can build a system guided by this framework. If the system implements the components of the framework, it will be able to execute the defined processes. For demonstration purposes, DOX has been implemented on this premise (Chapter 6).
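Before moving on, the generic step hierarchy of Figure 9 can be summarized as a set of interfaces. The following is a minimal sketch; all names are illustrative, and the actual DOX types (Chapter 6) may differ.

import java.util.List;

// Sketch of the generic step hierarchy of Figure 9.
interface ProcessStep<I, O> {
    List<O> process(List<I> records);
}

// Preprocessing roles: filters reduce the input, transformations reshape it.
interface FilterStep<T> extends ProcessStep<T, T> {}            // |output| <= |input|
interface TokenizerStep<I, O> extends ProcessStep<I, O> {}      // one record -> n records
interface VectorizerStep<I> extends ProcessStep<I, double[]> {} // text space -> vector space
interface ConverterStep<T> extends ProcessStep<T, T> {}         // e.g. adds POS annotations

// Extraction steps identify characteristics (sentiment orientation,
// opinion holders, subjects, metadata) and emit the extracted data.
interface ExtractionStep<I, E> extends ProcessStep<I, E> {}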


4 Tracking, Diffusion, and Summarization

4.1 Overview

This chapter will discuss the analysis of the dynamics and diffusion of opinions and the representation of opinions in summarized form. This will be the foundation of an analysis proposal designed to identify opinions as dynamic entities and represent them accordingly.

4.2 Diffusion

In the preceding chapters, the extraction of opinions and their classification into sentiment orientation classes have been discussed. There, the focus was on how the current state of opinions in a document corpus can be analyzed and represented. However, opinions emerge, get shared and discussed, and fade from view. An interesting research question is then to further investigate these dynamic properties of opinions. Can the lifecycle of opinions be tracked? Can patterns be predicted? Can the propagation from one opinion holder to another be predicted? Research in this direction concentrates on three areas: analysis of the social network and the actions between the individuals in this network, analysis of temporal properties of opinions, and the identification of semantically identical opinions with limited or no explicit identification features.

4.2.1 Social and action structure

The interaction between individuals has been used to study the propagation of opinions, sentiments, news, and actions in social networks and blogs (Cai et al., 2011; Goyal, Bonchi, & Lakshmanan, 2010; Lerman & Ghosh, 2010; D. Li et al., 2012; Zafarani et al., 2010). A central role in these approaches is taken by the social graph (Goyal et al., 2010; Lerman & Ghosh, 2010), a representation of the network of individuals as a graph. In the social graph, individuals are modelled as nodes while relationships between the users are modelled as edges (Bodendorf & Kaiser, 2009; Goyal et al., 2010; Kaiser & Bodendorf, 2009). In many cases, the social graph is an easily extractable part of the analysed service, as users explicitly confirm a relationship (Lerman & Ghosh, 2010), e.g. a user A follows a user B on Twitter or accepts a friend request on Facebook. In other cases, the social graph can be created by analysing observable interactions, like answering a message on a forum (Bodendorf & Kaiser, 2009) or linking to a blog entry of another user in one's own (Song, Chi, Hino, & Tseng, 2007; Zafarani et al., 2010).


Works in this vein usually attempt to quantify the likelihood that an action (e.g. posting a message with certain content) or a property (like sentiment orientation) is propagated along the social relationships of a user (Goyal et al., 2010; Zafarani et al., 2010). This is referred to as influence, in the sense that user A can influence user B, to whom he is connected, to perform a certain action (Goyal et al., 2010). In this regard, the identification of opinion leaders is of special interest. Opinion leaders are individuals who have a large influence on a number of others (Bodendorf & Kaiser, 2009; Song et al., 2007; Zeng, 2009).

4.2.2 Temporal Analysis

Some works have explored how the variable time influences sentiment orientation, using statistical methods applied to time series of measures of interest, like the counts of documents classified as belonging to a sentiment orientation class at a given point in time (Marques, Labra, & Rocha, 2012; Mishne & Rijke, 2006a, 2006b). The motivation behind these works is to improve the prediction of future values in the time series (Mishne & Rijke, 2006b), to identify patterns and trends (Marques et al., 2012; Mishne & Rijke, 2006a), and of course to identify abnormal or unexpected behaviour (Mishne & Rijke, 2006a). In (Mishne & Rijke, 2006b), the authors estimated the intensity, or frequency, of sentiment orientation classes in a blog corpus in a given time interval using a regression model which included, besides the frequencies of terms and the total number of posts (both per interval), also the hour of day as a feature. In their work, they discovered that the intensity of some sentiment orientation classes in fact exhibits seasonality and that the variable time can contribute to predicting their frequency, e.g. an elevation of the frequency of the class "excited" on weekends compared to weekdays. An implementation of their work, MoodViews (Mishne & Rijke, 2006a), visualizes time series of intensity levels (Figure 14, p. 61) as plotted charts and allows users to visually analyse the temporal dynamics. In addition, users can explore terms associated with a certain interval. The module MoodSignals uses statistical frequency comparisons and burstiness models to identify words and phrases which show unique behaviour for the selected time interval.

4.2.3 Distinguishing opinions semantically

A much harder task is to identify semantically similar opinions, or at least to identify whether the prevailing opinion has changed in some important way. The advantages of such an approach, however, are quite obvious: Without the requirement of explicit indicators, meaningful larger opinion patterns could be identified in document collections without an explicit relationship like a link structure between documents or message/reply patterns. This direction of research clearly shares some common terrain with topic detection and tracking (TDT), but it also has to include the analysis of sentiments. One interesting aspect is to determine the points in time when the observed opinion changes; these are called breakpoints (Akcora et al., 2010; P. Hu et al., 2011). Observing these changes allows analysts to identify trends in the stream of observable opinions and allows drawing conclusions toward extrinsic factors and events which led to these changes (Akcora et al., 2010).

Definition43: Given a stream of documents D = {d_t | t ∈ {1, …, n}}, a breakpoint is a point in time t ∈ {1, …, n} at which a decisive change or significant development occurs, for example the beginning, breaking, or ending of key events.

43 Adapted from (P. Hu et al., 2011, p. 261)

An application of the breakpoint approach to opinionated documents is the work of Akcora et al. (Akcora et al., 2010). The authors set out to identify breakpoints in twitter status streams. Given a stream of twitter status updates, they divide the stream into discrete, equally sized time frames, and tweets which have been published within a time frame are associated with it. A breakpoint has occurred between two time frames if two conditions have been met: the emotion pattern and the word pattern have both changed. In their work, the two conditions applied together proved to be more reliable in eliminating false positives.

The authors encode the sentiment orientation of twitter status updates in an eight-dimensional Boolean vector, where each position corresponds to an emotion class (Anger, Sadness, Love, Fear, Disgust, Shame, Joy, and Surprise) and assumes the value of 1 if the status update is classified as belonging to that class and 0 otherwise. A centroid vector c_t computed over all status updates in a time interval I_t represents the emotion pattern of this interval. The similarity of two intervals' emotion patterns is the cosine similarity of their centroid vectors (Formula 3):

sim_E(I_i, I_j) = (c_i · c_j) / (|c_i| |c_j|)

Formula 3: Emotion pattern similarity between two intervals

The word pattern of an interval is represented through a set space model. The word set W_t representing an interval I_t is the union set of the words in the status updates of that interval, after stemming and removal of stopwords. The similarity of two intervals' word patterns is the Jaccard similarity of the two word sets (Formula 4):

sim_W(I_i, I_j) = |W_i ∩ W_j| / |W_i ∪ W_j|

Formula 4: Word pattern similarity between two intervals

A break is identified in interval I_t if, for both the emotion pattern similarity and the word pattern similarity, the following two conditions hold:

(1) The similarity between the preceding two intervals is greater than the similarity of the current interval and its preceding interval:

sim(I_{t-2}, I_{t-1}) > sim(I_{t-1}, I_t)

(2) The similarity between the current interval and the following interval is greater than the similarity between the current interval and the preceding interval:

sim(I_t, I_{t+1}) > sim(I_{t-1}, I_t)
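A compact sketch of this breakpoint test follows; the class and method names are this document's illustration of the formulas above, not code from Akcora et al.

import java.util.HashSet;
import java.util.Set;

public class BreakpointDetector {

    // Formula 3: cosine similarity of two interval centroid vectors.
    static double emotionSimilarity(double[] c1, double[] c2) {
        double dot = 0, n1 = 0, n2 = 0;
        for (int i = 0; i < c1.length; i++) {
            dot += c1[i] * c2[i];
            n1 += c1[i] * c1[i];
            n2 += c2[i] * c2[i];
        }
        return dot / (Math.sqrt(n1) * Math.sqrt(n2));
    }

    // Formula 4: Jaccard similarity of two interval word sets.
    static double wordSimilarity(Set<String> w1, Set<String> w2) {
        Set<String> intersection = new HashSet<>(w1);
        intersection.retainAll(w2);
        Set<String> union = new HashSet<>(w1);
        union.addAll(w2);
        return (double) intersection.size() / union.size();
    }

    // Conditions (1) and (2) for one similarity measure: the similarity
    // drops at interval I_t and recovers afterwards.
    static boolean isBreak(double simPrev2Prev1, double simPrev1Curr, double simCurrNext) {
        return simPrev2Prev1 > simPrev1Curr && simCurrNext > simPrev1Curr;
    }
}

A break is declared in interval I_t only when the test holds for both the emotion pattern similarity and the word pattern similarity.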

4.2.4 Open research questions

Research into the dynamics and diffusion of opinions is an important next step in the analysis of opinions in general. It has been shown that this can contribute to a better understanding and can lead to improved predictions as well as the discovery of certain patterns (like events). Two interesting observations can however be made:
- Most works apply their methods to a single source type or source (e.g. social networks like Twitter, or blogs like LiveJournal), and
- The three research areas are often viewed in isolation, as works concentrate on a particular aspect.

Regarding the first point: While it is natural to concentrate on a single source type or source to validate the respective approaches, as documents of a single source or source type will be more homogeneous in many aspects, a logical extension would be to extend these approaches toward being able to analyse documents from multiple source types in order to discover interaction dynamics between them. In addition, as most works focus on one of the detailed research areas, it seems that there are possibilities to combine techniques.


Based on these intuitions, an approach which uses a technique to identify semantically similar opinions could be used to discover an interaction network between multiple source types without explicit signs of interaction (like links, for example). This would in turn allow the techniques of the section on social and action structure (4.2.1) to be applied to the interaction graph to measure the influence of sources on other sources. In an attempt to provide the first step toward such a system, the ideas described in 4.2.3 regarding the identification of semantically similar opinions are taken up and extended to identify similar opinions in a multi-source and multi-source-type environment.

4.3 Summarization

Opinion summarization is the task of generating a concise and digestible summary of a (large) number of opinions (H. Kim, Ganesan, Sondhi, & Zhai, 2011). Its aim is to present an aggregated representation of the underlying opinionated document(s) (Pang & Lee, 2008). An opinion summary communicates the important aspects of the summarized document(s) to the reader, thereby allowing readers to be aware of the contents of document collections that are too large for a human to read (H. Kim et al., 2011; Pang & Lee, 2008; Tang et al., 2009). While other works often consider visualizations as part of opinion summarization (H. Kim et al., 2011; B. Liu, 2010; Pang & Lee, 2008), this section will only discuss the textual summarization of opinions, while visualization will be discussed in the next section (4.4).

Summaries can in general be characterized by their scope, i.e. whether only one document is being summarized or a collection of documents, and by their level of detail, i.e. whether the summary is broad or includes summaries of subject-features (H. Kim et al., 2011; Pang & Lee, 2008). On a general level, a distinction is made between statistical summaries and purely textual summaries (H. Kim et al., 2011; Pang & Lee, 2008). Statistical summaries present a summary in the form of statistical measures computed over the underlying documents (H. Kim et al., 2011; Pang & Lee, 2008). An example of a simple statistical summary would be the count of positive and negative documents (M. Hu & Liu, 2004b; H. Kim et al., 2011). Possible measures are:
- absolute and relative counts/frequencies of documents in sentiment classes (M. Hu & Liu, 2004b; H. Kim et al., 2011; Pang & Lee, 2008);
- mean, mode, median, and quantiles (Gregory et al., 2006; Pang & Lee, 2008).

Statistical summaries are not very difficult to generate, as the statistical measures can be computed from the output of the sentiment classifier (H. Kim et al., 2011). Statistical summaries are excellent for summarizing the general sentiment (H. Kim et al., 2011).
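To make this concrete, the following is a minimal sketch which reports a relative class frequency together with a measure of certitude; the Wilson score interval used here is one illustrative choice, not a measure prescribed by the cited works, and it anticipates the caveat discussed in the next paragraph.

public class SummaryStats {

    // 95% Wilson score interval for a class observed k times in n documents.
    static double[] wilsonInterval(int k, int n) {
        double z = 1.96, p = (double) k / n;
        double denom = 1 + z * z / n;
        double center = (p + z * z / (2 * n)) / denom;
        double margin = z * Math.sqrt(p * (1 - p) / n + z * z / (4.0 * n * n)) / denom;
        return new double[] { center - margin, center + margin };
    }

    public static void main(String[] args) {
        // 75% negative out of 1000 documents vs. out of only 4 documents:
        System.out.println(java.util.Arrays.toString(wilsonInterval(750, 1000))); // ~[0.722, 0.776]
        System.out.println(java.util.Arrays.toString(wilsonInterval(3, 4)));      // ~[0.30, 0.95]
    }
}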


However, one should take care to select measures which faithfully represent the underlying distribution of sentiments (Pang & Lee, 2008). If the distribution of sentiment is skewed (for example biased towards positive sentiments) or controversial (with sentiments concentrated in the extremes), the mean is not a suitable measure to communicate this (Pang & Lee, 2008). It is also better to report the relative frequency distribution over classes, as this is easier for readers to comprehend than absolute counts (Cabral & Hortacsu, 2004; Pang & Lee, 2008). If one does so, these measures should be accompanied by a measure of certitude to communicate how much evidence is available to support the assertion (Pang & Lee, 2008). For example, a split between 25% positive and 75% negative sentiments has more descriptive power if it is based on 1000 documents rather than on only 4.

Textual summaries are generated summaries in natural-language text (H. Kim et al., 2011; Pang & Lee, 2008). They are considered more informative than statistical summaries, as they allow the reader to gain more insight into the underlying opinions (H. Kim et al., 2011). Automatic summarization techniques employed in the summarization of opinionated documents are derived from work on topic-based text summarization and are therefore quite similar; the crucial difference is that sentiment orientation influences the summarization process (H. Kim et al., 2011; Pang & Lee, 2008). Techniques are distinguished into two classes: extractive and abstractive summaries (H. Kim et al., 2011). Extractive summaries are generated by selecting representative text segments from opinion documents and combining them into a summary (H. Kim et al., 2011). Abstractive summaries, on the other hand, employ NLP techniques to generate summaries from the analysis results without using existing segments (H. Kim et al., 2011). Due to the inherent difficulty of the latter approach, extractive summaries are the prevailing approach in current works (H. Kim et al., 2011).

As extractive summaries summarize the underlying documents by selecting representative segments of text, the unit of selection has to be determined in advance. Previous works have attempted to select representative documents (Kawai, Kumamoto, & Tanaka, 2007; Ku et al., 2006), sentences (Eguchi & Kando, 2006; Ku et al., 2006; Mei et al., 2007; Seki, Eguchi, Kando, & Aono, 2005), or words and phrases (Lu, Zhai, & Sundaresan, 2009; Popescu & Etzioni, 2005; Titov & Mcdonald, 2008). Typically, an ordered list of the extracted segments is generated, ordered by a superimposed ranking measure and cut off at a pre-determined threshold (e.g. the top 10 positive and negative sentences). Lists are often divided into sentiment classes, e.g. a list of positive and negative segments (M. Hu & Liu, 2004b; Ku et al., 2006) in the binary case. An order can be imposed by sorting the extracted segments by frequency (M. Hu & Liu, 2004b; Lu et al., 2009) or degree of subjectivity (Ku et al., 2006). In (Eguchi & Kando, 2006; Seki et al., 2005), clustering was used to group similar paragraphs together based on the Euclidean distance of their term frequency (TF) vectors. The sentences in each cluster were then ranked by similarity with an overall weighted feature vector of the cluster, which included sentiment orientation measures, and the most similar sentence was selected to represent the cluster.

4.4 Visualization

Using graphic elements to present the results of sentiment analysis in a visual manner is a common technique which can be used to complement statistical and textual summaries or to stand alone (H. Kim et al., 2011; Pang & Lee, 2008). Visualization of analysis results can allow readers to intuitively grasp the results at a glance and can increase the readability of the results, therefore making them more accessible (H. Kim et al., 2011). The visualization techniques used in the surveyed works can be distinguished into three classes based on which aspect of the analysis they visualize most effectively. The three classes are the visualization of states, of relationships, and of time series.

4.4.1 Visualization of states

Sentiment analysis often starts with a specific query scope, like "How is the sentiment orientation towards Product X distributed in today's twitter status updates?". This defined scope will produce analysis results which describe the state of the underlying data, often in the form of statistics like "437 positive status updates and 382 negative ones" or "53% positive". Visualization of states in this sense refers to the visualization of the results of a sentiment analysis with a clearly defined scope, disregarding relationships and temporal dynamics. Visualizing static results is probably the simplest and most straightforward way to present analysis results, as in most cases only the statistical measures calculated in the analysis have to be presented in a suitable fashion like bar or pie charts (Pang & Lee, 2008). Pang and Lee distinguish between the visualization of bounded and unbounded summary statistics (Pang & Lee, 2008). Bounded statistics are statistical measures whose values are constrained to a determined interval, like locality measures of constrained variables (e.g. the average rating or sentiment orientation in a system of a fixed number of classes or rating scales) or relative values (like the percentage of negative opinions) (Pang & Lee, 2004). Unbounded statistics are analogously defined as statistical measures which can assume values not constrained by upper or lower bounds, like counts (Pang & Lee, 2008). Bounded statistics, by virtue of their bounds, can be visualized using a pre-determined mapping of the values to visual attributes like size, position, or color shading (Pang & Lee, 2004).


Proportions of relative positive and negative sentiments can be visualized by a stacked bar (thermometer) (Dey & Haque, 2009; Pang & Lee, 2008) or a pie chart (Yerva, Jeung, & Aberer, 2012), in which the size of the differently shaded partial pieces corresponds proportionally to the relative value of positive and negative sentiments. The pieces add up to the whole. This can of course be extended to n-ary sentiment classification schemes (Pang & Lee, 2008; Yerva et al., 2012), as demonstrated in (Yerva et al., 2012), which uses a special pie chart, called a mood map, to visualize the estimated probabilities of the distribution of twelve emotions (Figure 10). The visualization also displays multiple states; the concentric circles represent different days, from Monday (outer circle) to Sunday (inner circle).

Figure 10: Mood-Map (Yerva et al., 2012, p. 2499)

Gregory et al. (Gregory et al., 2006) use rose plots, a graph type similar to the pie chart, to visualize the average percentage of affective words in a document collection; the words are divided into eight affective classes (Figure 11). Gregory's rose plots are also adapted to visualize the median and quantiles of the distributions. An advantage of this visualization is that it saves space by arranging the petals (isosceles triangular shapes with a rounded base) around the center while simultaneously showing measures in a consistent relative location, allowing users to compare rose plots more easily. Rose plots are also useful to visualize unbounded statistics (Pang & Lee, 2008), as the petals are not subject to the semantic constraint of forming a unified body, but are interpreted more along the lines of bars in bar charts or histograms.

Word clouds, a hybrid of textual summaries and visualizations, display a selected set of words which are colored and sized according to functions which map summary statistics to these attributes. In OpinionCloud (Potthast & Becker, 2010), words in web comments are sized proportionally to their relative frequency in the comments while their color is determined by the sentiment orientation of the word (green for positive, red for negative, and grey for neutral). In addition, positive and negative comments are divided and synthesized into different tag clouds, their size proportional to the number of comments in their class. This visualization method allows users to roughly identify the overall sentiment at a glance by the size of the tag clouds, as well as to identify important words in the comments.

Figure 11: Rose Plot (Gregory et al., 2006, p. 26)

Using different color shadings is another possible way to encode bounded statistics (Pang & Lee, 2008). This method can save space when multiple items have to be visualized in a limited area, or when other visual dimensions like size or position are already reserved for other variables (Pang & Lee, 2008). A number of works employ color coding to encode sentiment orientation (and intensity) by mapping a statistic value to a specific color value: Gamon et al. (Gamon & Aue, 2005) map the average sentiment orientation to a color-shade continuum ranging from red for negative, over white for neutral, to green for positive average sentiments; Oelke et al. (Oelke et al., 2009) use a mapping to eight nominal color categories based on the intensity of positive and negative comments, while additionally encoding the overall count of documents in the size of a rectangle. Sentiment values can also be encoded in the position of graphical elements. In (Wanner et al., 2009), documents are placed higher on the vertical axis if they are positive, lower if they are negative, while neutral documents are placed on the baseline.

Raw counts or frequency distributions of sentiment orientation classes, which are unbounded, are often visualized by means of bar charts, histograms, or rose plots (Pang & Lee, 2008). These visualization techniques allow comparing the distribution over the classes easily (Pang & Lee, 2008). The usage of bar charts and histograms is fairly common (Dey & Haque, 2009; Ku, Lo, & Chen, n.d.; B. Liu et al., 2005). Notable is the work of (B. Liu et al., 2005), which plots the bars from the baseline vertically upward for positive counts and downward for negative counts, making it easy to recognize the majority sentiment.
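A sketch of the color-continuum mapping described above, for a bounded average sentiment score; the interval [-1, 1] and the linear interpolation are illustrative choices, not the exact mapping of Gamon et al.

import java.awt.Color;

// Maps a bounded sentiment score in [-1, 1] to a red-white-green continuum:
// -1 -> red, 0 -> white, +1 -> green.
public class SentimentColorMapper {

    public static Color toColor(double sentiment) {
        double s = Math.max(-1.0, Math.min(1.0, sentiment));
        if (s < 0) { // negative: blend from white (0) toward red (-1)
            return new Color(255, (int) (255 * (1 + s)), (int) (255 * (1 + s)));
        }
        // positive: blend from white (0) toward green (+1)
        return new Color((int) (255 * (1 - s)), 255, (int) (255 * (1 - s)));
    }
}

4.4.2 Visualization of relationships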

Graphs are frequently used to illustrate relationships. Kaiser and Bodendorf (Bodendorf & Kaiser, 2009; Kaiser & Bodendorf, 2009) use graphs to visualize the analysis of communication patterns in online forums. With nodes representing users and edges representing communication between them, they illustrate the community's communication structure. The users' sentiment orientation, as expressed in their messages regarding a certain topic, is analyzed and the nodes are colored accordingly. The work of (Gloor et al., 2009), although not a work on sentiment analysis, illustrates how a similar graph network can be created to illustrate connections between websites. That work focused on visualizing networks based on antithetical topics (e.g. Democrat and Republican), coloring the nodes according to the majority topic. One can see how easily the approach could be transferred to sentiment analysis, for example by coloring each node according to the prevailing sentiment orientation.

Figure 12: Graphs illustrating communication patterns and sentiment orientation (Kaiser & Bodendorf, 2009, p. 130)

A graph-like structure is also employed by Xu et al. (Xu et al., 2011) to visualize the relationship between features of competing products. Their visualization, called Comparative Relation Map (CRM), is a synthesis of graphs and bar charts. Products (subjects) and their features (subject-features) are represented as nodes in the graph. Edges are drawn from the center node, the main product of interest, to all other product nodes via feature nodes. While product nodes are plain circles, feature nodes are colored bars. Their size is proportional to the number of documents which mention the feature in the product-product context, while parts of the bar are colored proportionally to the share of sentiment regarding the feature in the underlying documents.

Figure 13: Comparative Relation Map (Xu et al., 2011, p. 753)

4.4.3 Timelines and Trends

Another dimension of interest is the variation of sentiment over time, which cannot be captured by visualizations which only capture a discrete state in time. The general approach in this regard is simply to plot the relevant summary statistic over time (Pang & Lee, 2008). This can be accomplished by plotting bar charts with one (or more) bars for every discrete state defined by a time window (Ku et al., n.d.), by plotting documents as points in a two-dimensional chart with one axis representing the summary statistic (e.g. binary sentiment orientation) and the other time (Missen et al., 2010; Tsytsarau, Palpanas, & Denecke, 2010), or by drawing trend lines (Figure 14) (D. Li et al., 2012; Mishne & Rijke, 2006a; Missen et al., 2010).


Figure 14: Trend plots (Moodgrapher) (Mishne & Rijke, 2006a, p. 153)

A number of techniques have been combined in the novel visualization technique of Wanner et al. (Wanner et al., 2009), which manages to plot a number of attributes in a single visualization. In their work, they analyze the sentiment contained in RSS newsfeeds and plot the results over time, encoding attributes by placement, color, shape, and opacity of symbols (Figure 15).

Figure 15: Visualization technique of Wanner et al. (Wanner et al., 2009, p. 4)

A document is represented by a bar symbol. Based on the time and date it was published, it is placed on the graphic. A row in the graphic represents a day, and rows are stacked vertically, with the day of the year increasing downward. The time of day increases along the horizontal axis of the row, and the document is placed accordingly.


The shape of the document symbol is determined by the underlying topic, while the color is determined by a higher-order topic. All documents are plotted with a low opacity, therefore allowing documents to overlap each other. As the semi-transparent shapes overlap when published at the same point in time, the resulting high-opacity area indicates a higher publishing frequency. Sentiment orientation is again plotted by placement: a neutral document's symbol is placed on the vertical middle baseline of the row; it is raised proportionally to its positive sentiment, and lowered proportionally to its negative sentiment.


5 Towards parallel execution

5.1 Overview

The preceding chapters established that opinion mining requires a number of transformative operations on unstructured text (Chapter 3) and relies heavily on machine learning algorithms to classify extracted text into subjectivity and sentiment classes (Chapter 2). Opinionated documents can be retrieved from a number of sources, ranging from web sites to micro-blogging services and social networks (3.3). Recent years have seen a remarkable growth in the amount of data produced in all kinds of domains, globally surpassing the zettabyte barrier (Gantz & Reinsel, 2010, 2011), but especially in the amount of data produced by individuals, who are responsible for about 75% of all data created globally (Gantz & Reinsel, 2011), encouraging the growth of social networks like Twitter, Facebook, and many others (Gantz & Reinsel, 2011). As the increase in available processable opinionated documents is seen as one factor which led to the rising relevance of, and hence activity in, the field of opinion mining (B. Liu, 2010; Pang & Lee, 2008), this development should not go unnoticed. As a matter of fact, there is no reason to think that systems like sentiment search engines (Kamvar & Harris, 2011; Pang & Lee, 2008) will not require technology which is able to cope with such high amounts of data. Parallel processing is a known strategy to process large amounts of data in a reasonable amount of time, by simply dividing and distributing the task set among a number of workers which process the tasks simultaneously. The last decade saw the introduction of a new parallel processing paradigm called MapReduce (Dean & Ghemawat, 2008). Due to its inception in the world of search engines (Dean & Ghemawat, 2008; White, 2012), it is designed to process unstructured text well (White, 2012). The paradigm has spawned a number of implementations (K. Lee, Lee, Choi, & Chung, 2012), the most notable of which is Apache's Hadoop45 (K. Lee et al., 2012; White, 2012).

5.2 The MapReduce Framework

MapReduce is a programming paradigm that can be employed to write program code which can be executed in a parallel and distributed fashion (Dean & Ghemawat, 2008). It was created within the walls of Google Inc.46 and made public in a paper published by Dean and Ghemawat in 2004 (Dean & Ghemawat, 2008; White, 2012). MapReduce allows programmers to write distributed programs without the explicit need to consider and manage the issues involved in parallelization (Dean & Ghemawat, 2008). A central point of the concept is the separate implementation of a run-time system which takes over the details of partitioning the input data, scheduling execution across machines, handling failures, and managing inter-machine communication (Dean & Ghemawat, 2008). Programmers simply write their programs to conform to the specification, and the run-time system takes care of the distributed execution (Dean & Ghemawat, 2008). The system should also ensure fault-tolerance, data-distribution, and load-balancing (Dean & Ghemawat, 2008).

45 http://hadoop.apache.org
46 http://www.google.com

The concept is based on the map and reduce primitives known from functional programming languages (Dean & Ghemawat, 2008). Programmers employing the MapReduce model write programs in which single discrete tasks are expressed as a map function which processes a key/value pair and emits one or more intermediate key/value pairs, which are subsequently processed by a reduce function (Dean & Ghemawat, 2008). The reduce function merges all intermediate values which are associated with the same intermediate key, producing output that is at most the size of the input, and in most cases smaller (Dean & Ghemawat, 2008). Writing against this interface enables automatic parallelization and distribution (Dean & Ghemawat, 2008). MapReduce was originally designed with computing clusters in mind (Dean & Ghemawat, 2008), but the paradigm can also be applied to multi-core systems (Chu, Kim, Lin, & Yu, 2007; K. Lee et al., 2012). An example program written in pseudo-code to illustrate the model is given below47. The task is to count all words in a document collection, creating a list of words with the number of occurrences in the collection. Each document is a single input record which is supplied to the map function as a key/value pair: <document_name; content>
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

The map function iterates word by word through the content and emits a <word, 1> pair for every word. As words can occur more than once in a sentence, and more than once across multiple documents, the emitted pairs are collected and stored for each key.
47 From (Dean & Ghemawat, 2008, p. 138); other examples can be found in the same paper.


This creates something like a tally sheet: an entry for each word in the corpus with a list of 1s. The count of 1s is equal to the number of occurrences of the word in the collection. The reduce function iterates through the list of values for each key and sums the values up. The output of the function is then the actual count of occurrences of this word in the corpus. As one can easily see, each record can be processed independently by the map function, allowing multiple map functions to be executed in parallel. The only restriction for the reduce function is that all values associated with the same key are processed by the same reduce function, again enabling parallel execution over all intermediate records (as intermediate outputs are collected on a per-key basis).

Figure 16: MapReduce Flow

As map and reduce functions can be executed in parallel, a MapReduce program can run map and reduce functions on multiple machines (or even multiple functions on each single machine) (Dean & Ghemawat, 2008). This requires some coordination of the input, intermediate, and output files: Input data is divided into M splits, where M is the number of instances of the map function, and each split is fed to a map function (Dean & Ghemawat, 2008). The intermediate key/value pairs are partitioned by a user-supplied partitioning function on the key space, splitting the data into R partitions (R is the number of instances of the reduce function). Each program requires a MapReduce specification object which contains properties such as the input and output path.

As computations can be executed across multiple machines, intermediate results produced by one machine can be reduced in situ by a specified reduce function, which is in this case called a combine function (Dean & Ghemawat, 2008). This reduces the amount of data that has to be transferred over the network. The implementation of the run-time system described in the paper also introduced the idea of locality to improve processing speed (Dean & Ghemawat, 2008). Instead of transferring large amounts of data across the network, the presented implementation attempts to execute the program close to its place of storage, ideally on the same machine. This of course blurs the execution/storage distinction and requires data to be stored on the same machines which routinely perform computations, instead of having distinct, separate, and specialized systems for storage and computation. But as programs are many times smaller than the data files they process, this utilizes the network bandwidth, usually a scarce resource, much more efficiently (Dean & Ghemawat, 2008).

5.3 Hadoop-based Parallel Processing

5.3.1 Hadoop: A MapReduce Implementation

Hadoop48 is the name of an open-source implementation of MapReduce (White, 2012), curated by the Apache Software Foundation49. It is the reference implementation of the MapReduce framework (K. Lee et al., 2012), as others are proprietary (Google) or not as widely used (K. Lee et al., 2012). Hadoop is in use at a number of companies, such as Facebook, Twitter, Yahoo!, LinkedIn, CBS Interactive, Foursquare, and many others (Kalakota, 2011; J. Lin & Kolcz, 2012; Stonebraker, Abadi, & DeWitt, 2010; Thusoo et al., 2010; White, 2012). Hadoop is designed to work on very large datasets, capable of processing many hundreds of gigabytes per query (White, 2012). It is a batch query processor, built on the assumption that an entire dataset, or at least a large part of it, has to be processed to answer any given query (White, 2012). The input and output formats are file-based, as datasets are likely of a size which will not fit into memory (White, 2012).

48 http://hadoop.apache.org
49 http://www.apache.org


Releases50 of Hadoop include three to four modules: Versions 1.X and 0.2X include the modules Hadoop Common, Hadoop Distributed File System, and Hadoop MapReduce (The Apache Software Foundation, 2012a; White, 2012). Version 2.X includes the additional component Hadoop YARN, a newer version of the runtime (The Apache Software Foundation, 2012a; White, 2012). At the time of writing (2012), 2.X was still in the alpha stage; therefore the description of the Hadoop framework considers only the 1.X and 0.2X lines. The three central components of the Hadoop distribution are:
- Hadoop MapReduce (MR): the distributed data processing model and execution environment (White, 2012);
- Hadoop Distributed File System (HDFS): the distributed file system, optimized for high throughput (The Apache Software Foundation, 2012a; White, 2012);
- Hadoop Common: utilities which support the other modules (like serialization, Java RPC, and persistent data structures) (The Apache Software Foundation, 2012a; White, 2012).

50 Hadoop Releases: http://hadoop.apache.org/releases.html

These components make up the basic Hadoop framework. In addition, a number of other projects have created add-ons which work on top of these components to extend the core functionality in many ways (K. Lee et al., 2012; White, 2012). Some examples are Hive (Thusoo et al., 2010), a distributed data warehouse; Pig (K. Lee et al., 2012; White, 2012), a high-level analytical language; Mahout (Owen, Anil, Dunning, & Friedman, 2012), a machine learning module; and ZooKeeper (K. Lee et al., 2012; White, 2012), a distributed coordination service.

HDFS

HDFS allows the storage of very large files (from hundreds of megabytes to terabytes in size) and provides streaming data access following a write-once, read-many-times pattern (White, 2012). Files are broken down into blocks for storage; the block size is configurable, and the default block size is 64 MB (White, 2012). Blocks of a single file can be distributed across multiple storage disks, even on other nodes of the cluster (White, 2012). Typically, stored files are replicated (three times by default), primarily for fault tolerance, but replication also allows multiple jobs to use the same data without having to wait on each other (White, 2012). An HDFS cluster has two types of nodes in operation: one Namenode (master) and at least one Datanode (slave) (White, 2012). Figure 17 illustrates this. The Namenode manages the filesystem's namespace, while the Datanodes store file blocks and provide write and read access to them to requesting clients (user programs) or the Namenode (White, 2012).

MapReduce

A MapReduce job is a unit of work performed for the client (user program). Under the Hadoop framework, a typical MapReduce job consists of three parts: a Driver class, a Mapper class, and a Reducer class (White, 2012). The Driver configures a (org.apache.hadoop.mapreduce.)Job object which holds the specification of the MapReduce job. The Job object is used by the Driver class to run and control the MapReduce job (White, 2012). The Mapper class used in the job must be a subclass of the abstract class (org.apache.hadoop.mapreduce.)Mapper, and it must implement the map method which fulfills the purpose of the map function of the MapReduce programming model (i.e., it is applied to every input record and emits an intermediate result). Analogously, the Reducer class used in the job must be a subclass of the abstract class (org.apache.hadoop.mapreduce.)Reducer, and must implement a reduce method accordingly. The reduce method will be called for each key with a collection (of type java.lang.Iterable) of all intermediate values of the same key. An example of a simple MapReduce job can be found in the appendix (Table 11, p. 134).

Mapper and Reducer classes can be instantiated multiple times per job, so that each object may only see a part of the data. These subunits of the job are referred to as tasks. The division of input files into fixed-sized pieces is called splitting; the pieces are called input splits. A task is created for every split. As reducers are more often than not on another node, a Combiner class can be used to reduce the intermediate output of one node before it is transferred to the reducer.

Input is expected in the form of files, and intermediate and final outputs are written to file(s) as well. A line in the files is a record, and every record is a key/value pair. Natively, Hadoop can process files of three types: text files, SequenceFiles, and MapFiles. Text files are ordinary text files, while SequenceFiles and MapFiles are special binary files which are optimized for efficient operation within Hadoop. MapFiles differ from SequenceFiles in that their records are ordered by their keys.

Data types that are processed by the framework must be serializable; thus key and value datatypes must implement the (org.apache.hadoop.io.)Writable interface. For the most part, this can be achieved by using a wrapper class which implements the necessary methods to serialize and deserialize the data to and from file. Keys must also implement the (org.apache.hadoop.io.)WritableComparable interface, which allows the system to sort the records and distribute them accordingly.

The job execution requires an infrastructure consisting of two types of nodes: one JobTracker (master) and at least one TaskTracker (slave). The JobTracker schedules and manages the jobs, while the TaskTracker(s) execute the task(s). Typically, the TaskTracker and Datanode share one computing node (one machine) so that tasks can be executed on the machine which holds the data (to exploit data locality). In order to distribute the computation, the code has to be distributed to the executing TaskTracker. This is done by distributing the Java archive (JAR) which contains the job. If the job depends on other resources, other libraries for example, these resources have to be distributed as well. An illustration of the infrastructure can be found in Figure 17 below:

Figure 17: Infrastructure of a Hadoop cluster
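The following is a minimal sketch of such a job for the word-count task of section 5.2, written against the interfaces just described (Hadoop 1.X, new API); a complete job used in this thesis is listed in the appendix (Table 11, p. 134).

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // emit intermediate <word, 1>
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    // The Driver: configures and submits the Job object.
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class); // in-situ reduction per node
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

5.3.2 Nutch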

Apache Nutch51 is a software project which started with the aim of creating an open-source search engine (White, 2012). It belongs to the family of projects around Apache Lucene52 (a full-text search library), to which Apache Solr53 (a search platform) also belongs (The Apache Software Foundation, 2012b). The Nutch project provides a Java library which contains implementations of crawling and parsing capabilities (The Apache Software Foundation, 2012b). Nutch can be run in a parallel and distributed fashion on a Hadoop cluster (White, 2012). In fact, Hadoop is a spin-off of the Nutch project (White, 2012). In this thesis, Nutch has been used as a crawler to collect HTML documents from web sites (6.3.1). The Nutch library was used in version 1.5.1, the latest stable version at the time work on this thesis began; the latest stable version at the time of publishing is 1.6 (December).

51 http://nutch.apache.org/
52 http://lucene.apache.org/core/
53 http://lucene.apache.org/solr/

5.3.3 Mahout

Mahout54 is an open-source machine learning library written in Java and curated by the Apache Software Foundation (Owen et al., 2012). The library includes implementations of a number of ML algorithms, among them clustering and classification algorithms (Ingersoll, 2012; Owen et al., 2012). Mahout has been built with scalability in mind, and the implemented algorithms are designed to work on large to extremely large data sets (Owen et al., 2012; The Apache Software Foundation, 2012c). Mahout works well in a Hadoop environment because many algorithms have been implemented in a parallel version using the Hadoop framework as the parallelization platform (Owen et al., 2012). Thus, these algorithms naturally support Hadoop input/output formats (such as SequenceFiles), can be run in parallel on a Hadoop cluster, and can be integrated into Hadoop workflows (Owen et al., 2012). Mahout is still in development (Owen et al., 2012; The Apache Software Foundation, 2012c), which has to be considered, as three major issues can impede a project:
- Desirable features or algorithms may be missing (a list of implemented classification and clustering algorithms can be found in the appendix: Table 12, p. 136),
- Major API changes (such as from version 0.6 to 0.7),
- Missing or incomplete documentation (as of the time of writing, many parts of the documentation on Mahout's website are either outdated or missing).

In this thesis, the Mahout library has been used in version 0.7 (the latest stable release).

54 http://mahout.apache.org


5.4 Parallelized Opinion Mining

Very recently, a number of works have explored the utilization of the MapReduce paradigm to parallelize opinion mining (Khuc et al., 2012; Yerva et al., 2012), or have stated that they intend to explore it in the future (Akcora et al., 2010; Kamvar & Harris, 2011; Silva, Carvalho, Sarmento, & Magalhães, n.d.). The motivational factor behind those works is usually that the amount of data that has to be processed is very large and cannot be processed in a reasonable time frame using conventional methods (Akcora et al., 2010; Bautin, Ward, Patil, & Skiena, 2010; Fen, Yabin, & Yanping, 2012; Johnson, Shukla, & Shukla, n.d.; Khuc et al., 2012). In the surveyed works, MR has been applied in four areas: storage, pre-processing, training/lexicon induction, and classification. Hadoop was used in all of the works which explicitly mentioned a framework.

5.4.1 Storage

Large datasets require adequate storage facilities. As Hadoop comes with a specialized file system for large files, and as the exploitation of data locality requires the execution of the code in the data's vicinity, the choice of Hadoop-based storage systems does not come as a surprise. In addition to HDFS (D. Li et al., 2012; J. Lin & Kolcz, 2012; Yerva et al., 2012), some works utilized Hadoop-based database systems to store intermediate and final processing results (Khuc et al., 2012; Yerva et al., 2012).

5.4.2 Training and Lexicon Induction

Parallelizing the training or lexicon induction is probably the most challenging problem, yet also the one with the highest pay-off. Training a model or creating a lexicon takes time, and parallelizing the task can speed this process up. In addition, ML algorithms usually improve with larger training sets, and a parallel learning process makes training on very large training sets feasible. Again, challenges and solutions differ between lexicon induction and supervised classifiers.

Lexicon Induction

As the methods and algorithms used for lexicon induction differ a lot (Section 2.4), it is hard to generalize the parallelization process. Essentially, the potential for parallelization depends on whether the individual algorithms can be translated into parallel algorithms. An example is the work of Khuc et al. (2012), which translated the corpus-based approach of Velikovich and Blair-Goldensohn (2010). The original paper uses a Twitter corpus created with the emoticon trick (Section 2.3.3), with the positive and negative


emoticon lists used as seed words. It uses four steps to create a graph in which the nodes are words present in the corpus and the edges represent co-occurrences of words in the corpus, with edge weights being a similarity measure on the co-occurrence counts. After a pruning step, sentiment labels are propagated from the seed words to unlabeled nodes. These steps have been individually translated into separate MapReduce jobs (Khuc et al., 2012).
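To give an impression of such a translation, the following minimal sketch (an illustration, not Khuc et al.'s actual implementation) expresses the co-occurrence counting step as a Hadoop MapReduce job; the whitespace tokenization and the choice of a whole tweet as the co-occurrence window are simplifying assumptions.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit a count of 1 for every pair of words co-occurring in a tweet.
class CooccurrenceMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(Object key, Text tweet, Context ctx)
            throws IOException, InterruptedException {
        String[] tokens = tweet.toString().split("\\s+");
        for (int i = 0; i < tokens.length; i++)
            for (int j = i + 1; j < tokens.length; j++)
                ctx.write(new Text(tokens[i] + "\t" + tokens[j]), ONE); // one edge observation
    }
}

// Reducer: sum the observations to obtain co-occurrence counts (the basis of the edge weights).
class CooccurrenceReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        ctx.write(pair, new IntWritable(sum));
    }
}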

Supervised ML classifiers

Parallel training of classifiers is severely hindered by the fact that most existing classifier algorithms and implementations are very centralized, which makes them impossible, or at least very costly and impractical, to implement in a parallel fashion (W. Chen, Zong, Huang, & Ou, 2011; J. Lin & Kolcz, 2012). But there are two possible solutions for creating a parallel classifier: parallelizable classifier algorithms and ensembles (J. Lin & Kolcz, 2012). A number of algorithms can be expressed in a way which can be transformed into a MapReduce algorithm (W. Chen et al., 2011; Chu et al., 2007). Another proposed solution is the use of online learning algorithms employing Stochastic Gradient Descent (SGD) as the (or part of the) learning approach (Agarwal, Chapelle, Dudik, & Langford, 2012; Bottou, 2010; J. Lin & Kolcz, 2012). In evaluations on text-classification problems, these methods have demonstrated that they can deliver results comparable to SVMs, the current standard (Agarwal et al., 2012; Bottou, 2010). It should not be left unmentioned that SGD-based training will only approximate the optimal solution (Shalev-Shwartz, Singer, & Srebro, 2007). The Mahout implementation contains a parallel (Complementary) Naïve Bayes algorithm as well as a family of Logistic Regression algorithms which are trained using SGD (Owen et al., 2012). Logistic Regression has been shown to be able to approximate SVMs in text-classification problems when applied to a large text corpus (Zhang et al., 2003). The second option is to parallelize sequential algorithms using ensembles (J. Lin & Kolcz, 2012). Ensemble methods combine a number of classifier models learned independently on partitions of the training set (Han et al., 2006). As the models are learned independently of each other, training can happen simultaneously (J. Lin & Kolcz, 2012). This makes it possible to employ sequential classification algorithms, like sequential SVMs, in a parallelized and distributed fashion (J. Lin & Kolcz, 2012). In a bagged ensemble, classifiers first learn independently on a subsample; in operation, they classify an input document independently and a combination method then decides


the final classification label based on the individual outputs (Figure 18) (Han et al., 2006). The simplest combination function is majority voting, which decides the class label based on the mode55 label of the classifier outputs (Han et al., 2006; J. Lin & Kolcz, 2012). Ensembles usually increase accuracy by reducing misclassification errors (Han et al., 2006; J. Lin & Kolcz, 2012). A minimal sketch of such a voting function is shown below.
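As a simple illustration (not code from the cited works), such a majority-voting function might look as follows in Java:

import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of a majority-voting combination function: the mode label wins.
class MajorityVote {
    static String combine(List<String> labels) {
        Map<String, Integer> counts = new HashMap<>();
        for (String label : labels)
            counts.merge(label, 1, Integer::sum);           // count votes per label
        return Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();     // label with the highest count
    }
}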

Figure 18: Voting in Ensemble Methods (documents from the sources are classified independently by classifiers 1 to n; a combine-votes step produces the final prediction)

Left to discuss are some practical considerations when building trainable classifiers in MapReduce. These considerations are also relevant when implementations like Mahout or others are used. The first is the requirement of a controller, something that controls and directs the training algorithm (W. Chen et al., 2011). In the MapReduce context, a controller is usually a driver which controls one or multiple MapReduce jobs (White, 2012), or it can be a Pig script as in (J. Lin & Kolcz, 2012). The classifiers themselves can be wrapped in an interface which hides the details of parallel execution. In (J. Lin & Kolcz, 2012), a classifier exposes a classify method which takes vectorized documents as input and returns a classification object (a label) as output. Batch learners expose a train method which takes a set of vectorized and properly labeled documents as input and returns a model as output. Online learners, which stream through training samples to update the model, expose an update method which takes a labeled input vector and updates the internal model. Both learner types are able to serialize their models to file, while classifiers can load a model. As the authors use Pig, the parallelization is implied and the interfaces are an abstraction from it. But the general idea can be applied broadly, which allows interfaces to be exposed for classifiers while hiding the parallelization. Table 2
55 The label most classifiers submitted.


shows the generalized interfaces classifier, and the trainables batch learner and online learner. Classifiers require a model and optional parameters to be instantiated; in operation they receive properly vectorized data v, classify the records, and return a classification object, commonly a label. As this will be executed in a MR environment, the input data will most likely be in the form of files. Therefore a proper input may very well be the location of one or multiple files of input records in a suitable format. The output will also probably be saved to file, containing classification labels for each input record.

Interface        Constructor                     Method     Input                  Output
classifier       model or lexicon (parameters)  classify   data to classify <v>   classification object
batch learner    (parameters)                    train      training data <l,v>    model
online learner   (parameters)                    update     training data <l,v>    model

Table 2: Generalized interfaces

Trainables may require some parameters to be instantiated. They will work on pairs of labels and vectorized documents <l,v>. Here again, training datasets will most likely be large files stored at a retrievable location. Output is a model, typically serialized to file. Batch learners expose a train method, while online learners expose an update method.
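Rendered in Java, the generalized interfaces might look as follows; all type names here are illustrative assumptions, not DOX's actual classes (those are described in chapter 6):

import org.apache.mahout.math.Vector;

interface Model {}                                      // placeholder for a serializable model

// A labelled pair <l,v> of class label and feature vector.
class LabelledVector {
    final int label;
    final Vector vector;
    LabelledVector(int label, Vector vector) { this.label = label; this.vector = vector; }
}

interface Classifier {
    String classify(Vector v);                          // returns a classification object (label)
}

interface BatchLearner {
    Model train(Iterable<LabelledVector> trainingData); // batch training on <l,v> pairs
}

interface OnlineLearner {
    void update(LabelledVector sample);                 // streaming update of the internal model
    Model model();                                      // current model, serializable to file
}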


Several behaviors of MapReduce implementations can be exploited to support typical machine learning procedures such as randomizing (shuffling) records, partitioning data into training and test sets, and training bagged ensembles (J. Lin & Kolcz, 2012). As MR implementations sort the output pairs of the map phase by key, it is easy to randomize the order of the records by assigning random numbers as keys (J. Lin & Kolcz, 2012). The pairs will be sorted by key and therefore arrive at the reducers in a new, randomized order. With the same method, a data set can be split into training and test sets (J. Lin & Kolcz, 2012): after assigning random numbers drawn from the interval [0, 1) as keys, records are split by key into training and test file(s) by defining a threshold t, 0 < t < 1, where t is the relative size of the test set. For example, to create a 90/10 split, where 90% of the records are used as the training sample and 10% as the test sample, the t-value will be 0.1. All records with a key greater than t are stored in the training file(s), while all records with a key less than or equal to t are stored in the test file(s). In Hadoop MapReduce this can be achieved in three ways: (1) a chain of three map passes, (2) a MapReduce job which collects the smaller test set and writes it to a separate file, or (3) two reducers which each receive one partition of the key space. The first approach uses three map-no-reduce jobs (jobs without a reducer): the first job assigns random keys; the second job passes over the output of the first job and emits only the records with keys greater than t to a directory designated to hold the training set; the third pass does the same, but emits only pairs with keys less than or equal to t to a directory designated to hold the test set. The second approach contains a mapper which assigns random number keys and a reducer which emits pairs with keys greater than t to files in the training set directory; it stores all other pairs in a separate file, requiring it to either open a writing stream to a new file or to hold the data in memory until it can be written to file. The third, and probably most elegant, solution is to employ two simple reducers which each receive one partition of the key space, as each reducer will create a separate file. This, however, requires a custom partition function (a minimal sketch follows the Pig script below). The whole process is a lot easier to describe in a Pig script (J. Lin & Kolcz, 2012, p. 798):
data = foreach data generate target, features, RANDOM() as random; split data into training if random <= 0.9, test if random > 0.9;
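For comparison, a minimal Java sketch of the custom partition function required by the third approach, assuming the random keys are stored as DoubleWritables and t = 0.1:

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// With two (identity) reducers, partition 0 collects the test set (key <= t) and
// partition 1 the training set (key > t); each reducer writes a separate output file.
class ThresholdPartitioner extends Partitioner<DoubleWritable, Text> {
    private static final double T = 0.1;               // relative test-set size

    @Override
    public int getPartition(DoubleWritable key, Text value, int numPartitions) {
        return key.get() <= T ? 0 : 1;
    }
}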

The fact that reducers each produce an individual output file can be exploited to train bagged ensembles of classifiers (J. Lin & Kolcz, 2012). Using again a mapper which assigns random numbers as keys, the training set can be partitioned over the key space. A reducer wraps a learner and feeds the received records to it; after all records are processed, the final model is emitted. By setting the number of reducers, the number of created models can be controlled (Figure 19). In experiments, running times could be reduced as expected by increasing the number of machines (Cho, Jung, & Kim, 2010; Khuc et al., 2012). The relative reduction of running time per machine seems to be proportional to the size of the dataset (Cho et al., 2010; Khuc et al., 2012). Interestingly, the reductions per machine show a diminishing rate of return, as the running time seems to converge on a lower limit proportional to the size of the dataset (Cho et al., 2010; Khuc et al., 2012). This may be


caused by the overhead produced by setting up and coordinating MR jobs (Cho et al., 2010; Khuc et al., 2012).
Figure 19: Using reducers to train ensembles (randomizer mappers assign random number keys to the training set partitions; the intermediate results are shuffled by key, and each trainer reducer trains a single classifier model, yielding models 1 to n)

5.4.3 Preprocessing

The majority of the works which utilize MapReduce use it to support the classification process by parallelizing steps of the pre-processing workflow (Fen et al., 2012; Johnson et al., n.d.; Kraychev & Koychev, 2012; Yerva et al., 2012). Steps that are easily expressible as a MR algorithm are any splitting, chunking, and tokenizing tasks (Yerva et al., 2012), tagging tasks (Khuc et al., 2012), metadata extraction (Kraychev & Koychev, 2012; Yerva et al., 2012), filtering tasks (Johnson et al., n.d.; Khuc et al., 2012), and tasks which generate input vectors from text (Fen et al., 2012; Khuc et al., 2012). In addition, MR programs can be used to parallelize the retrieval of additional metadata, for example to fetch location information for twitter users (Johnson et al., n.d.). Of course, the initial transformation of fetched documents from their particular format into a new format is also possible (Kraychev & Koychev, 2012; Yerva et al., 2012). There are two general approaches to employing MR: the first one is to simply wrap an atomic task in a Mapper (or Reducer); the second and more difficult one is to write a specific algorithm which is adapted to the MR environment (Khuc et al., 2012). The first approach is very simple and allows the reuse of already existing code, for example the inclusion of a tagger in a map function (Khuc et al., 2012), or the application of an opinion-mining-oriented parser in parallel to a set of documents (Kraychev & Koychev, 2012). A sketch of this wrapping pattern is shown below.
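A minimal sketch of the wrapping pattern, with a trivial stand-in for the pre-existing sequential code:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// An existing, sequential tokenizer (here a trivial stand-in) is wrapped unchanged
// inside a Mapper and thereby applied to all records in parallel.
class TokenizeMapper extends Mapper<Text, Text, Text, Text> {

    // Stand-in for pre-existing sequential code (e.g. a tagger or parser).
    static class SimpleTokenizer {
        String[] tokenize(String text) { return text.toLowerCase().split("\\W+"); }
    }

    private final SimpleTokenizer tokenizer = new SimpleTokenizer();

    @Override
    protected void map(Text docId, Text document, Context ctx)
            throws IOException, InterruptedException {
        ctx.write(docId, new Text(String.join(" ", tokenizer.tokenize(document.toString()))));
    }
}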


Pig56, a programming language which can be compiled into executable MapReduce jobs (J. Lin & Kolcz, 2012; White, 2012), is also frequently used for analytical processing (Johnson et al., n.d.; J. Lin & Kolcz, 2012).

5.4.4 Classification

To use a classifier in production, it has to be embedded in a map or reduce function (Cho et al., 2010; Yerva et al., 2012), or, if Pig is used, in a user-defined function (UDF) (J. Lin & Kolcz, 2012). While pre-processing steps can be executed in advance of the classification job, speed advantages can sometimes be achieved by combining pre-processing or metadata extraction with the sentiment extraction (Cho et al., 2010; Yerva et al., 2012), as this reduces the overhead associated with setting up MapReduce jobs (White, 2012). As stated in the interface description of the classifier in Table 2, classifiers require that a particular resource (a sentiment lexicon for scorers, a model for trained classifiers) is loaded before classification (Cho et al., 2010; Yerva et al., 2012). In Hadoop, this can be achieved by using the setup method of the Mapper or Reducer class, which can load resources from files and construct the classifier (White, 2012), as sketched below.
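A sketch of this pattern; the SentimentClassifier wrapper is hypothetical and stubbed here only to keep the sketch self-contained:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Hypothetical classifier wrapper (only a stub, to make the sketch self-contained).
interface SentimentClassifier {
    static SentimentClassifier load(Path model, Configuration conf) throws IOException {
        throw new UnsupportedOperationException("sketch only");
    }
    String classify(Vector v);
}

// The classifier is constructed once in setup() from a model file on HDFS,
// then applied to every input vector in map().
class ClassifyMapper extends Mapper<Text, VectorWritable, Text, Text> {
    private SentimentClassifier classifier;

    @Override
    protected void setup(Context ctx) throws IOException {
        Path modelPath = new Path(ctx.getConfiguration().get("model.path"));
        classifier = SentimentClassifier.load(modelPath, ctx.getConfiguration());
    }

    @Override
    protected void map(Text docId, VectorWritable vector, Context ctx)
            throws IOException, InterruptedException {
        ctx.write(docId, new Text(classifier.classify(vector.get()))); // e.g. "Positive"
    }
}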

56 http://pig.apache.org


6 System Design

6.1 Overview

As part of this thesis, a system called DOX has been implemented for demonstration purposes. The system allows a user to track subjects and analyses opinions expressed in documents retrieved from online sources associated with the subject. An HTML-based user interface (UI) allows users to enter tracking queries, administrate and control the system, and of course view the results of the analysis. The system has been implemented entirely in Java, utilizing the Hadoop framework (Section 5.3) and a number of freely available open-source libraries. DOX is built around a central process (Figure 20): users define subjects which they are interested in and which they would like the system to track. The system is able to track multiple subjects at once, and in the context of the system each tracked subject is referred to as a tracking job. Based on the list of tracking jobs, the system collects documents relevant to the subject of the job from online sources. Collected documents are analysed and the results are presented to the user. An update process updates the results in the system and can be triggered either (a) when new jobs have been added or (b) in regular update intervals to retrieve and analyse new documents for all active tracking jobs.

Figure 20: Central Process (Definition → Collection → Analysis → Presentation → Update)

The system is therefore in some ways similar to opinion search engines (Pang & Lee, 2008), which allow users to query and retrieve subjective content. To be able to answer any possible query immediately, such engines would have to undertake a collection effort comparable in scale to existing web search engines (like Google or Yahoo), which means they would have to attempt to retrieve and analyse all possibly relevant documents. To reduce this complexity, DOX is limited to tracking jobs, although the system could be adapted to work in a similar manner (Section 8).


6.2 Process Execution

The central process is controlled by the ExecutionEngine class (Figure 21). The ExecutionEngine accepts a collection of TrackingJobs via the submitJobs(Collection<TrackingJob> jobs) method. The method startCrawl() will cause the collection of web documents to begin, while the startTweetCollection() method will make the system begin the collection of tweets. As tweets are collected from a continuous stream of status updates, the method stopTweetCollection() will stop the running collection process. The method analyse() will start the analysis process, which processes all collected segments that have not yet been analysed. It determines this by the absence of a result snapshot for the time period (e.g. no corresponding folder).
Figure 21: Control classes (class diagram: the interfaces org.apache.hadoop.conf.Configurable and StatusEmitter, and the classes ExecutionEngine, CrawlerExecutionEngine, and AnalyserExecutionEngine)

There are separate classes which control subordinate parts of the process: the CrawlerExecutionEngine and the AnalyserExecutionEngine. The former controls the collection process for web documents while the latter controls the analysis of collected segments. Both start their respective processes when the executeJobs() method is called. They are able to communicate detailed progress updates by implementing the StatusEmitter interface, which allows the submission of an object through which status events can be relayed. All classes (ExecutionEngine, CrawlerExecutionEngine, AnalyserExecutionEngine) implement the org.apache.hadoop.conf.Configurable interface.

The implementation of these classes utilizes multi-threading to facilitate in-machine parallelization (Figure 22). The ExecutionEngine will run the collection of web documents and twitter status updates in two separate threads, while the AnalyserExecutionEngine will run the analysis process for each snapshot in a


separate thread. This is a necessity, as collection and analysis are long-running processes which would otherwise block the execution of other processes while running. They are parallelizable on a thread level, as they have no dependency on each other. Parallelization is, on the other hand, advantageous, as the crawl (collection of web documents) and analysis processes consist for the most part of MapReduce jobs, which do not tax the resources of the controlling machine too much, since they are for the most part executed on other machines.
Figure 22: Thread classes (TwitterCollectionThread, CrawlerThread, and AnalyserThread, all subclasses of Thread)

6.3 Collection

The collection of relevant documents from online sources is a necessary part of the process. As stated in section 3.4, documents are commonly imported either using a crawler or by using an API provided by a service provider. In DOX, a crawler is used to collect documents from the web, and twitter status updates are imported using Twitter's Streaming API (Twitter Inc., 2012a).

6.3.1 Collection of Web Documents

Documents from the web are collected using Apache Nutch in version 1.5.1 as a crawler (Section 5.3.2). Nutch has been selected because it works very well in a Hadoop environment, since both were developed in close conjunction, and because it naturally supports parallelization across multiple machines, as its internal processes are implemented as Hadoop jobs. The crawl process is controlled by the CrawlerExecutionEngine, which is run in a separate thread, the CrawlerThread, set up and started by the ExecutionEngine's startCrawl() method. The operation of the CrawlerExecutionEngine has two separate active phases: in the first phase the current crawl run is prepared (SETUP); the second executes Nutch (OPERATION).


In the setup phase, a set of seed URLs is prepared. Nutch requires a list of URLs which are used as seed (Nagel, 2011) and from which other websites are discovered as the crawler follows the links embedded in the pages. The quality of the seed set is important, as the discovery of relevant documents requires that they are reachable through traversal of the link structure. Therefore, the seed URLs should be relevant to the subject and highly central in a link graph of relevant and subjective documents (i.e. other relevant and subjective documents are reachable from the seed URLs). For this purpose, the initial set of seed URLs is compiled using the Google Search57 engine. Using the Google Custom Search API58, a seed list of around 100 URLs is created for each active tracking job. The Custom Search API provides methods to access the Google Search engine and to retrieve search results programmatically. The API is queried with the query terms contained in the tracking job, each separated by an OR operator (|), which can be referred to as the basic query. Working under the assumption that subjective documents will likely include links to other subjective documents, it would be advantageous to increase the number of URLs in the seed set which link to subjective documents. Therefore five terms were chosen which are assumed to indicate subjective documents: news, opinion, blog, review, and comment. These terms are then individually combined with the basic query to form expanded queries using the AND operator (a sketch of this expansion follows below). The seed set for a particular tracking job is then created by submitting the basic query and the five expanded queries individually to the Custom Search API. The first 20 results of each query are collected and compiled into a seed set of at most 100 URLs for a tracking job. As the system currently only supports English-language documents, search results are restricted to English pages only. The individual seed sets of all tracking jobs are joined into a seed set for the current crawl. In the first crawl, the seed set will be the only list of known URLs for the crawler. In all following executions, the URLs stored in Nutch's CrawlDB will form the basis of the crawl. The CrawlDB is updated before each re-crawl by injecting it with the URLs from the seed set generated in the setup phase, to ensure that newly created documents are included and the crawl properly reflects the current state of the web. In the operation phase, Nutch crawls the web and thereby retrieves the documents and updates its database of known URLs.
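The query expansion described above might be sketched as follows; the class and method names are illustrative, not DOX's actual code:

import java.util.ArrayList;
import java.util.List;

class SeedQueryBuilder {
    private static final String[] SUBJECTIVITY_TERMS =
            {"news", "opinion", "blog", "review", "comment"};

    // Builds the basic query plus the five expanded queries for one tracking job.
    static List<String> buildQueries(List<String> jobQueryTerms) {
        String basicQuery = String.join(" | ", jobQueryTerms);  // OR-combined basic query
        List<String> queries = new ArrayList<>();
        queries.add(basicQuery);
        for (String term : SUBJECTIVITY_TERMS)
            queries.add("(" + basicQuery + ") AND " + term);    // AND-expanded query
        return queries;
    }
}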
57 http://www.google.com
58 https://developers.google.com/custom-search/


Using Siren and Charron's LanguageIdentifier plugin (Nutch Wiki, 2009), the language of retrieved documents is identified and non-English documents are removed. As the crawler fetches the documents, it also parses them and stores their plain text contents as well as metadata in files. The parsed plain text content files (parsed text) and the parsed metadata (parsed data) can then be consumed by analysis pipelines. The crawler's output is stored in the dox.crawler.dir directory, per default ${dox.base.dir}/crawler.

6.3.2 Collection of Twitter Status Updates

Twitter status updates are collected by utilizing the Twitter Streaming API (Twitter Inc., 2012a), which is accessed using the Twitter4J library59. Twitter status updates are collected from the public sample stream, which is a sample selection of all current public tweets (Twitter Inc., 2012a). The sample is filtered using the joint set of query terms of all active tracking jobs. The collection process is controlled by the TwitterImporter class, which uses the Twitter4J library to open a filtered sample stream. Incoming tweets are buffered, processed, and written to file once the buffer has reached a pre-defined size. This reduces processing overhead while simultaneously ensuring that not all collected tweets are lost should the system fail spontaneously for some reason. The number of status updates collected in the buffer before they are processed should be chosen in a way which balances the number of status updates lost if a failure occurs against preventing the files from being too small (ideally close to the HDFS block size, per default 64 MB). Status updates are only added to the buffer if they fulfil two conditions: they must exceed a length of at least 20 characters and they must be written in English. It is, however, difficult to properly check the fulfilment of the second condition, as Twitter messages do not include metadata containing the language code of the message. And although users can choose their primary language, users often submit messages written in other languages. To identify the language properly, a language identifier has to be used; in this case Apache Tika's60 LanguageIdentifier has been used (The Apache Software Foundation, 2011). But because the identifier has been trained on longer text documents, it frequently misclassifies twitter status updates (which are very short). However, it has been observed that English (en) is often misclassified as Norwegian (no) or Hungarian (hu), while tweets which have been determined to be either en, no, or hu are rarely written in a language other than English if the user has indicated that English is his or her primary language. Therefore a heuristic has been employed to reduce the amount of misclassifications while simultaneously collecting as many status updates as possible: first, all tweets from users who have not selected English as their
59 http://twitter4j.org/en/index.html
60 http://tika.apache.org


primary language are removed; then the language of the tweet is identified through the LanguageIdentifier. The status will be added to the buffer if the language is identified as either en, no, or hu.

The output of the TwitterImporter adopts the file structure of the Nutch crawler. An overall uniform data structure enables other processes to read the collected data in a standardized way and therefore eliminates the need to implement different readers. Like Nutch, two SequenceFiles are created every time a batch gets written to the filesystem: parse_data/part-[n]/data (parsed data), which contains metadata; and parse_text/part-[n]/data (parsed text), which contains the status update text. Both are stored in a segment directory which is named by a string representation of the current timestamp in the form YYYYMMDDHHMMSS (e.g. 20121102040222 for 02.11.2012, 4:02:22 am). The variable [n] is replaced with a number which allows more than one file to be stored in a single segment. This is similar to the file structure created by Nutch (Section 5.3.2). All segments are stored in the dox.twitter.dir directory, per default ${dox.base.dir}/twitter.

Both files are SequenceFiles which contain <Key, Value> pairs. The key type is Hadoop's Text class and holds a pseudo-URL of the form twitter://[hashtag]:[user]:[tweetId]. The variable [hashtag] is replaced by the first hashtag of the tweet, or _nohash if no hashtag is included; the variable [user] is replaced by the concatenated user handle and user id of the sender of the status update (i.e. peter6615634815 if the twitter user has the username peter66 and the id 15634815); and the variable [tweetId] by the unique id of the twitter status update. The parsed text file's value is of type org.apache.nutch.parse.ParseText and simply stores a sanitized version of the twitter status update's text. The text is sanitized by using regular expressions (Regex) to remove or replace unwanted content or characters (Table 3). The parsed data file's value is of type org.apache.nutch.metadata.Metadata, an object which stores metadata in named pairs (<key, value>) of type <String, String>. The metadata stored for each tweet is described in Table 4.


Type                  Regex                                             Action
User handles          @(\\S)+                                           Remove (optional: replace with generic reference)
URLs                  ((mailto\\:|(news|(ht|f)tp(s?))\\://){1}\\S+)     Remove
Line breaks           \\r?\\n                                           Remove
Whitespace            (\\s)+                                            Replace all whitespace with a single whitespace
Leading whitespace    ^(\\s)                                            Remove
Trailing whitespace   (\\s)$                                            Remove

Table 3: Sanitizing twitter status updates
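Applied in Java, the sanitizing step might look like this minimal sketch, which chains the regular expressions from Table 3:

class TweetSanitizer {
    static String sanitize(String text) {
        return text
            .replaceAll("@(\\S)+", "")                                       // remove user handles
            .replaceAll("((mailto\\:|(news|(ht|f)tp(s?))\\://){1}\\S+)", "") // remove URLs
            .replaceAll("\\r?\\n", "")                                       // remove line breaks
            .replaceAll("(\\s)+", " ")                                       // collapse whitespace
            .replaceAll("^(\\s)", "")                                        // trim leading whitespace
            .replaceAll("(\\s)$", "");                                       // trim trailing whitespace
    }
}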

Property name   Description
userid          User id of the twitter user who originated the status update
username        Real-world name of the twitter user who originated the status update (as entered by the user)
userhandle      User handle of the twitter account (screen name)
userlang        Primary language as submitted by the user
userlocation    Location as submitted by the user
Geotag.long     Geographical position, longitude (if submitted)
Geotag.lat      Geographical position, latitude (if submitted)
date            Timestamp (creation of status update)
hashtags        Concatenated list of hashtags (split by semicolon ;)
language        Language code, as identified by the language identifier

Table 4: Stored Twitter metadata


6.4 Analysis

The analysis module is a library which implements the sentiment analysis functionality of DOX. It provides an infrastructure, the pipelines, to define analysis processes, and implementations to pre-process collected documents, to classify their sentiment orientation, to provide summaries of documents, and to detect similar opinions on the basis of clustering approaches. It utilizes the Apache Mahout library (Section 5.3.3) to provide implementations of common classification and clustering algorithms, and the Apache OpenNLP61 library to provide proven and accepted implementations of NLP algorithms (Boiy, Hens, Deschacht, & Moens, 2007; Feinerer, Hornik, & Meyer, 2008; Johnson et al., n.d.; Lu et al., 2009; Nagar & Hahsler, 2012; Sarawagi, 2008).

6.4.1 Pipelines

The Pipelines allow the assembly of analysis processes at run-time, while at the same time creating an abstraction layer between the analysis process and its parallel implementation. The Pipelines share two properties with their real-life counterparts: (a) they possess directionality, the flow being from input to output; and (b) they can be assembled from parts, in this case PipelineStages. Unlike their counterparts, however, the input can be transformed while flowing through the Pipeline. For example, an analysis Pipeline to determine the sentiment orientation of documents would take collected documents as input and return a file which contains the sentiment orientation for every input document. As described in chapters 2 and 3, the typical analysis process consists of a number of subordinate process steps. Typically, documents are split into smaller parts (e.g. sentences), tokenized, and vectorized; in between, unwanted content may be filtered out. Then, after the necessary pre-processing, the records are fed to a classifier (or, in the case of ensembles, many classifiers). This might not sound like a problem, but it is: if a known best analysis process existed, the only necessary task would be to implement that process once, and that would be the end of it. However, as is obvious from chapters 2 and 3, no such ultimate process is known. Therefore it is advisable to implement the constituent parts of the process as atomic units which can be assembled into the necessary processes. In addition, if these processes can be assembled, new process steps can be added to the set of available steps at a later point, making the system extensible. To ensure this, common standards and interfaces are required. The PipelineStages provide these.

61 http://opennlp.apache.org/


PipelineStages

The PipelineStage is an abstract class which represents a process step. In order to be executable within the framework, an implementation must be a subclass of this class. For example, to implement a class TokenizerStage which tokenizes documents, the class would have to extend the abstract class PipelineStage. PipelineStages can be configured by submitting named parameters in a java.util.Properties object. If no properties are submitted, the class can fall back on default properties. The basic properties of every PipelineStage are listed in Table 5.

Property      Description                                                           Default Value
Input.Path    A comma-separated list of input paths, either directories or files.   Empty
Output.Path   The path where the output will be stored.                             Empty
Scopes        The execution scopes of the pipeline as a comma-separated list.       all

Table 5: Basic properties of PipelineStage

The properties Input.Path and Output.Path are the locations for input and output files. The basic assumption of the framework is that Hadoop SequenceFiles are read as input by a PipelineStage and output is written in the same format. However, this is an implicit rather than a strict assumption, and implementations can break with it (producing null outputs, multiple outputs, requiring additional input, using different file types, etc.), which may cause failures at runtime from a wrongly configured Pipeline. The property Scopes contains a list of named scopes which allows users to bypass certain processing steps in a configured PipelineStage by setting a scope for the current run using the setCurrentScope(String scope) method. If the current scope is in the set of scopes for the PipelineStage, the inScope() method will return true, otherwise false. Typically in DOX, PipelineStages act as controllers for operations executed on a Hadoop cluster, either performing read/write operations on HDFS or acting as drivers for one or many MapReduce jobs. In this case, the implementations must be a subclass


of the abstract class ConfigurableStage, which implements the org.apache.hadoop.conf.Configurable interface. This class implements the interface's property Conf, through which an org.apache.hadoop.conf.Configuration object can be set and held. The Configuration is required to set the correct execution parameters for operations in Hadoop clusters. The configuration is only expected to be non-null at runtime, so that a configured PipelineStage can be run on any Hadoop cluster. Figure 23 shows the class diagram.
Figure 23: PipelineStage and ConfigurableStage (class diagram: PipelineStage holds a Properties object; ConfigurableStage extends PipelineStage and implements org.apache.hadoop.conf.Configurable)

In the following, PipelineStage will refer to the abstract class and its implementations, while Pipeline will refer to a configured assembly of one or many PipelineStages. This is only for the benefit of the reader, to distinguish semantically between the implementation and a configured analysis process: there is no implemented class Pipeline, and a configured analysis process, i.e. an assembled collection of PipelineStages, will also be a PipelineStage (see Control Flow and MultiStagePipeline below).

Execution and Training

The PipelineStages must be able to run in two modes: in production use in execution mode, to convert production input data (collected documents) into production output (analysis results); and in training mode, to train trainable stages, e.g. ML classifiers, on training data. Thus, from the requirements determined in section 5.4.2 and based on the generalized interfaces of Table 2 (p. 74), the interfaces ExecutableStage and TrainableStage have been developed (Figure 24). The interface ExecutableStage must be implemented by all PipelineStages which should be able to be executed in execution mode. The interface contains the method execute(), which will be called to run this stage's execution process, as well as in


training mode if the PipelineStage does not implement the TrainableStage interface. The TrainableStage interface must be implemented by all PipelineStages which are trainable. The interface contains the method train(), which will be called to run the training procedure for the TrainableStage. Typically, a PipelineStage which is trainable will implement both interfaces, ExecutableStage and TrainableStage, and will therefore be able to execute the appropriate process depending on the mode of the pipeline.
Figure 24: ExecutableStage and TrainableStage (class diagram: implementations extend ConfigurableStage and implement the ExecutableStage interface, the TrainableStage interface, or both)

TrainableStages will typically expect the property Training.Input.Path to point to the appropriate training data files. If the train() method is called but the property is not set, the stage will fall back to the property Input.Path. This fallback allows training data to be loaded and processed earlier in the pipe and then passed down through it, using the Input.Path property to describe where the processed data from the previous step has been stored.
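As an illustration of the contract, a hypothetical skeleton of a stage implementing both interfaces might look as follows; the two run*Job helpers and the use of a Model.Path property are assumptions standing in for the actual MapReduce drivers:

// Hypothetical skeleton of a trainable, executable PipelineStage.
public class ExampleClassifierStage extends ConfigurableStage
        implements ExecutableStage, TrainableStage {

    @Override
    public void execute() {
        if (!inScope()) return;                        // honour the Scopes property
        runClassificationJob(
                getProperties().getProperty("Input.Path"),
                getProperties().getProperty("Output.Path"));
    }

    @Override
    public void train() {
        // Fall back to Input.Path if no dedicated training input is configured.
        String training = getProperties().getProperty("Training.Input.Path",
                getProperties().getProperty("Input.Path"));
        runTrainingJob(training, getProperties().getProperty("Model.Path"));
    }

    private void runClassificationJob(String in, String out) { /* MapReduce driver, omitted */ }
    private void runTrainingJob(String in, String modelPath) { /* MapReduce driver, omitted */ }
}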


Property              Description                                                            Default Value
Training.Input.Path   Path to training data. A comma-separated list of input paths,         Empty
                      either directories or files.
Input.Path            Fall-back, if Training.Input.Path is not set.                          Empty

Table 6: Common properties of TrainableStages

If no current scope has been set, the system will set the current scope to execution when the PipelineStage is run in execution mode and to train otherwise. This allows steps in the configured Pipeline to be bypassed which are only relevant during training or execution (e.g. if training data are a collection of labelled sentences, while production data are collections of documents, a SentenceSplitterStage will be scoped as execution and not be run during training).

Control Flow and MultiStagePipeline

The execution of a process composed of multiple PipelineStages requires a collection of PipelineStages and a definition of the process control flow through those PipelineStages. A PipelineStage which contains a collection of Pipelines must implement the MultiStagePipeline interface (Figure 25). An implementation of the MultiStagePipeline interface will have a property Stages, a java.util.Collection of PipelineStages, accessible through getter and setter methods. Executable and trainable MultiStagePipelines must also implement the ExecutableStage and TrainableStage interface, respectively.


Figure 25: MultiStagePipeline (class diagram: SequentialStage and ParallelStage extend ConfigurableStage and implement MultiStagePipeline, ExecutableStage, and TrainableStage)

Two types of control flows have been implemented: sequential execution of PipelineStages and parallel execution. This refers only to the logical execution of the process as defined by a configured Pipeline, and must not be confused with the parallel execution across a Hadoop cluster (which may occur during the run of a PipelineStage). The SequentialExecutionStage enables the definition of a sequential process flow in which the PipelineStages in the Stages property are called in sequential order, i.e. the execute() or train() method (depending on mode and class) is called for a PipelineStage only after the preceding PipelineStage has returned. Within the PipelineStage, the Collection of PipelineStages is a (java.util.)List, so that the PipelineStages are in a linear order. Additionally, during runtime the SequentialExecutionStage will change the Input.Path and Output.Path properties of the contained PipelineStages so that the Output.Path of the preceding PipelineStage becomes the Input.Path of the following PipelineStage and processing results are passed down the Pipeline. The ParallelExecutionStage enables the definition of parallel process flows. The action methods of the PipelineStages contained in the Stages property are called without regard to order and run in separate parallel threads. The ParallelExecutionStage represents a fork or split in the sense that the contained PipelineStages can be run independently of each other. The Input.Path of all contained PipelineStages is set to the Input.Path of the


ParallelExecutionStage's Input.Path property, while the output of the contained PipelineStages will be stored in enumerated directories in the path set in the ParallelExecutionStage's Output.Path property.

Assembly and PipelineML

In order to store, edit, and exchange Pipeline configurations, PipelineStages can be serialized to and de-serialized from XML. The resulting XML must conform to the XML schema PipelineML (Appendix 8D PipelineML, p. 137). In order to provide this capability, PipelineStages must implement the Assembleable interface (Figure 26). The method createAssemblyConfig returns an XML document which can be written to file. The method is implemented in the PipelineStage class and can be overwritten by subclasses if necessary. PipelineML documents can be read using the loadConfig method of the AssembleConfig class, which will read and parse the XML. The Pipeline can then be assembled using the Assembler class's assemblePipeline method, which will instantiate the appropriate PipelineStage class and configure it. A usage sketch follows below.
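A usage sketch of the assembly, assuming the signatures shown in Figure 26 (the file name is illustrative):

import java.io.File;
import org.apache.hadoop.conf.Configuration;

class AssemblyExample {
    public static void main(String[] args) throws Exception {
        // Load a PipelineML document, assemble the pipeline, and run it in execution mode.
        AssembleConfig config = new AssembleConfig(new File("sentiment-pipeline.xml"));
        PipelineStage pipeline = new Assembler().assemblePipeline(config, new Configuration());
        if (pipeline instanceof ExecutableStage) {
            ((ExecutableStage) pipeline).execute();
        }
    }
}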
Figure 26: Pipeline assembly (class diagram: the Assembleable interface, the Assembler and AssembleConfig classes, and PipelineStage with its Properties)

6.4.2 Sentiment Analysis

To establish in DOX the capability to analyse the sentiment contained in opinions, process primitives as outlined in chapter 3 were implemented. The process primitives, semantically the steps of an analysis process, are all subclasses of the PipelineStage class and can be assembled into an analysis process.


The implemented classes can be categorized into the generic types of process primitives established previously in chapter 3 and illustrated in Figure 27. Transformation and Filter steps have been implemented as Preprocessing steps, and Sentiment Classifiers as Extraction steps.

Figure 27: Generic Types (hierarchy: Process Step splits into Preprocessing, with Transformation steps (Tokenizer, Vectorizer, Converter), and Extraction, with the Sentiment Classifier)

6.4.3 Preprocessing

Transformation

Tokenizer

Tokenizers split input text into parts. Three tokenizers have been implemented: a sentence splitter and two word tokenizers. The sentence splitter (SentenceSplitterStage) uses OpenNLP's Sentence Detector (Apache OpenNLP Development Community, n.d.) to divide input text into sentences. The Sentence Detector, instantiated with a classification model for the English language, is wrapped in a Mapper class which applies it to every input record and emits every sentence as a new output record. Therefore an input record with the value "I like bacon and eggs. Don't know why, just do." will result in two output records. The output key of each record will be the same as the input key, typically the id of the submitted document. Word tokenizers split text input into words and emit the tokens of each record as a collection of strings in the form of a StringTuple62. Two word tokenizers have been implemented, one using OpenNLP's Tokenizer (StandardTokenizerStage) and the other using Mahout's DocumentProcessor (MahoutTokenizerStage). OpenNLP's Tokenizer is also instantiated with a classification model for the English
62 org.apache.mahout.common.StringTuple


language and, wrapped in a Mapper class, is used in a similar fashion as the Sentence Detector described above. Mahout's DocumentProcessor is in itself a driver for a MapReduce job, and the MahoutTokenizerStage only configures it appropriately before running it. This tokenizer has the advantage that an Analyzer63 class can be specified in the properties (by its fully qualified class name) to direct the tokenizing process.

Vectorizer

Vectorizers create vector representations of the input tokens. Input records' values must be of type StringTuple (a collection of tokens). The output will be VectorWritables64 which contain the feature vectors65. Two vectorizer types have been implemented: dictionary vectorizers and feature hashing vectorizers. Dictionary vectorizers require two passes through the data to create vectors (Owen et al., 2012): the first to create the dictionary, a set of <String,Integer> mappings which map terms to positions in the vector, and the second pass to create the vectors. In a 2-phase training/execution model the term dictionary is created from the training data in the training phase, therefore requiring only one pass in the execution phase. The StandardDictionaryVectorizerStage is a dictionary vectorizer which creates binary presence vectors from StringTuples. It creates the dictionary by emitting all tokens as keys during the map phase and then simply counting the keys in the reduce phase, emitting an output record for each key as a <term,count> pair which forms the dictionary. The count is the position of the vector in which the presence/absence of the term will be stored. The MahoutDictionaryVectorizerStage uses Mahout's DictionaryVectorizer to create term frequency (TF) vectors.

The FeatureHashingValueStage creates vectors which encode the token features by hashing. A mapper wraps the vectorization and vectorizes one record per call of the map function. The hash function maps each token to multiple positions of the output vector. A StringTuple is encoded by adding the hash values of every token to the output vector, thus creating a hash for the StringTuple. The encoder used is

63 org.apache.lucene.analysis.Analyzer
64 org.apache.mahout.math.VectorWritable
65 org.apache.mahout.math.Vector


Mahout's unweighted StaticWordValueEncoder66 (Owen et al., 2012), which uses the MurmurHash67 hash function.

Converter

Vector converters convert one vector representation into another. Two have been implemented: the MahoutBinaryConverterStage converts TF or TFIDF vectors into binary presence vectors by replacing every non-zero value by 1; the MahoutTfidfConverterStage converts TF vectors into TFIDF vectors using Mahout's TFIDFConverter. Converters can also add additional information to the input, for example explicitly adding negation or POS information to words. The HeuristicNegationStage uses a heuristic to add negation annotations to input text. It determines negated sequences by finding a word from a character sequence list indicating negation ("not ", "n't"), then finds the shortest sequence which ends with a character sequence indicating the end of the line or a change of meaning (",", ".", "!", "?", ";", "but"). If a double negative is found within that sequence, i.e. another character sequence from the first list, no action is performed. Otherwise the prefix NOT_ is affixed to every whitespace-separated word within the sequence boundaries, so that, plausibly, "do not like the movie, but ..." would become "do not NOT_like NOT_the NOT_movie, but ..." (assuming the negation trigger itself is excluded). The negator then searches the remainder of the input text, from the last negation sequence's boundary onwards, to find the next negation sequence. The HeuristicNegationStage will emit an annotated text record for every input record.

6.4.4 Sentiment Classification

Sentiment classifiers classify text input into two categories: positive or negative sentiment. Output records have the form <Text:DocumentId, Text:CategoryLabel>. Sentiment classified as positive is labelled Positive, while sentiment classified as negative is labelled Negative. Two classifiers have been implemented: a Naïve Bayes classifier (NaiveBayesClassifierStage) and an ensemble of Logistic Regression classifiers which uses Stochastic Gradient Descent for training. Both use the classifier algorithms provided by Mahout.

66 org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder
67 https://sites.google.com/site/murmurhash/


The implemented classifiers implement the ExecutableStage and TrainableStage interfaces and therefore operate in two phases: training and execution (production use).

Training

A call of the train() method will first start a MapReduce job which randomly partitions the training data into a training and a test set. The "Test.Percentage" property, which accepts values between 0 and 1, defines how much of the original training data will be set aside for testing, per default 10% (value 0.1). The records are split using three MapReduce jobs: the first assigns a random number between 0 and 1 as key; the second and third read the output file and write the records to file if the key value is larger or smaller, respectively, than the value given as the Test.Percentage property, thus partitioning the training data into training and test sets. After this, the classifier is trained using the training set. The resulting model will be serialized to the file system as defined in the "Model.Path" property of the pipeline. After training, the classifier is tested using the test set. The test result is written to the path defined in the "Test.Result.Path" property. Storage of the test result is necessary to assess the quality of the classifier at a later point without having to run the test again.

Execution

The design of the classifiers in the execution phase is similar. The classification process is a simple MapReduce job: the Mapper class wraps the classifier, which is instantiated and configured by loading a serialized model from the HDFS file system. This design allows more than one model to be trained, and the model used can be exchanged if necessary. The Mapper expects input records of the form <Text: DocumentId, VectorWritable: Feature vector>. The classifier is applied to each input vector and the result is emitted as a pair <Text:DocumentId, Text:CategoryLabel>. In the reduce phase the labels for each key are counted and the mode value is emitted, i.e. the label which has the highest count. Through this mechanism a document can be classified by classifying each sentence individually (all sentences share the document's id as key) and afterwards assigning the document to the category to which most of its sentences belong.


Classifiers

NaiveBayesClassifierStage

The NaiveBayesClassifierStage uses the parallel version of Mahout's Naïve Bayes classifier. The classifier is trained by running the TrainingNaiveBayesJob driver, which starts the MapReduce training job.

SgdClassifierStage

The SgdClassifierStage implements a Logistic Regression (LR) classifier trained by the Stochastic Gradient Descent (SGD) method. As of the time of writing, the current release version of Mahout (0.7) does not implement a parallel version of the algorithm (Drost & Ingersoll, 2011; Owen et al., 2012). This, however, allows the demonstration of the ensemble technique for parallelizing classification with sequential classification algorithms: a number of classifiers are trained simultaneously, and a large training set is divided (approximately) equally among them. In execution, all classifiers are applied independently to the data and the final classification result is the combined result of the individual results as calculated by an aggregation function, typically majority voting. The ensemble training is implemented as a MapReduce job: the Mapper randomly shuffles the records of the training set and distributes them to the Reducers, which each train one classifier and thus generate one model. During the map phase, a random integer from the interval [0, n) is assigned to each record as a key, where n is the number of models being generated (i.e. the number of reducers). In the reduce phase each of the Reducers will instantiate one AdaptiveLogisticRegression68 learner (a sketch of this reducer is shown below). Mahout's AdaptiveLogisticRegression trains a number of Logistic Regression models simultaneously, using an evolutionary algorithm to set the learning parameters (such as learning rates, regularization parameters, and annealing schedules) automatically (Drost & Ingersoll, 2011), removing the need to identify the correct settings manually. After training, each reducer will write one output file with one record. The record will contain the reducer's id (from [0, n)) as the key and the serialized model of the most accurate classifier as its value. The type of the value will be CrossFoldLearner69. The output will be written to HDFS as set in the SgdClassifierStage's "Model.Path" property.
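The reduce side of this training job might be sketched as follows; the feature-vector size and the label encoding (label stored in element 0 of each vector) are assumptions, and the serialization of the best CrossFoldLearner is omitted:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// One reducer per ensemble member: each trains one AdaptiveLogisticRegression learner
// on the records of its random key partition and emits one model.
class SgdTrainerReducer
        extends Reducer<IntWritable, VectorWritable, NullWritable, NullWritable> {

    private static final int NUM_FEATURES = 10000;     // assumed feature-vector size

    @Override
    protected void reduce(IntWritable partition, Iterable<VectorWritable> vectors, Context ctx)
            throws IOException, InterruptedException {
        AdaptiveLogisticRegression learner =
                new AdaptiveLogisticRegression(2, NUM_FEATURES, new L1()); // 2 categories
        for (VectorWritable vw : vectors) {
            Vector v = vw.get();
            learner.train((int) v.get(0), v);          // assumed label encoding
        }
        learner.close();
        // Serialize the most accurate CrossFoldLearner (learner.getBest()) to Model.Path here.
    }
}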

68 org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression
69 org.apache.mahout.classifier.sgd.CrossFoldLearner


In execution, the SgdClassifierStage will load all serialized models from the "Model.Path" in each Mapper. Each input record is classified by each classifier independently, and the Mapper emits a single record for each result as a <Text: Id, Text: Category label> pair. In the reduce phase, the category labels for each id are counted and the mode value (the label with the highest count) is emitted as the final classification result for the key, thus implementing majority voting.

6.4.5 Aggregation

The analysed documents are structured in a strict hierarchy, which allows the user to analyse aggregated results. The hierarchy is shown in Figure 28.
Figure 28: Aggregation Hierarchy (levels: all → Source-Type → Group → Source → Document; for web documents: http → host → page; for twitter: hashtag → user → tweet)

The results of the basic sentiment orientation analysis are aggregated by summation along that hierarchy using a MapReduce job. As a result, for every job and segment two files are produced: (a) aggLevels, containing the elements of the aggregation hierarchy; and (b) aggData, containing for every aggregation level the number of positive and negative documents.

6.4.6 Document Summaries

The DocumentSummarizer creates summaries of documents by sentence extraction. It creates a summary with a maximum length of n sentences by clustering: the sentences of each document are clustered into n clusters using Mahout's k-Means clustering algorithm. Clustering is appropriate because the algorithm will cluster similar sentences into the same cluster while attempting to maximize the dissimilarity between the clusters. In other words, sentences which express the same content will be lumped together, while the clusters themselves will contain sentences which are different from the sentences contained in other clusters. For


each cluster a representative sentence is selected, which is the sentence closest (as measured by the distance measure) to the cluster's centroid. The selected sentences are then concatenated to create full-text summaries of the documents. The implementation works in the following fashion, illustrated in Figure 29: the Summarizer expects two types of input documents, one containing the original sentences and another containing vector representations of the sentences. Vectors must be a subclass of the NamedVector70 class and carry the unique sentence id as their name. This will typically be a TFIDF vector, as TFIDF encoding will give topic words, which are assumed to occur less often, more weight (Owen et al., 2012). The keys for a sentence and its vector representation must be identical, while being unique. The format of the key must comply with the text representation of the DocumentSentenceKey class. The DocumentSentenceKey class is a Writable class which stores three kinds of data: the id of the document (as a String), the number (position) of the sentence in the document (as an Integer), and whether or not the sentence number should be taken into consideration when comparing instances of the class (as a Boolean). The latter property can be used to shuffle the sentences of one document to a reducer without losing the information on the sentence position. The DocumentSentenceKey class can be parsed to and from the Text class, which allows its text representation to be used as a key and to comply with the normal process flow. In a first MapReduce job, all vector representations of sentences belonging to a single document are written to a single file. One file is created for each document. This is necessary because the clustering algorithm will cluster the sentences of each individual document. In the map phase of this first step, all records are read and the key is parsed as a DocumentSentenceKey. The records are then emitted using only the document id as key. In the reduce phase, all vectors belonging to a document are written to an individual file. For all documents containing n or fewer vectors, clustering is bypassed and the sentence ids are written to file for later lookup. For each document containing more than n sentences, the following steps are executed: first, n random centroids are created in the vector space defined by the document's vectors using Mahout's RandomSeedGenerator71. These are the seed for the clustering algorithm. Then the sentence vectors are clustered by k-Means clustering72

The outputs of the clustering are two files, one containing the calculated clusters and their centroids, and another containing the clustered points (i.e. the clustered vectors). Clusters have an id, which is the key of the records in the clustered-points file. In the next step, both files are joined on the cluster ids, and the distance of each clustered vector to its cluster's centroid is measured to find the most representative vector, i.e. the one with the smallest distance from the centroid. The output of this step is a file which lists the ids of all representative sentences (the same format as if the clustering had been bypassed). All files containing the representative ids are then processed as a batch and joined with the files containing the original sentences on the sentence ids (the key of both files). In a final step, all representative sentences of a single document are concatenated to form the summary of that document. The final output file contains a record for each processed document; the key of the record is the id of the summarized document and the value is the summary.
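A minimal sketch of how a key class like DocumentSentenceKey can be implemented; the field names and the omitted Text (de)serialization are assumptions based on the description above:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class DocumentSentenceKey implements WritableComparable<DocumentSentenceKey> {
        private String documentId;             // id of the document
        private int sentenceNumber;            // position of the sentence in the document
        private boolean compareSentenceNumber; // ignore the position when false

        public void write(DataOutput out) throws IOException {
            out.writeUTF(documentId);
            out.writeInt(sentenceNumber);
            out.writeBoolean(compareSentenceNumber);
        }

        public void readFields(DataInput in) throws IOException {
            documentId = in.readUTF();
            sentenceNumber = in.readInt();
            compareSentenceNumber = in.readBoolean();
        }

        public int compareTo(DocumentSentenceKey other) {
            int byDocument = documentId.compareTo(other.documentId);
            if (byDocument != 0 || !compareSentenceNumber) {
                // Grouping by document only: all sentences of one document
                // reach the same reducer, the position is still carried along.
                return byDocument;
            }
            return sentenceNumber - other.sentenceNumber; // positions are small
        }
        // Parsing to/from the Text representation (e.g. "docId:sentenceNo:flag")
        // is omitted here.
    }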


Figure 29: Document Summarizer. (Flowchart: the inputs <Text: DocumentSentenceId, Text: Sentence> and <Text: DocumentSentenceId, VectorWritable: TF-IDF vector> are split into separate jobs by document; for each document with more than n sentences, n random centroids are generated and the sentences are clustered, otherwise clustering is bypassed; representative sentences are found and looked up, and the document summary is emitted as <Text: DocumentId, Text: Summary>.)
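The "find representative sentences" step of Figure 29 boils down to a nearest-to-centroid search with the Cosine Distance measure. A sketch, assuming the clustered vectors of one cluster are available in memory (the actual job performs this as a join over the Mahout output files):

    import java.util.List;
    import org.apache.mahout.common.distance.CosineDistanceMeasure;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.Vector;

    public class RepresentativeSentenceSelector {
        private final CosineDistanceMeasure measure = new CosineDistanceMeasure();

        /** Returns the id of the sentence vector closest to the cluster centroid. */
        public String selectRepresentative(Vector centroid, List<NamedVector> clusteredPoints) {
            String bestId = null;
            double bestDistance = Double.MAX_VALUE;
            for (NamedVector sentence : clusteredPoints) {
                double distance = measure.distance(centroid, sentence);
                if (distance < bestDistance) {
                    bestDistance = distance;
                    bestId = sentence.getName(); // the unique sentence id
                }
            }
            return bestId;
        }
    }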

6.4.7 Opinion Detection

The OpinionDetector is an attempt to identify opinions based on semantic similarity. It is based on the notion that documents which express similar opinions should exhibit semantically similar content. The OpinionDetector uses a two-step clustering process which first clusters sentences on semantic similarity, and then documents based on their sentence signatures.


The sentence signature of a document is a binary presence vector over the reference sentences identified in the first clustering step.

The concept is inspired by the idea of Akcora et al. (2010), described in section 4.2.3. The authors identified breakpoints in Twitter streams by detecting changes in the emotion and semantic patterns; the latter were identified by measuring the Jaccard distance between the union token sets of tweets taken in two adjacent time frames. This demonstrated that semantic similarity can contribute to the identification of dissimilar opinions. If this is possible, however, it should also be possible to divide documents along these dissimilarities into groups representing different opinions; hence the idea of using clustering mechanisms with appropriate similarity measures to identify semantically similar opinions. A methodology to identify ex-post breakpoints in news stories, which shares some similarity with the problem at hand, has been presented in (P. Hu et al., 2011): the authors use a Hidden Markov Model to identify variance in the stream of news articles, modelling themes with generative probabilistic mixture models. Although this methodology should fulfil a similar function, it exhibits the major drawback of having to identify the number of themes in advance; translated to the problem at hand, its application would require identifying the number of opinions in advance. Using a clustering algorithm instead of probabilistic models can be seen as a more data-driven approach. And although the k-Means clustering algorithm exhibits the same fault at first glance, namely the requirement of fixing the number of clusters, k, ex-ante, heuristic canopy generation is a practical solution to this problem: generating clusters ex-ante with an inexpensive canopy generation algorithm and a sensible distance measure, and feeding them to the k-Means cluster generator as input, enables the identification of an unknown number of opinions in a document collection.

The process consists of two steps: it first clusters the sentences of all documents, and then the documents based on their sentence signatures. As sentences can be considered semantically coherent units (P. Hu et al., 2011; Pang & Lee, 2008), i.e. they should be coherent in topic and sentiment orientation, they can be seen as atomic opinions. In this view, a document consists of a number of atomic opinions. To identify similar atomic opinions across a document collection, the sentences are clustered based on their similarity; the distance measure used is the Cosine Distance measure applied to the tokenized sentences. The OpinionDetector uses Mahout's CanopyDriver (org.apache.mahout.clustering.canopy.CanopyDriver) to estimate the appropriate number of sentence clusters and then clusters the sentences using Mahout's k-Means implementation (org.apache.mahout.clustering.kmeans.KMeansDriver).

An intermediate step then creates signature vectors for each document. Signature vectors are binary presence vectors of dimension m, m being the number of sentence clusters generated in the preceding step. A sentence s is encoded in a signature vector by setting the value at position c(s) to 1, where c(s) is the id of the cluster to which sentence s belongs. The signature vector sig(d) of a document d combines the signatures sig(s) of all sentences s of d, so that position c is 1 if and only if at least one sentence of d belongs to cluster c. As documents are thus made up of the same building blocks, they become comparable to each other.

In the second step, the signature vectors of the documents in the collection are clustered. Again, the number of clusters is first estimated by canopy generation; then the signatures are clustered using the Cosine Distance measure. The output of the clustering process is a set of opinion-document relationships, which can be interpreted as "document expresses opinion" relationships.
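A sketch of the signature-vector construction, assuming the ids of the sentence clusters hit by a document have already been collected (class and method names are illustrative):

    import java.util.Set;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class SignatureVectorBuilder {
        /**
         * Builds the binary presence vector of a document: position c is 1
         * if at least one of the document's sentences fell into sentence cluster c.
         */
        public Vector buildSignature(Set<Integer> sentenceClusterIds, int numClusters) {
            Vector signature = new RandomAccessSparseVector(numClusters);
            for (int clusterId : sentenceClusterIds) {
                signature.set(clusterId, 1.0);
            }
            return signature;
        }
    }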

6.5 Presentation

The user interface has been implemented as a web application using Java Enterprise Edition (JEE). It supports the user in the following tasks:
- System administration
- Creation/editing of tracking jobs
- Exploration of analysis results
- Creation/editing of pipelines

The web application loads the results of the analysis and stores the data in its internal database, thus allowing faster access to them.

6.5.1 Pipeline Editor

The Pipeline Editor allows the user to assemble and configure analysis pipelines graphically by dragging and dropping PipelineStages from the Toolbox into the editor. By clicking on a PipelineStage in the editor, the user can configure its properties (Figure 30).


Figure 30: Pipeline Editor

6.5.2 Dashboard

The Dashboard allows the user to explore the results of the analysis (Figure 31). For every job, the Dashboard shows the overall development of sentiment in the line chart at the top. The user can select a snapshot in the UI element below. The graph visualization shows the currently selected aggregation level and its children (sub-level); the nodes are colored according to their aggregated sentiment orientation. The color coding is purely green if the node is completely positive, purely red if it is completely negative, and a shade in between if the sentiment distribution is mixed. The user can click on nodes to drill deeper through the hierarchy.
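For illustration, the color coding just described can be computed by linear interpolation between red and green; this is a sketch, not the actual UI code:

    import java.awt.Color;

    public class SentimentColor {
        /** Pure green for all-positive nodes, pure red for all-negative ones. */
        public static Color forCounts(long positive, long negative) {
            double positiveShare = (positive + negative) == 0
                    ? 0.5 : (double) positive / (positive + negative);
            int green = (int) Math.round(255 * positiveShare);
            return new Color(255 - green, green, 0);
        }
    }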

Figure 31: Dashboard


7 Evaluation

The system has been evaluated in line with the standards set out in section 2.5. To evaluate the accuracy and robustness of the classifiers, quality assessment tests have been carried out on test data sets; to evaluate speed and scalability, efficiency tests have been carried out.

7.1 Quality Evaluation

In order to assess the quality of the sentiment classification components and the OpinionDetector component, they have been trained and/or executed on labelled data sets. The accuracy measure is used to report the quality of the results.

7.1.1 Sentiment Classification

Pipelines

Two analysis pipelines have been configured: one uses the NaiveBayesClassifierStage (Naïve Bayes classifier) as its sentiment orientation classifier, the other the SgdClassifierStage (Logistic Regression classifier). Both pipelines use the same preprocessing steps to allow for a comparison between the classifiers: the training data sets are loaded and split into sentences (only in the execution phase, as the training sets are already split into sentences), heuristic negation is applied (configurable), and the DictionaryVectorizerStage uses the OpenNLP tokenizer to tokenize the sentences and vectorizes them into unigram presence vectors. No feature selection has been applied.

Training sets

Training sets contain labeled sentences; training instances are assigned the label Positive if the instance has a positive sentiment orientation and Negative if it has a negative sentiment orientation. The classifiers also use the binary classification scheme {Positive, Negative}. Three training sets have been compiled: The first data set contains instances derived from the MPQA data set (http://www.cs.pitt.edu/mpqa), which contains news articles (Wilson et al., 2009). The original documents contain annotations for sentence subsections. The annotations have been translated into the ternary scheme {Positive, Negative, Objective}, which translates into real-valued vectors; the sum of the vector values must be either 1 or 0. The translation has been done using an automated tool with a translation file, which can be found in the appendix (Table 13, p. 138).

The training data set created contains only records which have an objective value of less than 0.5. This data set contains about 6,000 instances, equally distributed across the classes Positive and Negative. The second data set contains Twitter status updates and has been compiled using the emoticon trick: tweets containing the emoticons { ":)", ":-)", ";)", ";-)", ": )", ":D", "=)" } have been labeled Positive, tweets containing the emoticons { ":(", ":-(", ": (" } have been labeled Negative. The messages have been sanitized by removing the emoticons (to prevent target leak), all user mentions (@Username), all URLs, line breaks, and unnecessary whitespace. Over the course of a day, a set of about 500,000 tweets has been collected. However, the distribution of tweets was heavily skewed towards the positive class, so a training set of 160,000 tweets has been constructed in which the classes are roughly equally distributed. The third training set is a combination of both, containing 6,000 instances from the MPQA training set and 6,000 instances from the Twitter training set, 12,000 instances in total. Again, the number of Negative and Positive instances has been balanced so that both classes are equally large.

Evaluation

Both pipelines have been tested six times, twice with each training set: once without heuristic negation and once with it. The pipelines have been trained on 90% of the instances of each set, while 10% have been withheld for testing (random sampling with a 90/10 split). The results of the runs are reported in Table 7 below (NB = Naïve Bayes; SGD = Logistic Regression, trained by SGD; +Neg = with heuristic negation; -Neg = without heuristic negation).

Data Set      NB -Neg   NB +Neg   SGD -Neg   SGD +Neg
MPQA          .756      .750      .704       .71
Twitter       .757      .769      .685       .72
Combination   .899      .911      .727       .897

Table 7: Accuracy results of the evaluation runs
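As an aside to the construction of the Twitter training set described above, the emoticon-based labelling and sanitization could look like the following sketch (class name and regular expressions are illustrative):

    import java.util.Arrays;
    import java.util.List;

    public class TweetLabeler {
        private static final List<String> POSITIVE =
                Arrays.asList(":)", ":-)", ";)", ";-)", ": )", ":D", "=)");
        private static final List<String> NEGATIVE =
                Arrays.asList(":(", ":-(", ": (");

        /** Returns "Positive", "Negative", or null if no emoticon is found. */
        public static String label(String tweet) {
            for (String emoticon : POSITIVE) {
                if (tweet.contains(emoticon)) return "Positive";
            }
            for (String emoticon : NEGATIVE) {
                if (tweet.contains(emoticon)) return "Negative";
            }
            return null; // tweet is discarded
        }

        /** Removes emoticons (target leak), user mentions, URLs, and extra whitespace. */
        public static String sanitize(String tweet) {
            String clean = tweet;
            for (String emoticon : POSITIVE) clean = clean.replace(emoticon, " ");
            for (String emoticon : NEGATIVE) clean = clean.replace(emoticon, " ");
            clean = clean.replaceAll("@\\w+", " ");         // user mentions
            clean = clean.replaceAll("https?://\\S+", " "); // URLs
            return clean.replaceAll("\\s+", " ").trim();    // line breaks, whitespace
        }
    }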


Surprisingly, the Naïve Bayes classifier performs better than the Logistic Regression across all test sets. The accuracy of the Naïve Bayes classifier is in the range of .76 ± .01 for the MPQA and Twitter data sets, while the Logistic Regression achieves a best accuracy of .71 and .72 for these data sets, respectively. A surprising result is the increased accuracy on the combined test set: the Naïve Bayes classifier achieves an accuracy of .911, the Logistic Regression .897. Intuition would have suggested a drop in accuracy for the combined test set, stemming from increased noise. At this point there is no adequate explanation for the improved performance; this will have to be investigated further. Another notable observation is the visible improvement in accuracy when heuristic negation is applied: in all but one case (Naïve Bayes on the MPQA data set) accuracy improved with negation. Overall, the reported accuracy for the MPQA and Twitter test sets is within expectation for this kind of analysis process.
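For illustration, heuristic negation is commonly implemented by marking all tokens in the scope of a negation word; the following is a sketch of such a prefix-marking heuristic (the negation word list and the punctuation-bounded scope are assumptions, the configurable rules in DOX may differ):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class HeuristicNegation {
        private static final Set<String> NEGATIONS =
                new HashSet<String>(Arrays.asList("not", "no", "never", "n't"));

        /** Marks tokens following a negation word until the next punctuation. */
        public static List<String> apply(List<String> tokens) {
            List<String> result = new ArrayList<String>();
            boolean negated = false;
            for (String token : tokens) {
                if (NEGATIONS.contains(token.toLowerCase())) {
                    negated = true;
                    result.add(token);
                } else if (token.matches("[.,!?;:]")) {
                    negated = false; // punctuation ends the negation scope
                    result.add(token);
                } else {
                    result.add(negated ? "NOT_" + token : token);
                }
            }
            return result;
        }
    }

For example, "the camera is not good ." becomes "the camera is not NOT_good .", so that negated and non-negated occurrences of "good" end up as different unigram features.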

7.1.2 Opinion Detection

The evaluation of the OpinionDetector is difficult because no formalized test sets and test standards have been published. The test set has therefore been constructed from 21 Amazon product reviews for a smartphone product. The test set contains 10 reviews giving a five-star rating and 11 reviews giving a one-star rating. The intuition is that these contrastive reviews are also contrastive in content; the OpinionDetector should therefore place them in different opinion clusters. The results are reported using the accuracy measure, in which an instance is considered correctly classified if it belongs to the majority class of its cluster, and incorrectly otherwise. For example, a five-star review placed in a cluster containing three one-star reviews is considered incorrectly clustered.
Cluster   Documents   1-Star   5-Star   Correct   Incorrect   Accuracy
0         1           1        0        1         0           1
1         4           3        1        3         1           .75
2         10          3        7        7         3           .7
3         1           1        0        1         0           1
4         1           1        0        1         0           1
5         1           1        0        1         0           1
6         1           1        0        1         0           1
7         2           0        2        2         0           1
Total     21          11       10       17        4           .81

Table 8: Accuracy of the OpinionDetector

The OpinionDetector created eight clusters and achieved an accuracy of .81 in distinguishing one-star reviews from five-star reviews on content alone. It is interesting to note that all but one of the five-star reviews were placed in only two clusters, while the negative reviews were spread across seven clusters, five of which contain only a single document. This would suggest that reviewers (of this product) who give a favorable rating will probably give the same reasons for liking the product, while negative reviewers seem to express their dislike individually.
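The majority-vote accuracy underlying Table 8 can be computed as follows (a sketch; the input representation is illustrative):

    import java.util.List;

    public class ClusterAccuracy {
        /**
         * An instance counts as correct if its label matches the majority label
         * of its cluster; counts[0] holds the one-star, counts[1] the five-star
         * reviews of one cluster. Example: cluster 2 of Table 8 (3 one-star,
         * 7 five-star) contributes 7 correct instances out of 10.
         */
        public static double accuracy(List<int[]> clusters) {
            int correct = 0;
            int total = 0;
            for (int[] counts : clusters) {
                correct += Math.max(counts[0], counts[1]);
                total += counts[0] + counts[1];
            }
            return (double) correct / total;
        }
    }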


7.2 Efficiency Evaluation

The processing efficiency and scalability have been tested by training the NB -Neg pipeline introduced in section 7.1.1 on a large data set. The data set has been created by copying the complete Twitter data set ten times, thus creating 5 million training records. The training has been performed three times in different computing environments, each a cluster of Amazon EC2 nodes of type Small Instance (1.7 GB memory, 1 virtual core, 160 GB HDD). The first cluster consisted of one node, the second of two nodes, and the third of four nodes. The time it took the pipeline to train on the data set is reported in Table 9 and illustrated in Figure 32 (cumulative); times are reported in seconds. As can be seen from the data, the preprocessing takes an order of magnitude longer than the actual training on the data set, although only tokenization and vectorization are performed. The data also clearly shows that the addition of nodes reduces the overall running time as expected. It does, however, show diminishing returns in saved time: going from one to two nodes saves 3,716 s in total, while going from two to four nodes saves a further 1,822 s.
Cluster     Nodes   Preprocessing (s)   Training (s)
Cluster 1   1       6183                600
Cluster 2   2       2810                257
Cluster 3   4       1068                177

Table 9: Runtime of pipeline

Figure 32: Scalability (cumulative running time in seconds, from start through preprocessing to completed training, for Clusters 1-3)


8 Concluding Remarks

Computer-supported opinion analysis is a difficult yet important field of research. The analysis of opinions can be a valuable tool in a number of fields, chief among them product design and marketing, political decision making, and social research. Analysis systems can also be important to individuals, for example opinion search engines which can support individual decision making. This thesis contains the results of work on three particular research points:
- Development of a framework for opinion mining applications
- Review of parallelized processing approaches
- Investigation of technological solutions to track opinions

The reference framework has been established and allows the description of opinion analysis processes. It contains the modules document collection, analysis, and presentation, which are connected by a downward-flowing process. The analysis module allows the construction of freely configurable analysis processes, which are split into preprocessing, extraction, and postprocessing phases. The distributed and parallel execution of opinion mining and analysis systems has been reviewed on the basis of the MapReduce parallelization architecture, and design solutions for data processing and classification in opinion mining systems have been presented. Both research efforts have built the foundation for DOX, a system prototype for opinion mining which has been realized as part of this thesis. The system is Java-based and designed to work in a Hadoop environment in a parallel and distributed manner. It builds on a number of open-source libraries such as Nutch, Mahout, and OpenNLP, which provide essential basic capabilities to the system: Nutch is used as the distributed crawler, Mahout as the provider of machine-learning capabilities, and OpenNLP for natural language processing. All subcomponents used are strictly open-source and freely available, thus allowing others to replicate and adapt the system for their purposes. The system realizes the requirement for configurable analysis processes dictated by the framework through a design which allows the construction and configuration of analysis processes, called Pipelines. Pipelines can be designed and reused, and configurations can be serialized to and deserialized from XML documents which adhere to the PipelineML schema. The system is also extensible, allowing new Java classes (and therefore capabilities) to be added with ease.


DOX provides a number of preprocessing steps, two classifiers (Naïve Bayes and Logistic Regression), a cluster-based extractive document summarizer, and a new technique to distinguish opinions semantically. The OpinionDetector, which provides the latter functionality, is a novel way to identify opinions based on content and can be used in conjunction with sentiment orientation classification. It is a major component in the effort to track opinions across multiple sources, in that it allows the identification of similarity based on semantic content. The OpinionDetector uses clustering to identify similar sentences in a collection. The sentence clusters are then treated as a building-block repository for the document collection, and documents are seen as constructs of these representative sentences. The sentences are therefore treated as atomic opinions, and signature vectors of documents (binary presence vectors) encode the presence or absence of those atomic opinions in a document. Documents are finally clustered into opinion groups through signature vector similarity. The system has been evaluated and achieved an accuracy well above chance for all classifiers on the evaluation data sets. The evaluation of the OpinionDetector has shown an accuracy of .81 in distinguishing contrastive reviews; however, a new test methodology has to be developed to validate the optimal number of opinion clusters and to ascertain that opinions are clustered correctly for less contrastive documents. Overall, the system provides the capabilities of a proper opinion analysis tool. Future work on the system will concentrate on the addition of a subjectivity filter extension, and on transformations which enhance the accuracy of the classifiers, such as POS tagging and other natural language processing techniques which could improve the feature space. Additionally, the improvement of the system through the inclusion of other Hadoop-affiliated subcomponents should be evaluated, such as Hive for storage, or ZooKeeper and Oozie for workflow coordination. A great deal of research into the identification of similar opinions across multiple online sources still has to be done. The OpinionDetector is currently the result of an informed ad-hoc approach which still requires a lot of speculative configuration and evaluation. All in all, this thesis demonstrates the potential of large-scale sentiment mining using Hadoop and delivers a system which demonstrates the findings.


References
Abbasi, A., & Chen, H. (2007). Affect Intensity Analysis of Dark Web Forums. 2007 IEEE Proceedings of Intelligence and Security Informatics (pp. 282-288). IEEE. doi:10.1109/ISI.2007.379486
Abbasi, A., Chen, H., & Salem, A. (2008). Sentiment analysis in multiple languages. ACM Transactions on Information Systems, 26(3), 1-34. doi:10.1145/1361684.1361685
Agarwal, A., Chapelle, O., Dudik, M., & Langford, J. (2012). A reliable effective terascale linear learning system. arXiv:1110.4198v2. Retrieved from http://arxiv.org/abs/1110.4198
Agrawal, R., & Rajagopalan, S. (2003). Mining newsgroups using networks arising from social behavior. Proceedings of the 12th international conference on World Wide Web (pp. 529-535). New York, NY, USA: ACM. Retrieved from http://dl.acm.org/citation.cfm?id=775227
Akcora, C. G., Bayir, M. A., Demirbas, M., & Ferhatosmanoglu, H. (2010). Identifying breakpoints in public opinion. Proceedings of the First Workshop on Social Media Analytics - SOMA '10 (pp. 62-66). New York, New York, USA: ACM Press. doi:10.1145/1964858.1964867
Andreevskaia, A., & Bergler, S. (2006). Mining WordNet for Fuzzy Sentiment: Sentiment Tag Extraction from WordNet Glosses. Proceedings of the European Chapter of the Association for Computational Linguistics (EACL) (pp. 209-216).
Apache OpenNLP Development Community. (n.d.). Apache OpenNLP Developer Documentation. opennlp.apache.org. Retrieved November 10, 2012, from http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html
Argamon, S., Koppel, M., & Avneri, G. (1998). Routing documents according to style. First international workshop on innovative information systems. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.52.688&rep=rep1&type=pdf
Attardi, G., & Simi, M. (2006). Blog mining through opinionated words. Proceedings of TREC, 27. Retrieved from http://www.zowijs.net/kennisdelen/wp-content/uploads/tdomf/130/blog opinion mining.pdf
Baccianella, S., Esuli, A., & Sebastiani, F. (2008). SentiWordNet 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining, 2200-2204.
Bae, Y., & Lee, H. (2011). A Sentiment Analysis of Audiences on Twitter: Who Is the Positive or Negative Audience of Popular Twitterers?, 732-739.


Balahur, A., & Steinberger, R. (n.d.). Rethinking Sentiment Analysis in the News: from Theory to Practice and back, 1-12.
Balahur, A., Steinberger, R., Goot, E. Van Der, Pouliquen, B., & Kabadjov, M. (2009). Opinion Mining on Newspaper Quotations. 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, 523-526. doi:10.1109/WI-IAT.2009.340
Banfield, A. (1982). Unspeakable sentences: Narration and representation in the language of fiction (p. 340). Boston, MA, USA: Routledge & Kegan Paul. Retrieved from http://books.google.de/books/about/Unspeakable_Sentences.html?id=qJg9AAAAIAAJ&redir_esc=y
Banko, M., & Brill, E. (1998). Scaling to Very Very Large Corpora for Natural Language Disambiguation, 26-33.
Bautin, M., Ward, C., Patil, A., & Skiena, S. (2010). Access: news and blog analysis for the social sciences. Proceedings of the 19th (pp. 1229-1232). New York City, New York, USA: ACM Press. Retrieved from http://dl.acm.org/citation.cfm?id=1772889
Bermingham, A., Conway, M., McInerney, L., O'Hare, N., & Smeaton, A. F. (2009). Combining Social Network Analysis and Sentiment Analysis to Explore the Potential for Online Radicalisation. 2009 International Conference on Advances in Social Network Analysis and Mining, 231-236. doi:10.1109/ASONAM.2009.31
Bifet, A., & Frank, E. (2010). Sentiment Knowledge Discovery in Twitter Streaming Data, 1-15.
Binali, H., Potdar, V., & Wu, C. (2009). A state of the art opinion mining and its application domains. 2009 IEEE International Conference on Industrial Technology, 1-6. doi:10.1109/ICIT.2009.4939640
Bishop, C. (2006). Pattern recognition and machine learning. New York City, New York, USA: Springer Science+Business Media LLC. Retrieved from http://www.library.wisc.edu/selectedtocs/bg0137.pdf
Bodendorf, F., & Kaiser, C. (2009). Detecting opinion leaders and trends in online social networks. Proceeding of the 2nd ACM workshop on Social web search and mining - SWSM '09, 65. doi:10.1145/1651437.1651448
Boiy, E., Hens, P., Deschacht, K., & Moens, M. (2007). Automatic Sentiment Analysis in On-line Text, (June).
Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. Compstat. Retrieved from http://leon.bottou.org/publications/pdf/compstat2010.pdf


Brants, T., Popat, A. C., & Och, F. J. (2007). Large Language Models in Machine Translation, 1(June), 858-867.
Cabral, L., & Hortacsu, A. (2004). The dynamics of seller reputation: Theory and evidence from eBay. Cambridge, MA, USA. Retrieved from http://www.nber.org/papers/w10363
Cai, K., Bao, S., Yang, Z., Tang, J., Ma, R., Zhang, L., & Su, Z. (2011). OOLAM: an opinion oriented link analysis model for influence persona discovery. Proceedings of the fourth ACM international conference on Web search and data mining (pp. 645-654). New York City, New York, USA: ACM Press.
Cardie, C., Farina, C., & Bruce, T. (2006). Using natural language processing to improve eRulemaking: project highlight. Proceedings of the 2006 international conference on Digital government research (pp. 4-5). Digital Government Society of North America. Retrieved from http://dl.acm.org/citation.cfm?id=1146598.1146651
Cardie, C., Wiebe, J., Wilson, T., & Litman, D. (2003). Combining low-level and summary representations of opinions for multi-perspective question answering. In Working Notes - New Directions in Question Answering (pp. 20-27). AAAI. Retrieved from http://www.aaai.org/Papers/Symposia/Spring/2003/SS-03-07/SS03-07-004.pdf
Chaovalit, P., & Zhou, L. (2005). Movie review mining: A comparison between supervised and unsupervised classification approaches. Proceedings of the Hawaii International Conference on System Sciences (HICSS) (Vol. 00, pp. 112-121). Retrieved from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1385466
Chauchat, A. S. J., Eric, L., Lumière, U., & Mendès-France, P. (n.d.). Opinion Mining Issues and Agreement Identification in Forum Texts, 51-58.
Chen, H., & Zimbra, D. (2010). AI and Opinion Mining. IEEE Intelligent Systems, 25(3), 74-80. doi:10.1109/MIS.2010.75
Chen, W., Zong, L., Huang, W., & Ou, G. (2011). An empirical study of massively parallel bayesian networks learning for sentiment extraction from unstructured text. Web Technologies and ..., 424-435. Retrieved from http://www.springerlink.com/index/QR43263807617U78.pdf
Cheong, M., & Lee, V. C. S. (2010). A microblogging-based approach to terrorism informatics: Exploration and chronicling civilian sentiment and response to terrorism events via Twitter. Information Systems Frontiers, 13(1), 45-59. doi:10.1007/s10796-010-9273-x
Cho, K. S., Jung, N. R., & Kim, U. M. (2010). Using wordmap and score-based weight in opinion mining with mapreduce. 2010 IEEE International Conference on Service-Oriented Computing and Applications (SOCA) (pp. 1-4). IEEE. doi:10.1109/SOCA.2010.5707188


Choi, Y., Breck, E., & Cardie, C. (2006). Joint extraction of entities and relations for opinion recognition. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 431-439). Retrieved from http://acl.ldc.upenn.edu/W/W06/W06-1651.pdf
Choi, Y., & Cardie, C. (2008). Learning with Compositional Semantics as Structural Inference for Subsentential Sentiment Analysis, (October), 793-801.
Choi, Y., Cardie, C., Riloff, E., & Patwardhan, S. (2005). Identifying sources of opinions with conditional random fields and extraction patterns. Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing (pp. 355-362). ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1220620
Chu, C., Kim, S., Lin, Y., & Yu, Y. (2007). Map-reduce for machine learning on multicore. Advances in neural ... . Retrieved from http://books.google.com/books?hl=en&lr=&id=Tbn1l9P1220C&oi=fnd&pg=PA281&dq=Map-Reduce+for+Machine+Learning+on+Multicore
Church, K. W., & Hanks, P. (1989). Word association norms, mutual information and lexicography. Proceedings of the 27th Annual Conference of the ACL (pp. 76-83). Brunswick, NJ, USA: Association for Computational Linguistics.
Conway, D., & White, J. (2012). Machine Learning for Hackers. (J. Steele et al., Eds.) (First Edition). Sebastopol, CA, USA: O'Reilly Media, Inc.
Das, S., & Chen, M. (2001). Yahoo! for Amazon: Opinion extraction from small talk on the web. EFA 2001 Barcelona Meetings. Barcelona, Spain.
Das, S. R., & Chen, M. Y. (2007). Yahoo! for Amazon: Sentiment Extraction from Small Talk on the Web. Management Science, 53(9), 1375-1388. doi:10.1287/mnsc.1070.0704
Das, Sanjiv R, & Chen, M. Y. (2001). Yahoo! for amazon: Sentiment extraction from small talk on the web, 1-16.
Dave, K., & Lawrence, S. (2003). Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. Proceedings of the 12th ... . Retrieved from http://dl.acm.org/citation.cfm?id=775226
Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 1-13. doi:10.1145/1327452.1327492


Denecke, K. (2008). Using SentiWordNet for multilingual sentiment analysis. 2008 IEEE 24th International Conference on Data Engineering Workshop, 507-512. doi:10.1109/ICDEW.2008.4498370
Dey, L., & Haque, S. (2009). Opinion mining from noisy text data. International journal on document analysis and ..., 00(iv), 83-90. Retrieved from http://www.springerlink.com/index/1265305P655L2357.pdf
Diakopoulos, N. A., & Shamma, D. A. (2010). Characterizing debate performance via aggregated twitter sentiment. Proceedings of the 28th international conference on Human factors in computing systems - CHI '10, 1195. doi:10.1145/1753326.1753504
Ding, X., & Liu, B. (2009). Entity Discovery and Assignment for Opinion Mining Applications, 1125-1133.
Ding, X., Liu, B., & Yu, P. S. (2008). A Holistic Lexicon-Based Approach to Opinion Mining. Proceedings of the international conference on Web search and web data mining (pp. 231-240). New York, NY, USA: ACM.
Dini, L., & Mazzini, G. (2002). Opinion classification through information extraction. Intl. Conf. on Data Mining Methods and Databases for Engineering, Finance and Other Fields (Data Mining) (pp. 299-310). Retrieved from http://ia2010primercuat.googlecode.com/svn-history/r24/trunk/SEIGO/docs/10.1.1.109.1736.pdf
Drost, I., & Ingersoll, G. (2011). Logistic Regression (SGD). cwiki.apache.org. Retrieved December 10, 2012, from https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression
Efron, M. (2004). The Liberal Media and Right-Wing Conspiracies: Using Cocitation Information to Estimate Political Orientation in Web Documents, 390-398.
Eguchi, K., & Kando, N. (2006). Opinion-focused Summarization and its Analysis at DUC 2006.
Esuli, A. (2008). Automatic generation of lexical resources for opinion mining: models, algorithms and applications. ACM SIGIR Forum. Università di Pisa. Retrieved from http://dl.acm.org/citation.cfm?id=1480528
Esuli, Andrea. (2008). Automatic generation of lexical resources for opinion mining: models, algorithms and applications. ACM SIGIR Forum, 42(2), 105-106. Retrieved from http://dl.acm.org/citation.cfm?id=1480528
Esuli, Andrea, & Sebastiani, F. (2005). Determining the semantic orientation of terms through gloss classification. Proceedings of the 14th ACM international conference on Information and knowledge management (pp. 617-624). New York, NY, USA: ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1099713


Esuli, Andrea, & Sebastiani, F. (2006a). SentiWordNet: A publicly available lexical resource for opinion mining. Proceedings of LREC (pp. 417-422). LREC. Retrieved from http://gandalf.aksis.uib.no/lrec2006/pdf/384_pdf.pdf
Esuli, Andrea, & Sebastiani, F. (2006b). Determining term subjectivity and term orientation for opinion mining. Proceedings of EACL (Vol. 2, pp. 193-200). Retrieved from http://acl.ldc.upenn.edu/eacl2006/main/papers/13_1_esulisebastiani_192.pdf
Esuli, Andrea, & Sebastiani, F. (2007). PageRanking WordNet Synsets: An Application to Opinion Mining. Proceedings of the ACL (pp. 424-431).
Feinerer, I., Hornik, K., & Meyer, D. (2008). Text mining infrastructure in R. Journal of ..., 25(5). Retrieved from http://textanalysis.googlecode.com/files/Text_Mining_Infrastructure_in_R.pdf
Fellbaum, C. (Ed.). (1998). WordNet: An Electronic Lexical Database. Cambridge, MA, USA: MIT Press.
Fen, Z., Yabin, X., & Yanping, L. (2012). Research on Internet Hot Topic Detection Based on MapReduce Architecture. 2012 4th International Conference on Intelligent Human-Machine Systems and Cybernetics, 81-84. doi:10.1109/IHMSC.2012.26
Friedman, J., Hastie, T., & Tibshirani, R. (2009). The elements of statistical learning (Second Edition). New York: Springer-Verlag New York Inc. Retrieved from http://www-stat.stanford.edu/~tibs/book/preface.ps
Gamon, M. (2004). Sentiment classification on customer feedback data: noisy data, large feature vectors, and the role of linguistic analysis. Proceedings of the 20th international conference on ... . Retrieved from http://dl.acm.org/citation.cfm?id=1220476
Gamon, M., & Aue, A. (2005). Pulse: Mining customer opinions from free text. Advances in Intelligent ..., 121-132. Retrieved from http://www.springerlink.com/index/94q1nrhfc8a4e8tn.pdf
Gantz, J., & Reinsel, D. (2010). The digital universe decade - are you ready? Publication of IDC (Analyse the Future) Information and ..., 2009(May), 1-16.
Gantz, J., & Reinsel, D. (2011). Extracting value from chaos. White Paper, IDC, (June), 1-12. Retrieved from http://www.itu.dk/people/rkva/2011-Fall-SMA/readings/ExtractingValuefromChaos.pdf
Gloor, P. A., Krauss, J., Nann, S., Fischbach, K., & Schoder, D. (2009). Web Science 2.0: Identifying Trends through Semantic Social Network Analysis. 2009 International Conference on Computational Science and Engineering, 215-222. doi:10.1109/CSE.2009.186


Godbole, N., & Skiena, S. (2007). Large-Scale Sentiment Analysis for News and Blogs.
Goyal, A., Bonchi, F., & Lakshmanan, L. V. S. (2010). Learning influence probabilities in social networks. Proceedings of the third ACM international conference on Web search and data mining - WSDM '10, 241. doi:10.1145/1718487.1718518
Gregory, M. L., Chinchor, N., Whitney, P., Carter, R., Hetzler, E., & Turner, A. (2006). User-directed Sentiment Analysis: Visualizing the Affective Content of Documents. In A. Aue & M. Gamon (Eds.), Proceedings of the Workshop on Sentiment and Subjectivity in Text (pp. 23-30). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from http://dl.acm.org/citation.cfm?id=1654645
Halevy, A., Norvig, P., & Pereira, F. (2009). The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 24(2), 8-12. doi:10.1109/MIS.2009.36
Han, J., Kamber, M., & Pei, J. (2006). Data Mining: Concepts and Techniques, Second Edition (The Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann. Retrieved from http://www.amazon.com/Data-Mining-Concepts-Techniques-Management/dp/1558609016
Harb, A., Plantié, M., & Dray, G. (2008). Web Opinion Mining: How to extract opinions from blogs? Proceedings of the 5th ..., 211-217. Retrieved from http://dl.acm.org/citation.cfm?id=1456269
Hatzivassiloglou, V., & McKeown, K. R. (1997). Predicting the semantic orientation of adjectives. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics (pp. 174-181). Stroudsburg, PA, USA: Association for Computational Linguistics. doi:10.3115/976909.979640
Hatzivassiloglou, V., & Wiebe, J. M. (2000). Effects of adjective orientation and gradability on sentence subjectivity. Proceedings of the International Conference on Computational Linguistics (COLING) (Vol. 1, pp. 299-305). Morristown, NJ, USA: Association for Computational Linguistics. doi:10.3115/990820.990864
Hu, M., & Liu, B. (2004a). Mining opinion features in customer reviews. Proceedings of the National Conference on Artificial Intelligence. Retrieved from http://www.aaai.org/Papers/AAAI/2004/AAAI04-119.pdf
Hu, M., & Liu, B. (2004b). Mining and Summarizing Customer Reviews. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 168-177). New York, NY, USA: ACM. doi:10.1145/1014052.1014073
Hu, P., Huang, M., Xu, P., Li, W., Usadi, A. K., & Zhu, X. (2011). Generating Breakpoint-based Timeline Overview for News Topic Retrospection. 2011 IEEE 11th International Conference on Data Mining, 260-269. doi:10.1109/ICDM.2011.71


Ingersoll, G. (2012). Algorithms - Apache Mahout. cwiki.apache.org. Retrieved November 20, 2012, from https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
Jamali, S., & Rangwala, H. (2009). Digging Digg: Comment Mining, Popularity Prediction, and Social Network Analysis. 2009 International Conference on Web Information Systems and Mining, 32-38. doi:10.1109/WISM.2009.15
Jin, W., Ho, H. H., & Srihari, R. K. (2009). OpinionMiner: A Novel Machine Learning System for Web Opinion Mining and Extraction, 1195-1203.
Johnson, C., Shukla, P., & Shukla, S. (n.d.). On Classifying the Political Sentiment of Tweets. cs.utexas.edu. Retrieved from http://www.cs.utexas.edu/~cjohnson/TwitterSentimentAnalysis.pdf
Kaiser, C., & Bodendorf, F. (2009). Opinion and Relationship Mining in Online Forums. 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, 128-131. doi:10.1109/WI-IAT.2009.25
Kaji, N., & Kitsuregawa, M. (2006). Automatic construction of polarity-tagged corpus from HTML documents. Proceedings of the COLING/ACL Main Conference Poster Sessions (pp. 452-459). Retrieved from http://dl.acm.org/ft_gateway.cfm?id=1273132&type=pdf
Kalakota, R. (2011). What is a Hadoop? Explaining Big Data to the C-Suite. practicalanalytics.wordpress.com. Retrieved January 20, 2012, from http://practicalanalytics.wordpress.com/2011/11/06/explaining-hadoop-to-management-whats-the-big-data-deal/
Kamps, J., Marx, M., Mokken, R., & Rijke, M. De. (2004). Using wordnet to measure semantic orientations of adjectives. LREC (pp. 1115-1118). Retrieved from http://dare.uva.nl/document/154122
Kamvar, S. D., & Harris, J. (2011). We feel fine and searching the emotional web. Proceedings of the fourth ACM international conference on Web search and data mining - WSDM '11, 117. doi:10.1145/1935826.1935854
Kanayama, H., & Nasukawa, T. (2006). Fully automatic lexicon expansion for domain-oriented sentiment analysis. Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (pp. 355-363). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from http://dl.acm.org/citation.cfm?id=1610075.1610125
Kawai, Y., Kumamoto, T., & Tanaka, K. (2007). Fair News Reader: Recommending news articles with different sentiments based on user preference. ...-Based Intelligent Information and ..., 612-622. Retrieved from http://www.springerlink.com/index/ynhr25513w122214.pdf


Kessler, B., Nunberg, G., & Schütze, H. (1997). Automatic detection of text genre. Proceedings of the 35th ACL/8th EACL (pp. 32-38). Retrieved from http://dl.acm.org/citation.cfm?id=979622
Khuc, V. N., Shivade, C., Ramnath, R., & Ramanathan, J. (2012). Towards building large-scale distributed systems for twitter sentiment analysis. Proceedings of the 27th Annual ACM Symposium on Applied Computing (p. 459). New York, New York, USA: ACM Press. doi:10.1145/2245276.2245364
Kim, H., Ganesan, K., Sondhi, P., & Zhai, C. (2011). Comprehensive Review of Opinion Summarization (pp. 1-30). Urbana, IL, USA: IDEALS. Retrieved from http://www.ideals.illinois.edu/handle/2142/18702
Kim, S.-M., & Hovy, E. (2004). Determining the sentiment of opinions. Proceedings of the 20th international conference on Computational Linguistics. Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from http://dl.acm.org/citation.cfm?id=1220355.1220555
Kim, S.-M., & Hovy, E. (2006). Identifying and analyzing judgment opinions. Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics (pp. 200-207). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from http://dl.acm.org/citation.cfm?id=1220835.1220861
Kim, S.-M., & Hovy, E. (2007). Crystal: Analyzing Predictive Opinions on the Web, (June), 1056-1064.
Kobayashi, N., Inui, K., & Matsumoto, Y. (2007). Opinion Mining from Web Documents: Extraction and Structurization. Transactions of the Japanese Society for Artificial Intelligence, 22(1), 227-238. doi:10.1527/tjsai.22.227
Kraychev, B., & Koychev, I. (2012). Computationally effective algorithm for information extraction and online review mining. Proceedings of the 2nd International Conference on Web Intelligence, Mining and Semantics (p. 1). New York, New York, USA: ACM Press. doi:10.1145/2254129.2254207
Kroeger, P. (2005). Analyzing grammar: An introduction. Retrieved from http://books.google.com/books?hl=en&lr=&id=rSglHbBaNyAC&oi=fnd&pg=PR11&dq=Analyzing+Grammar+-+An+introduction
Ku, L., Liang, Y., & Chen, H. (2006). Opinion Extraction, Summarization and Tracking in News and Blog Corpora, (2001).
Ku, L., Lo, Y., & Chen, H. (n.d.). Using Polarity Scores of Words for Sentence-level Opinion Extraction.
Kudo, T., & Matsumoto, Y. (2004). A Boosting Algorithm for Classification of Semi-Structured Text. Proceedings of EMNLP. Retrieved from http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Kudo2.pdf


Kwon, N., Shulman, S., & Hovy, E. (2006). Multidimensional text analysis for eRulemaking. Proceedings of the 2006 international conference on Digital government research (pp. 157-166). Digital Government Society of North America. Retrieved from http://dl.acm.org/citation.cfm?id=1146649
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proceedings of ICML (Vol. 2001, pp. 282-289). Retrieved from http://repository.upenn.edu/cis_papers/159/
Laniado, D., & Mika, P. (2010). Making Sense of Twitter, 470-485.
Lee, D., Jeong, O.-R., & Lee, S. (2008). Opinion mining of customer feedback data on the web. Proceedings of the 2nd international conference on Ubiquitous information management and communication - ICUIMC '08, 230. doi:10.1145/1352793.1352842
Lee, K., Lee, Y., Choi, H., & Chung, Y. (2012). Parallel data processing with MapReduce: a survey. ACM SIGMOD Record, 40(4), 11-20. Retrieved from http://dl.acm.org/citation.cfm?id=2094118
Lerman, K., & Ghosh, R. (2010). Information contagion: An empirical study of the spread of news on Digg and Twitter social networks. Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (pp. 90-97). Stroudsburg, PA, USA: AAAI. Retrieved from http://www.aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/viewPDFInterstitial/1509/1839
Li, D., Shuai, X., Sun, G., Tang, J., Ding, Y., & Luo, Z. (2012). Mining topic-level opinion influence in microblog. Proceedings of the 21st ACM international conference on Information and knowledge management - CIKM '12, 1562. doi:10.1145/2396761.2398473
Li, N., & Wu, D. D. (2010). Using text mining and sentiment analysis for online forums hotspot detection and forecast. Decision Support Systems, 48(2), 354-368. doi:10.1016/j.dss.2009.09.003
Lin, C. (n.d.). Joint Sentiment/Topic Model for Sentiment Analysis.
Lin, J., & Kolcz, A. (2012). Large-scale machine learning at twitter. Proceedings of the 2012 international conference on Management of Data - SIGMOD '12, 793. doi:10.1145/2213836.2213958
Lincoln, A., Douglas, S. A., & Angle, P. M. (1991). The Complete Lincoln-Douglas Debates of 1858. (P. M. Angle, Ed.) (p. 470). Chicago, IL, USA: University of Chicago Press.


Lippe, W.-M. (2005). Soft-Computing: Mit neuronalen Netzen, Fuzzy-Logic und evolutionären Algorithmen (First Edition, p. 557). Berlin, Heidelberg: Springer Verlag.
Liu, B. (2010). Sentiment analysis and subjectivity. In N. Indurkhya & F. J. Damerau (Eds.), Handbook of Natural Language Processing (Second Edition, pp. 1-38). Retrieved from http://www.cs.uic.edu/~liub/FBS/NLP-handbook-sentiment-analysis.pdf
Liu, B., Hu, M., & Cheng, J. (2005). Opinion observer: analyzing and comparing opinions on the Web. Proceedings of the 14th international conference on World Wide Web (pp. 342-351). New York City, New York, USA: ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1060797
Liu, H., Lieberman, H., & Selker, T. (2003). A model of textual affect sensing using real-world knowledge. Proceedings of the 8th international conference on Intelligent user interfaces - IUI '03 (pp. 125-132). New York, New York, USA: ACM Press. doi:10.1145/604045.604067
Liu, Y., Huang, X., An, A., & Yu, X. (2007). ARSA: a sentiment-aware model for predicting sales performance using blogs. SIGIR '07 Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 607-614). New York, New York, USA: ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1277845
Lu, Y., Zhai, C., & Sundaresan, N. (2009). Rated aspect summarization of short comments. Proceedings of the 18th international conference on World wide web - WWW '09, 131. doi:10.1145/1526709.1526728
Manning, C. D., Raghavan, P., & Schütze, H. (2009). An Introduction to Information Retrieval (Online Edition, p. 581). Cambridge, MA, USA: Cambridge University Press. Retrieved from http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity (p. 156). Washington, D.C., USA.
Marques, F. J. B., Labra, B. Poblete, & Rocha, M. M. (2012). Análisis temporal de opiniones en twitter. Universidad de Chile.
Mei, Q., Ling, X., Wondra, M., Su, H., & Zhai, C. (2007). Topic sentiment mixture: modeling facets and opinions in weblogs. Proceedings of the 16th international conference on World Wide Web (pp. 171-180). New York, NY, USA: ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1242596
Melville, P., & Lawrence, R. D. (2009). Sentiment Analysis of Blogs by Combining Lexical Knowledge with Text Classification, 1275-1283.


Merriam-Webster. (2012a). Opinion - Definition and More from the Free Merriam-Webster Dictionary. Merriam-Webster. Retrieved October 30, 2012, from http://www.merriam-webster.com/dictionary/opinion
Merriam-Webster. (2012b). Opinion - Definition and More from the Free Merriam-Webster Dictionary. Merriam-Webster Online Dictionary.
Mihalcea, R., & Banea, C. (2007). Learning Multilingual Subjective Language via Cross-Lingual Projections, (June), 976-983.
Miller, G. A. (1995). WordNet: A Lexical Database for English. Communications of the ACM, 38(11), 39-41.
Mishne, G., & Glance, N. (2006). Predicting movie sales from blogger sentiment. AAAI 2006 Spring Symposium on Computational Approaches to Analyzing Weblogs. AAAI. Retrieved from http://www.aaai.org/Papers/Symposia/Spring/2006/SS-06-03/SS06-03-030.pdf
Mishne, G., & Rijke, M. De. (2006a). MoodViews: Tools for blog mood analysis. AAAI 2006 Spring Symposium on Computational Approaches to Analysing Weblogs (pp. 153-154). AAAI. Retrieved from http://www.aaai.org/Papers/Symposia/Spring/2006/SS-06-03/SS06-03-029.pdf
Mishne, G., & Rijke, M. De. (2006b). Capturing global mood levels using blog posts. AAAI 2006 Spring Symposium on Computational Approaches to Analysing Weblogs. AAAI. Retrieved from http://www.aaai.org/Papers/Symposia/Spring/2006/SS-06-03/SS06-03-028.pdf
Missen, M. M. S., Boughanem, M., & Cabanac, G. (2010). Opinion Detection in Blogs: What Is Still Missing? 2010 International Conference on Advances in Social Networks Analysis and Mining, 270-275. doi:10.1109/ASONAM.2010.59
Mitchell, T. (1997). Machine Learning. Annual Review of Computer Science (Vol. 4, pp. 417-433). McGraw-Hill. doi:10.1146/annurev.cs.04.060190.002221
Morinaga, S., & Yamanishi, K. (2002). Mining product reputations on the web. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) (pp. 341-349). ACM Press. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.74.6816&rep=rep1&type=pdf
Mullen, T., & Collier, N. (2004). Sentiment analysis using support vector machines with diverse information sources. Proceedings of EMNLP. Retrieved from http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mullen.pdf
Nagar, A., & Hahsler, M. (2012). Using Text and Data Mining Techniques to extract Stock Market Sentiment from Live News Streams, XX(Iccts).


Nagel, S. (2011). Nutch Tutorial. http://wiki.apache.org/nutch. Retrieved November 3, 2012, from http://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website
Nasukawa, T., & Yi, J. (2003). Sentiment analysis: Capturing favorability using natural language processing. Second International Conference on Knowledge Capture (K-CAP) (pp. 70-77). Retrieved from http://dl.acm.org/citation.cfm?id=945658
Nutch Wiki. (2009). Nutch Languageidentifier Plugin. http://wiki.apache.org/nutch. Retrieved November 1, 2012, from http://wiki.apache.org/nutch/LanguageIdentifierPlugin
Oelke, D., Hao, M., Rohrdantz, C., Keim, D. A., Dayal, U., Haug, L.-E., & Janetzko, H. (2009). Visual opinion analysis of customer feedback data. 2009 IEEE Symposium on Visual Analytics Science and Technology, 187-194. doi:10.1109/VAST.2009.5333919
Ohana, B., & Tierney, B. (2009). Sentiment Classification of Reviews Using SentiWordNet.
Osgood, C. E., Succi, G. J., & Tannenbaum, P. H. (1957). The Measurement of Meaning. Urbana, IL, USA: University of Illinois Press.
Owen, S., Anil, R., Dunning, T., & Friedman, E. (2012). Mahout in Action. (K. Osborne & A. Carroll, Eds.). Shelter Island, NY, USA: Manning Publications. Retrieved from http://dl.acm.org/citation.cfm?id=2132656
O'Connor, B., Balasubramanyan, R., Routledge, B. R., & Smith, N. A. (2010). From tweets to polls: Linking text sentiment to public opinion time series. Tepper School of Business, (Paper 559). Retrieved from http://www.aaai.org/ocs/index.php/ICWSM/ICWSM10/paper/viewPDFInterstitial/1536/1842
Pak, A., & Paroubek, P. (2010). Twitter as a corpus for sentiment analysis and opinion mining. Proceedings of LREC, 1320-1326. Retrieved from http://deepthoughtinc.com/wp-content/uploads/2011/01/Twitter-as-a-Corpus-for-Sentiment-Analysis-and-Opinion-Mining.pdf
Pang, B., & Lee, L. (2004). A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. Proceedings of the 42nd Annual Meeting on ... . Retrieved from http://dl.acm.org/citation.cfm?id=1218990
Pang, B., & Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. Annual Meeting - Association for ..., (1). Retrieved from http://acl.ldc.upenn.edu/P/P05/P05-1015.pdf
Pang, B., & Lee, L. (2008). Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, 2(1-2), 1-135. doi:10.1561/1500000001


Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up?: sentiment classification using machine learning techniques. EMNLP '02 Proceedings of the ACL-02 conference on Empirical methods in natural language processing (Vol. 10, pp. 79-86). Retrieved from http://dl.acm.org/citation.cfm?id=1118704
Parikh, R., & Movassate, M. (2009). Sentiment analysis of user-generated twitter updates using various classification techniques. CS224N Final Report, 1-18. Retrieved from http://nlp.stanford.edu/courses/cs224n/2009/fp/19.pdf
Parrot, W. G. (Ed.). (2000). Emotions in social psychology: Key Readings in Social Psychology (p. 392). Psychology Press.
Popescu, A., & Etzioni, O. (2005). Extracting product features and opinions from reviews. Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing (pp. 339-346). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from http://dl.acm.org/citation.cfm?id=1220618
Potthast, M., & Becker, S. (2010). Opinion Summarization of Web Comments, 668-669.
Prabowo, R., & Thelwall, M. (2009). Sentiment analysis: A combined approach. Journal of Informetrics, 3(2), 143-157. doi:10.1016/j.joi.2009.01.003
Qiu, G., Liu, B., Bu, J., & Chen, C. (2009). Expanding domain sentiment lexicon through double propagation. Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI-09) (pp. 1199-1204). Retrieved from http://www.aaai.org/ocs/index.php/IJCAI/IJCAI-09/paper/viewFile/313/965
Quirk, R., Greenbaum, S., Leech, G., Svartvik, J., & Crystal, D. (1985). A comprehensive grammar of the English language. Boston, MA, USA: Longman. Retrieved from http://journals.cambridge.org/production/action/cjoGetFulltext?fulltextid=2545152
Rennie, J., Shih, L., Teevan, J., & Karger, D. (2003). Tackling the Poor Assumptions of Naive Bayes Text Classifiers. Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003). Washington, D.C., USA. Retrieved from http://www.aaai.org/Papers/ICML/2003/ICML03-081.pdf
Riloff, E., Wiebe, J., & Phillips, W. (2005). Exploiting subjectivity classification to improve information extraction. Proceedings of AAAI (pp. 1106-1111). Retrieved from http://www.aaai.org/Papers/AAAI/2005/AAAI05-175.pdf
Riloff, E., Wiebe, J., & Wilson, T. (2003). Learning subjective nouns using extraction pattern bootstrapping. Proceedings of the Conference on Natural Language Learning (CoNLL) (pp. 25-32). Retrieved from http://dl.acm.org/citation.cfm?id=1119180
Russom, B. P. (2011). Big data analytics.


Sarawagi, S. (2008). Information Extraction, 1(3), 261-377. doi:10.1561/1500000003
Seki, Y., Eguchi, K., Kando, N., & Aono, M. (2005). Multi-Document Summarization with Subjectivity Analysis at DUC 2005.
Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: Primal Estimated sub-GrAdient Solver for SVM. Proceedings of the 24th international conference on Machine learning (pp. 807-814). New York City, New York, USA: ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1273598
Silva, M. J., Carvalho, P., Sarmento, L., & Magalhães, P. (n.d.). The Design of OPTIMISM, an Opinion Mining System for Portuguese Politics.
Song, X., Chi, Y., Hino, K., & Tseng, B. (2007). Identifying opinion leaders in the blogosphere. Proceedings of the sixteenth ACM conference on Conference on information and knowledge management - CIKM '07, 971. doi:10.1145/1321440.1321588
Spertus, E. (1994). Smokey: Automatic Recognition of Hostile Messages. Proceedings of the IAAI.
Stonebraker, M., Abadi, D., & DeWitt, D. (2010). MapReduce and parallel DBMSs: friends or foes? Communications of the ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1629197
Tang, H., Tan, S., & Cheng, X. (2009). A survey on sentiment detection of reviews. Expert Systems with Applications, 36(7), 10760-10773. doi:10.1016/j.eswa.2009.02.063
Taylor, C. C., Michie, D., & Spiegelhalter, D. J. (1994). Machine Learning, Neural and Statistical Classification. (C. C. Taylor, D. Michie, & D. J. Spiegelhalter, Eds.) (p. 312). Ellis Horwood Ltd. Retrieved from http://www1.maths.leeds.ac.uk/~charles/statlog/
The Apache Software Foundation. (2009). MapReduce Tutorial. hadoop.apache.org. Retrieved December 5, 2012, from http://hadoop.apache.org/docs/mapreduce/current/mapred_tutorial.html#Example:+WordCount+v1.0
The Apache Software Foundation. (2011). Tika language identifier. http://tika.apache.org. Retrieved November 11, 2011, from http://tika.apache.org/1.2/detection.html#Language_Detection
The Apache Software Foundation. (2012a). Welcome to Apache™ Hadoop! hadoop.apache.org. Retrieved November 24, 2012, from http://hadoop.apache.org/index.html
The Apache Software Foundation. (2012b). About Apache Nutch. nutch.apache.org. Retrieved October 10, 2012, from http://nutch.apache.org/about.html


The Apache Software Foundation. (2012c). Apache Mahout: Scalable machine learning and data mining. mahout.apache.org. Retrieved November 20, 2012, from http://mahout.apache.org/
Thelwall, M., Buckley, K., & Paltoglou, G. (2011). Sentiment in Twitter events. Journal of the American Society for Information Science and Technology, 62(2), 406–418. doi:10.1002/asi
Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Zhang, N., Antony, S., et al. (2010). Hive: A petabyte scale data warehouse using Hadoop. Data Engineering (ICDE), 2010 IEEE 26th International Conference on (pp. 996–1005). IEEE. Retrieved from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5447738
Titov, I., & McDonald, R. (2008). Modeling online reviews with multi-grain topic models, 1–15.
Tsai, T.-M., Shih, C.-C., Peng, T.-C., & Chou, S.-C. T. (2009). Explore the possibility of utilizing blog semantic analysis for domain expert and opinion mining. 2009 International Conference on Intelligent Networking and Collaborative Systems, 241–244. doi:10.1109/INCOS.2009.63
Tsytsarau, M., Palpanas, T., & Denecke, K. (2010). Scalable discovery of contradictions on the web. Proceedings of the 19th International Conference on World Wide Web (WWW '10), 1195. doi:10.1145/1772690.1772871
Turney, P. (2002). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. Proceedings of the Association for Computational Linguistics (pp. 417–424). Retrieved from http://dl.acm.org/citation.cfm?id=1073153
Twitter Inc. (2012a). Twitter Streaming API. Retrieved November 15, 2012, from https://dev.twitter.com/docs/streaming-apis
Twitter Inc. (2012b). Twitter REST API. twitter.com. Retrieved November 10, 2012, from https://dev.twitter.com/docs/api
Twitter Inc. (2012c). Using Twitter Search API. twitter.com. Retrieved November 10, 2012, from https://dev.twitter.com/docs/using-search
Velikovich, L., & Blair-Goldensohn, S. (2010). The viability of web-derived polarity lexicons. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 777–785). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from http://dl.acm.org/citation.cfm?id=1858118
Vom Brocke, J., Simons, A., & Niehaves, B. (2009). Reconstructing the giant: On the importance of rigour in documenting the literature search process. 17th European Conference on Information Systems (pp. 1–13).


Wang, X., Wei, F., Liu, X., Zhou, M., & Zhang, M. (2011). Topic sentiment analysis in Twitter: A graph-based hashtag sentiment classification approach, 1031–1040.
Wanner, F., Rohrdantz, C., & Mansmann, F. (2009). Visual sentiment analysis of RSS news feeds featuring the US presidential election in 2008. Workshop on Visual Interfaces to the Social and the Semantic Web (VISSW). Retrieved from http://smart-ui.org/events/vissw2009/papers/VISSW2009-Wanner.pdf
Webster, J., & Watson, R. T. (2002). Analyzing the past to prepare for the future: Writing a literature review. Management Information Systems Quarterly, 26(2), 3.
White, T. (2012). Hadoop: The definitive guide. (M. Loukides & M. Blanchette, Eds.) (Third Edition). Sebastopol, CA, USA: O'Reilly Media. Retrieved from http://books.google.com/books?hl=en&lr=&id=Nff49D7vnJcC&oi=fnd&pg=PR5&dq=Hadoop++The+Definitive+Guide&ots=IieyWpaAPs&sig=qLlJO3OnI5_xQpcLlYbKx9mBTBg
Whitelaw, C., Garg, N., & Argamon, S. (2005). Using appraisal groups for sentiment analysis. Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM '05), 625. doi:10.1145/1099554.1099714
Wiebe, J. (2006). Word sense and subjectivity, (July), 1065–1072.
Wiebe, J., Wilson, T., & Cardie, C. (2005). Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, (2003), 1–50. Retrieved from http://www.springerlink.com/index/E854800108051U61.pdf
Wiebe, J. M. (1994). Tracking point of view in narrative. Computational Linguistics, 20(2), 233–287. Retrieved from http://dl.acm.org/citation.cfm?id=972529
Wiebe, J. M., Bruce, R., & O'Hara, T. (1999). Development and use of a gold-standard data set for subjectivity classifications. Proceedings of the Association for Computational Linguistics (pp. 246–253). ACL. Retrieved from http://dl.acm.org/citation.cfm?id=1034721
Wiegand, M., & Klakow, D. (2008). The role of knowledge-based features in polarity classification at sentence level.
Wilks, Y., & Stevenson, M. (1998). The grammar of sense: Using part-of-speech tags as a first step in semantic disambiguation. Natural Language Engineering, 4(1), 135–144. Retrieved from http://journals.cambridge.org/production/action/cjoGetFulltext?fulltextid=48444
Wilson, T., Wiebe, J., & Hoffmann, P. (2005). Recognizing contextual polarity in phrase-level sentiment analysis. Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing (pp. 347–354). Morristown, NJ, USA: Association for Computational Linguistics. doi:10.3115/1220575.1220619


Wilson, T., Wiebe, J., & Hoffmann, P. (2009). Recognizing contextual polarity: An exploration of features for phrase-level sentiment analysis. Computational Linguistics, 35(3), 399–433. Retrieved from http://www.mitpressjournals.org/doi/abs/10.1162/coli.08-012-R1-06-90
Xia, Y., & Xu, R. (2007). The unified collocation framework for opinion mining. Machine Learning and ..., (August), 19–22. Retrieved from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=4370260
Xu, K., Liao, S. S., Li, J., & Song, Y. (2011). Mining comparative opinions from customer reviews for competitive intelligence. Decision Support Systems, 50(4), 743–754. doi:10.1016/j.dss.2010.08.021
Yerva, S., Jeung, H., & Aberer, K. (2012). Cloud based social and sensor data fusion. Information Fusion (FUSION) (pp. 2494–2501). Retrieved from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6290607
Yi, J., Nasukawa, T., Bunescu, R., & Niblack, W. (2003). Sentiment analyzer: Extracting sentiments about a given topic using natural language processing techniques. Third IEEE International Conference on Data Mining, 427–434. doi:10.1109/ICDM.2003.1250949
Zafarani, R., Cole, W. D., & Liu, H. (2010). Sentiment propagation in social networks: A case study in LiveJournal, 413–420.
Zeng, D. (2009). Finding leaders from opinion networks. 2009 IEEE International Conference on Intelligence and Security Informatics, 266–268. doi:10.1109/ISI.2009.5137323
Zhang, J., Jin, R., Yang, Y., & Hauptmann, A. (2003). Modified logistic regression: An approximation to SVM and its applications in large-scale text categorization. Proceedings of the 20th International Conference on Machine Learning. AAAI. Retrieved from http://www.aaai.org/Papers/ICML/2003/ICML03-115.pdf
Zhuang, L., Jing, F., & Zhu, X.-Y. (2006). Movie review mining and summarization. Proceedings of the 15th ACM International Conference on Information and Knowledge Management (CIKM '06), 43. doi:10.1145/1183614.1183625


Appendix

This section contains additional tables and listings.



Overview of Classification Approaches

Table 10: Overview of Classification Approaches

Legend: Output: P2 = two polarity classes, Pn = n polarity classes; Features: 1G = unigram, nG = n-gram; Feat. Encoding: NFREQUENCY = normalized frequency; Precision/Recall: * = calculated; F1: ** = macro-average; Evaluation: RS = random sampling, CVn = n-fold cross-validation; Best: best result in paper

Paper | Name | Classifier | Output | Features | Feat. Encoding | Type | Dataset size | Accuracy | Precision | Recall | F1 | Evaluation
(Xia & Xu, 2007) | UCD | Pattern-DICTIONARY-SCORER | P5 | 2G | FREQUENCY | REVIEWS | 1000 | 0.842 | 0.879 | 0.802 | 0.839 | RS
(Xia & Xu, 2007) | AD | Pattern-DICTIONARY-SCORER | P5 | 2G | FREQUENCY | REVIEWS | 1000 | 0.843 | 0.872 | 0.573 | 0.692 | RS
(Xia & Xu, 2007) | SD | Pattern-DICTIONARY-SCORER | P5 | 2G | FREQUENCY | REVIEWS | 1000 | 0.846 | 0.857 | 0.623 | 0.722 | RS
(Xia & Xu, 2007) | GCD | Pattern-DICTIONARY-SCORER | P5 | 2G | FREQUENCY | REVIEWS | 1000 | 0.844 | 0.903 | 0.476 | 0.623 | RS
(Xia & Xu, 2007) | UCD-NoSubj | Pattern-DICTIONARY-SCORER | P5 | 2G | FREQUENCY | REVIEWS | 1000 | 0.679 | 0.703 | 0.785 | 0.742 | RS
(Balahur et al., 2009) | JRCLists | DICTIONARY-SCORER | P4 | 1G | FREQUENCY | NEWS | 99 | – | 0.586* | 0.612* | – | RS
(Balahur et al., 2009) | SentiWN | DICTIONARY-SCORER | P4 | 1G | FREQUENCY | NEWS | 99 | – | 0.515* | 0.278* | – | RS
(Balahur et al., 2009) | WNAffect | DICTIONARY-SCORER | P4 | 1G | FREQUENCY | NEWS | 99 | – | 0.485* | 0.262* | – | RS
(Balahur et al., 2009) | MicroWN | DICTIONARY-SCORER | P4 | 1G | FREQUENCY | NEWS | 99 | – | 0.546* | 0.550* | – | RS
(Balahur et al., 2009) | SentiWN+WNAffect | DICTIONARY-SCORER | P4 | 1G | FREQUENCY | NEWS | 99 | – | 0.449* | 0.440* | – | RS
(Balahur et al., 2009) | All | DICTIONARY-SCORER | P4 | 1G | FREQUENCY | NEWS | 99 | – | 0.720* | 0.682* | – | RS
(Balahur et al., 2009) | JRCLists | DICTIONARY-SCORER | P4 | 1G | FREQUENCY | NEWS | 99 | – | 0.542* | 0.545* | – | RS
(Balahur et al., 2009) | SentiWN | DICTIONARY-SCORER | P4 | 1G | FREQUENCY | NEWS | 99 | – | 0.515* | 0.263* | – | RS
(Balahur et al., 2009) | WNAffect | DICTIONARY-SCORER | P4 | 1G | FREQUENCY | NEWS | 99 | – | 0.485* | 0.248* | – | RS
(Balahur et al., 2009) | MicroWN | DICTIONARY-SCORER | P4 | 1G | FREQUENCY | NEWS | 99 | – | 0.494* | 0.501* | – | RS
(Balahur et al., 2009) | SentiWN+WNAffect | DICTIONARY-SCORER | P4 | 1G | FREQUENCY | NEWS | 99 | – | 0.434* | 0.435* | – | RS
(Balahur et al., 2009) | All | DICTIONARY-SCORER | P4 | 1G | FREQUENCY | NEWS | 99 | – | 0.661* | 0.661* | – | RS

The remaining rows of the table report two-class polarity results from (Whitelaw et al., 2005), (Ohana & Tierney, 2009), (Prabowo & Thelwall, 2009), (Pang et al., 2002), (Choi & Cardie, 2008), (Mullen & Collier, 2004), and (Parikh & Movassate, 2009). The classifiers comprise SVMs and SVM-plus-lexicon combinations, Naive Bayes, Maximum Entropy, dictionary scorers, and rule/statistics hybrids over unigram, n-gram, appraisal-group, and PMI/Osgood feature sets, evaluated on review, news, comment, and Twitter collections of 99 to 2,000 documents using random sampling or 3- and 10-fold cross-validation. Reported accuracies range from 0.564 to 0.907; macro-averaged F1 values range from 0.559 to 0.908.

Hadoop

Table 11: WordCount Example. From the Hadoop MapReduce tutorial (The Apache Software Foundation, 2009); comments added.


package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

// Driver
public class WordCount extends Configured implements Tool {

    // Mapper class
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // map function
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer class
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        // reduce function
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // driver method
    public int run(String[] args) throws Exception {
        // Configuration of the job object
        Job job = new Job(getConf());
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");

        // Setting of key type and value type (output)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Setting Mapper, Combiner, and Reducer implementations
        job.setMapperClass(Map.class);
        job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        // The file type of the input file(s)
        job.setInputFormatClass(TextInputFormat.class);
        // File type of the output files
        job.setOutputFormatClass(TextOutputFormat.class);

        // Input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Run job
        boolean success = job.waitForCompletion(true);
        return success ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int ret = ToolRunner.run(new WordCount(), args);
        System.exit(ret);
    }
}
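A JAR containing this class can be submitted to a cluster with the hadoop command-line tool; the JAR name and the HDFS paths in the following call are illustrative:

bin/hadoop jar wordcount.jar org.myorg.WordCount /user/dox/input /user/dox/output

The output directory must not exist before the run; Hadoop creates it and writes one result file per reducer.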


Mahout Algorithms

Type | Algorithm | Implementation | Execution Model | Size of Training Set (records)
Classification | Logistic Regression (SGD) | Integrated | Sequential, online, incremental | Less than tens of millions
Classification | Naive / (Complementary Naive) Bayesian | Integrated | Parallel | Millions to hundreds of millions
Classification | Support Vector Machines (SVM) | Open | Unknown | Unknown
Classification | Perceptron/Winnow | Open | Unknown | Unknown
Classification | Neural Network | Open | Unknown | Unknown
Classification | Random Forest (RF) | Integrated | Parallel | Less than tens of millions
Classification | Restricted Boltzmann Machines | Open | Unknown | Unknown
Classification | Online Passive Aggressive | Integrated | Sequential | Unknown
Classification | Boosting | Awaiting patch commit | Unknown | Unknown
Classification | Hidden Markov Models (HMM) | Integrated | Sequential | Unknown
Clustering | Canopy Clustering | Integrated | Sequential, parallel | –
Clustering | K-Means Clustering | Integrated | Sequential, parallel | –
Clustering | Fuzzy K-Means | Integrated | Sequential, parallel | –
Clustering | Expectation Maximization Clustering (EM) | Integrated | Sequential, parallel | –
Clustering | Mean Shift Clustering | Integrated | Sequential, parallel | –
Clustering | Dirichlet Process Clustering | Integrated | Sequential, parallel | –
Clustering | Latent Dirichlet Allocation | Integrated | Sequential, parallel | –
Clustering | Spectral Clustering | Integrated | Sequential, parallel | –
Clustering | Minhash Clustering | Integrated | Sequential, parallel | –
Clustering | Top Down Clustering | Unknown | Unknown | –

Table 12: Classification and Clustering Algorithms of Mahout. Current state as of December 2012; information collected from (Ingersoll, 2012; Owen et al., 2012). Algorithms used in this thesis are highlighted in boldface.
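To illustrate the first row of Table 12, the following minimal sketch trains Mahout's integrated SGD logistic regression sequentially and incrementally. The feature dimensionality, feature indices, and hyperparameters are invented for this example; a real pipeline would derive sparse feature vectors from the preprocessed documents.

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class SgdSketch {
    public static void main(String[] args) {
        int numFeatures = 1000; // assumed size of the (hashed) feature space

        // Two polarity classes (P2); L1 prior encourages sparse models
        OnlineLogisticRegression learner =
                new OnlineLogisticRegression(2, numFeatures, new L1())
                        .learningRate(0.1)
                        .lambda(1.0e-4);

        // One training example: sparse unigram-presence vector, label 1 = positive
        Vector doc = new RandomAccessSparseVector(numFeatures);
        doc.setQuick(42, 1.0); // hypothetical feature index of a term
        doc.setQuick(7, 1.0);
        learner.train(1, doc); // online, incremental update (cf. Table 12)

        // Score a document: probability of the positive class
        double p = learner.classifyScalar(doc);
        System.out.println("P(positive) = " + p);
    }
}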


PipelineML

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://de.uni-muenster.wi.dsteffen.dox/PipelineML"
        xmlns:tns="http://de.uni-muenster.wi.dsteffen.dox/PipelineML"
        elementFormDefault="qualified">

  <!-- pipeline element -->
  <element name="pipeline">
    <complexType>
      <sequence>
        <!-- id -->
        <element name="id" type="string" minOccurs="1" maxOccurs="1" />
        <!-- full java className -->
        <element name="className" type="string" minOccurs="1" maxOccurs="1" />
        <!-- properties -->
        <element name="properties" minOccurs="0" maxOccurs="1">
          <complexType>
            <sequence>
              <element ref="tns:property" minOccurs="0"></element>
            </sequence>
          </complexType>
        </element>
        <!-- list of contained classes (MultiStagePipeline) -->
        <element name="containedClasses" minOccurs="0" maxOccurs="1">
          <complexType>
            <sequence>
              <element ref="tns:pipeline" minOccurs="0"></element>
            </sequence>
          </complexType>
        </element>
      </sequence>
    </complexType>
  </element>

  <!-- Properties as key/value pairs -->
  <element name="property">
    <complexType>
      <sequence>
        <element name="key" type="string" minOccurs="1" maxOccurs="1" />
        <element name="value" type="string" minOccurs="1" maxOccurs="1" />
      </sequence>
    </complexType>
  </element>
</schema>

Script 1: PipelineML Schema
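For illustration, an instance document conforming to this schema could look as follows; the identifier and class names are invented and merely show how a multi-stage pipeline nests a contained pipeline:

<?xml version="1.0" encoding="UTF-8"?>
<pipeline xmlns="http://de.uni-muenster.wi.dsteffen.dox/PipelineML">
  <id>sentiment-pipeline</id>
  <className>de.example.pipeline.MultiStagePipeline</className>
  <properties>
    <property>
      <key>language</key>
      <value>en</value>
    </property>
  </properties>
  <containedClasses>
    <pipeline>
      <id>tokenizer</id>
      <className>de.example.pipeline.TokenizerStage</className>
    </pipeline>
  </containedClasses>
</pipeline>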


Evaluation

Table 13: MPQA Translation file


Annotation pattern => translated vector (Pos, Neg, Obj), scaled by intensity:

GATE_expressive-subjectivity GATE_direct-subjective, polarity="negative":
intensity="high" => (0, 1, 0); "medium-high" => (0, 0.8, 0); "medium" => (0, 0.6, 0); "low-medium" => (0, 0.4, 0); "low" => (0, 0.2, 0)

GATE_expressive-subjectivity GATE_direct-subjective, polarity="positive":
intensity="high" => (1, 0, 0); "medium-high" => (0.8, 0, 0); "medium" => (0.6, 0, 0); "low-medium" => (0.4, 0, 0); "low" => (0.2, 0, 0)

GATE_expressive-subjectivity GATE_direct-subjective, polarity="neutral":
intensity="high" => (0, 0, 1); "medium-high" => (0, 0, 0.8); "medium" => (0, 0, 0.6); "low-medium" => (0, 0, 0.4); "low" => (0, 0, 0.2)

GATE_attitude, attitude-type="arguing-neg" / "sentiment-neg" / "agreement-neg":
intensity="high" => (0, 1, 0); "medium-high" => (0, 0.8, 0); "medium" => (0, 0.6, 0); "low-medium" => (0, 0.4, 0); "low" => (0, 0.2, 0)

GATE_attitude, attitude-type="arguing-pos" / "sentiment-pos" / "agreement-pos":
intensity="high" => (1, 0, 0); "medium-high" => (0.8, 0, 0); "medium" => (0.6, 0, 0); "low-medium" => (0.4, 0, 0); "low" => (0.2, 0, 0)

GATE_objective-speech-event * => (0, 0, 1)

Explanation: each annotation mark on the left-hand side is translated into a vector (Pos, Neg, Obj) with the values on the right-hand side, e.g. GATE_expressive-subjectivity GATE_direct-subjective intensity="high" polarity="negative" => (0, 1, 0). A minimal sketch of how such a translation table can be applied in code follows; the actual reader used in the prototype may differ, and the patterns below are only two sample rows from Table 13.
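import java.util.LinkedHashMap;
import java.util.Map;

public class MpqaTranslator {
    // Maps an annotation pattern to its (Pos, Neg, Obj) vector.
    private final Map<String, double[]> table =
            new LinkedHashMap<String, double[]>();

    public void add(String pattern, double pos, double neg, double obj) {
        table.put(pattern, new double[] { pos, neg, obj });
    }

    // Returns the vector of the first pattern contained in the annotation string.
    public double[] translate(String annotation) {
        for (Map.Entry<String, double[]> e : table.entrySet()) {
            if (annotation.contains(e.getKey())) {
                return e.getValue();
            }
        }
        return new double[] { 0, 0, 0 }; // no matching mark
    }

    public static void main(String[] args) {
        MpqaTranslator t = new MpqaTranslator();
        t.add("attitude-type=\"sentiment-neg\" intensity=\"high\"", 0, 1, 0);
        t.add("attitude-type=\"sentiment-pos\" intensity=\"medium\"", 0.6, 0, 0);
        double[] v = t.translate(
                "GATE_attitude attitude-type=\"sentiment-pos\" intensity=\"medium\"");
        System.out.printf("(Pos, Neg, Obj) = (%.1f, %.1f, %.1f)%n", v[0], v[1], v[2]);
    }
}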


Acknowledgments
Academic work relies on many visible and invisible contributions from lovely people. I would like to take the time to acknowledge and thank those whose contributions and support have made this thesis possible. First and foremost, I would like to thank my parents and my brother for their continuing support and encouragement throughout the course of my studies. At the University of Münster, I would like to thank Prof. Dr. rer. nat. Gottfried Vossen and my adviser Dr. rer. nat. Jens Lechtenbürger for their time, support, and thoughtful feedback. My thanks also go out to the Informationsfabrik and its staff for their support! I would like to especially thank Thomas Löchte, without whom this thesis would not have been possible, and Dr. Bodo Hüsemann for his tremendous support as my adviser. Finally, I feel that I have to acknowledge the Apache Software Foundation and thank its contributors, committers, and supporters. The open-source software projects curated by the foundation are an immense asset to society and a major contribution to this thesis. Thank you all!


Declaration of Authorship
I hereby declare that, to the best of my knowledge and belief, this Master's thesis titled "Parallelized Analysis of Opinions and their Diffusion in Online Sources" is my own work. I confirm that each significant contribution to and quotation in this thesis that originates from the work or works of others is indicated by proper use of citations and references.
