(CS283MiniProject) Report - Sambayan and Satuito

Sentiment Analysis of Twitter Data on the Chief Justice Renato Corona Impeachment Trial
CS 283 Mini-Project
Leo Isiah D. Sambayan
leoisiah@yahoo.com
Arlene A. Satuito
lene.satuito@gmail.com
ABSTRACT
In this paper, we describe the development of our CS 283 (Data Mining) project. The project performs sentiment analysis of a current issue using data gathered from Twitter.
Keywords
Data Mining, Sentiment Analysis, Twitter sentiment analysis.
1. INTRODUCTION
Tweets, a form of microblogging, are a popular means of communication on the web. The Twitter website, which was launched in July 2006, is responsible for popularizing this form of communication. Tweets allow users to broadcast brief text of at most 140 characters to the public or to a selected group of contacts. These tweets can convey information about the mood of their authors. Mood can be expressed explicitly, e.g. "I hate CJ Corona" clearly indicates a negative feeling. A message can also reflect the user's mood implicitly, e.g. "The senators in the impeachment trial are amazing" suggests a positive sentiment. There are also tweets like "CJ Corona will not testify" that are considered neutral or objective and do not contain sentiment. Considering the positive and negative sentiments that can be deduced, tweets may be regarded as microscopic instantiations of mood. It follows that the collection of all tweets published over a given time period can unveil changes in the state of public mood at a larger scale.
Classifying the polarity of a given text, that is, whether it conveys a positive or a negative mood, is the basic task of sentiment analysis. Sentiment analysis is an application of natural language processing (NLP), computational linguistics, and text analytics that deals with the computational treatment of opinion, sentiment, and subjectivity in text. Automated sentiment analysis can be performed by computers using various machine learning algorithms such as Naive Bayes, Support Vector Machine, and Maximum Entropy, or using a bag-of-words model.
In this mini-project, sentiment analysis is carried out on tweets posted on Twitter regarding a very current issue in our nation today: the impeachment trial of the current Philippine Supreme Court Chief Justice Renato C. Corona. We examine trends in the number of positive and negative tweets on the subject of the impeachment trial proceedings, not necessarily on Chief Justice Corona himself, and try to determine the causes of the changes in these trends.
2. LITERATURE REVIEW
Many papers have been written on sentiment analysis for microblogs, blogs, and product reviews, and we drew on some of them in making our project. Bollen, Mao, and Pepe [1] performed a sentiment analysis of all tweets published on the microblogging platform Twitter in the second half of 2008. They used a psychometric instrument to extract six mood states (tension, depression, anger, vigor, fatigue, and confusion) from the aggregated Twitter content and computed a six-dimensional mood vector for each day in the timeline. They compared the results of their sentiment analysis to a record of popular events gathered from media and sources and found that events in the social, political, cultural, and economic spheres do have a significant, immediate, and highly specific effect on the various dimensions of public mood. They speculated that large-scale analyses of mood can provide a solid platform for modeling collective emotive trends in terms of their predictive value with regard to existing social and economic indicators.
Pak and Paroubek [2] also used Twitter for sentiment analysis and showed how to automatically collect a corpus for sentiment analysis and opinion mining purposes. They performed a linguistic analysis of the collected corpus and explained the phenomena they discovered. Using the corpus, they built a sentiment classifier based on the multinomial Naive Bayes classifier that was able to determine positive, negative, and neutral sentiments in a document. Although they worked with the English language in their research, they claimed that their proposed technique can be used with any other language.
Ellis [3] developed a word-counting classification algorithm in Python which utilises the LIWC2007 (Linguistic Inquiry and Word Count) [4] word set to classify a user's corpus of tweets as positive or negative, along with a measure of extent. The method employs a simple word-counting algorithm whereby the words in each text fragment (in this case, a tweet) are compared against a list of positive and negative words. The occurrences of each type of word are counted, and tweets which contain more positive words than negative words are classified as positive, and vice versa for negative words. Each tweet is tokenized and compared word for word against the words and prefixes in the two super-categories. Counts are kept and the tweet is assigned to the dominant class. In addition, the algorithm takes into consideration a number of popular emoticons. The presence of emoticons overrides the word-based classification: if a tweet contains one or more positive emoticons and one or more negative emoticons, it is immediately classified as 0 (neutral); if it contains one or more positive emoticons only, it is classified as 1; and if it contains one or more negative emoticons only, it is classified as -1. This is said to have improved the accuracy of the classifier, as emoticons are generally far better indications of mood than interpretations of natural language.
In this project, we tried two classifiers: the word-counting classification algorithm using the AFINN lexicon, and a second classifier that modifies Ellis's algorithm to use the valence that the AFINN list assigns to each English word. AFINN [5] is a list of English words rated for valence with an integer between minus five (negative) and plus five (positive). The words were manually labeled by Finn Årup Nielsen in 2009-2011. This sentiment lexicon has 2477 English words (including a few phrases), each labeled with a sentiment strength, and is targeted towards sentiment analysis of short texts such as those found in social media. Nielsen used it in his studies, among them "Twitter text mining for sentiment analysis of COP15" [6], "A new ANEW: Evaluation of a word list for sentiment analysis in microblogs" [5], and "Good Friends, Bad News - Affect and Virality in Twitter" [7].
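As a rough illustration of how a valence lexicon of this kind can be loaded for lookup, the sketch below parses a few tab-separated word/valence lines. The sample entries and their scores are illustrative assumptions, not values quoted from AFINN, and `load_lexicon` is a hypothetical helper name.

```python
# A few "word<TAB>valence" lines in the style of a valence word list.
# These sample entries and scores are illustrative assumptions.
SAMPLE = "amazing\t4\nhate\t-3\ncan't stand\t-3\n"

def load_lexicon(text):
    """Parse word/valence lines into a dict mapping a word or phrase to an int."""
    lexicon = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the last tab so that multi-word phrases stay intact.
        word, score = line.rsplit("\t", 1)
        lexicon[word] = int(score)
    return lexicon

lexicon = load_lexicon(SAMPLE)
print(lexicon["hate"])
```

Splitting on the last tab keeps the few multi-word phrases in such a list usable as single dictionary keys.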
3. SIGNIFICANCE OF THE PROJECT
Classical methods for measuring public opinion are time-consuming, expensive, and error-prone. Sentiment analysis of freely available data from online social networking sites such as Twitter can serve as an alternative that overcomes some of these limitations. Public opinion measured in this way can be a useful tool for guiding the decisions made by political and business entities. We can also explore how public mood patterns relate to the changes that we observe in social and economic indicators over the same time period.
4. METHODOLOGY
4.1 Data Collection
We collected our own data because there is no existing Twitter dataset on the issue that we want to analyze, which in this case is the impeachment of Chief Justice Renato Corona. Using the Twitter API, we built an application that crawls tweets and stores them in a database. The tweet crawler is still available at http://202.60.9.206:8080/TweetCrawler/. Using the TweetCrawler, we retrieved the tweets posted between February 22, 2012 and March 19, 2012 that contained the keywords 'impeachment', 'renato corona', 'cj corona', 'justice corona', 'CJTrial', 'CjonTrial', 'cjtrial2012', and 'CoronaTrial'. Within that period we gathered approximately 69,346 tweets from 13,998 users. Each tweet record contains the following information: tweet ID, date posted, text message, and user name.
4.2 Data Cleanup
The collected tweets were then cleaned up. The cleanup process includes: 1) removal of retweets from the collected data, 2) removal of punctuation marks that do not form smileys, 3) removal of common words, 4) removal of Twitter usernames mentioned in the tweet, and 5) removal of URLs mentioned in the tweet. Negation words (specifically, 'not', 'hindi', and 'di') are joined to the word that follows them with an underscore. Thus, 'not happy' becomes 'not_happy' after this step. Also, in words where a letter is repeated three or more times in a row, the run is replaced with only two of that letter (e.g. 'happppy' is replaced with 'happy'). Finally, contractions are expanded (e.g. 'they're' becomes 'they are'). This process was implemented in Perl, using regular-expression matching to delete unnecessary punctuation marks and common words. After the cleanup, the total number of tweets was reduced from 69,346 to 13,998. The cleaned data were then imported into a MySQL database for processing by the sentiment classifier.
4.3 Sentiment Classifier
Many approaches to sentiment classification have been proposed in the literature. Due to time constraints, we adopted for now the simplest methods, which are based on the bag-of-words model, and set aside the machine learning-based classifiers, since these would require more time for training and testing.
4.3.1 Simple Word-Counting Classification Algorithm
This classifier is mainly based on the simple word-counting algorithm developed by Ellis [3]. We made two lists corresponding to the set of positive words ($L_p$) and the set of negative words ($L_n$). The English words in these two lists were taken from the AFINN lexicon, and some Tagalog/Filipino words were added to both lists. After retrieving the tweet records from the database, each tweet ($T_i$) was tokenized by splitting on the space character. Each token was cross-referenced against $L_p$ and $L_n$, producing two sets, $P_i$ and $N_i$, which contain all of the positively and negatively classified words in $T_i$ respectively:
$P_i = T_i \cap L_p$
$N_i = T_i \cap L_n$
The sizes of the resulting sets, $|P_i|$ and $|N_i|$, thus give the raw counts of positively and negatively classified words for tweet $i$. The ratio of the positively classified words to the number of tokens ($S_p$) was then compared to the ratio of the negatively classified words to the number of tokens ($S_n$):
$S_p = |P_i| / |T_i|$
$S_n = |N_i| / |T_i|$
Tweet $i$ is classified as positive (1) if $S_p > S_n$, negative (-1) if $S_p < S_n$, and neutral (0) otherwise. Note that in this algorithm, positive words prefixed with 'not_' are treated as negative, while negative words prefixed with 'not_' are treated as positive. The classifier was implemented in Python and used the python-mysqldb module.
4.3.2 Word Valence Classifier Algorithm
Just as in the simple word-counting algorithm, the word valence classifier tokenizes each tweet into individual words. Each word is then looked up in the AFINN dictionary to get its corresponding valence, if any. If the word is prefixed with 'not_', its valence is taken to be the negative of the valence of the root word. The sentiment of the whole tweet is computed as the average valence of all its words that appear in the dictionary, thus ignoring unknown words.
This classifier was also implemented in Python, again using the python-mysqldb module.
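The cleanup and the two classifiers described in this methodology can be sketched together in a few lines of Python. This is only a minimal sketch under stated assumptions, not the project's actual code: the cleanup here is written in Python rather than Perl and covers only some of the steps (no retweet removal, common-word removal, smiley preservation, or contraction expansion); the tiny `LEXICON` and its scores are assumed stand-ins for the AFINN list with added Filipino words; treating 'hindi_' and 'di_' prefixes the same way as 'not_' is an assumption; and all function names are hypothetical.

```python
import re

# Assumed stand-in for the AFINN-based word list (scores are illustrative).
LEXICON = {"happy": 3, "amazing": 4, "funny": 4, "fun": 4, "hate": -3}
NEGATORS = ("not", "hindi", "di")

def clean(text):
    """Apply a subset of the cleanup steps to a single tweet."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)        # drop URLs
    text = re.sub(r"@\w+", " ", text)                # drop mentioned usernames
    text = re.sub(r"[^\w\s]", " ", text)             # drop punctuation (smileys too)
    text = re.sub(r"([a-z])\1{2,}", r"\1\1", text)   # 3+ repeated letters -> 2
    # Join a negation word to the word that follows it: "not happy" -> "not_happy".
    text = re.sub(r"\b(%s)\s+(\w+)" % "|".join(NEGATORS), r"\1_\2", text)
    return " ".join(text.split())

def valence(token):
    """Valence of a token, or None if unknown; a negation prefix flips the sign."""
    sign = 1
    for neg in NEGATORS:
        if token.startswith(neg + "_"):
            token, sign = token[len(neg) + 1:], -1
            break
    score = LEXICON.get(token)
    return None if score is None else sign * score

def classify_by_count(text):
    """Word-counting classifier: 1 if S_p > S_n, -1 if S_p < S_n, else 0."""
    tokens = clean(text).split()
    scores = [valence(t) for t in tokens]
    n = len(tokens) or 1                              # avoid division by zero
    s_p = sum(1 for s in scores if s is not None and s > 0) / n
    s_n = sum(1 for s in scores if s is not None and s < 0) / n
    return (s_p > s_n) - (s_p < s_n)

def classify_by_valence(text):
    """Valence classifier: average valence over the words found in the lexicon."""
    known = [s for s in map(valence, clean(text).split()) if s is not None]
    return sum(known) / len(known) if known else 0.0

print(clean("@juan I am not happppy!!! http://example.com/x"))
print(classify_by_count("I hate CJ Corona"))
print(classify_by_valence("Funny impeachment trials. More fun in the Philippines."))
```

Because both ratios share the denominator $|T_i|$, comparing $S_p$ with $S_n$ gives the same label as comparing the raw counts; the sketch counts tokens rather than distinct words, which is a simplification.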
Table 2. Classifier Results (valence considered)
Date         Average Sentiment
22-Feb-12    -0.162292699
23-Feb-12     0.126841317
24-Feb-12    -0.026434855
25-Feb-12     0.024590824
26-Feb-12    -0.200542593
27-Feb-12     0.104274892
28-Feb-12     0.103987787
29-Feb-12     0.167835483
1-Mar-12     -0.044559281
2-Mar-12      0.089324532
3-Mar-12      0.270397926
4-Mar-12      0.442983258
5-Mar-12     -0.351885237
6-Mar-12      0.196401515
7-Mar-12      0.191881202
8-Mar-12      0.056838614
9-Mar-12     -0.56412813
10-Mar-12    -0.387908213
11-Mar-12    -0.404268065
12-Mar-12     0.063889718
13-Mar-12    -0.094811936
14-Mar-12    -0.023137363
15-Mar-12     0.067877804
16-Mar-12    -0.173219501
17-Mar-12    -0.182105797
18-Mar-12    -0.016709144
19-Mar-12     0.116675236
4.4 Plotting
After each tweet is classified, its classification is joined to its date to obtain the number of positive, negative, and neutral tweets on the CJ Corona impeachment trial for each day. The tweet database, updated with the sentiment of each tweet, is queried. The result of the SQL query is then exported as a CSV file, and the graph of the result is plotted using Microsoft Excel.
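The per-day aggregation behind this step can be sketched as follows. This is an illustrative Python equivalent, not the SQL query that was actually used; the sample rows and the function names `daily_counts` and `to_csv` are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical (date, label) rows, as they might come back from the tweet database.
rows = [("2012-02-29", 1), ("2012-02-29", -1), ("2012-02-29", 0),
        ("2012-02-29", 1), ("2012-03-01", 0)]

def daily_counts(rows):
    """Count the tweets of each sentiment label per date."""
    counts = {}
    for day, label in rows:
        counts.setdefault(day, Counter())[label] += 1
    return counts

def to_csv(counts):
    """Render the per-day counts as CSV text, ready for a spreadsheet."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["date", "negative", "positive", "neutral"])
    for day in sorted(counts):
        c = counts[day]
        writer.writerow([day, c[-1], c[1], c[0]])
    return out.getvalue()

print(to_csv(daily_counts(rows)))
```

A `Counter` returns 0 for labels that never occur on a given day, so every row of the CSV has all three columns filled.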
5. DISCUSSION OF RESULTS
Table 1 below and its corresponding graph in Figure 1 show the number of positive, negative, and neutral tweets per day as detected by the classifier which does not take into consideration the valence of each word:
Table 1. Classifier Results (valence not considered)
Date         # of Negative Tweets   # of Positive Tweets   # of Neutral Tweets
22-Feb-12    59                     56                     154
23-Feb-12    48                     71                     169
24-Feb-12    62                     74                     199
25-Feb-12    30                     30                     105
26-Feb-12    37                     39                     93
27-Feb-12    57                     92                     165
28-Feb-12    70                     121                    277
29-Feb-12    530                    590                    980
1-Mar-12     207                    189                    382
2-Mar-12     91                     121                    247
3-Mar-12     61                     89                     152
4-Mar-12     37                     88                     124
5-Mar-12     101                    94                     185
6-Mar-12     29                     45                     116
7-Mar-12     141                    188                    422
8-Mar-12     96                     104                    300
9-Mar-12     175                    97                     286
10-Mar-12    100                    72                     361
11-Mar-12    133                    117                    720
12-Mar-12    473                    580                    1746
13-Mar-12    449                    435                    1043
14-Mar-12    252                    282                    606
15-Mar-12    214                    283                    514
16-Mar-12    88                     73                     247
17-Mar-12    85                     60                     197
18-Mar-12    57                     73                     190
19-Mar-12    134                    178                    430
Notice that the number of neutral tweets dominates the number of positive and negative tweets. This is most probably due to the many tweets produced by news agencies; some of them even give a live blow-by-blow narration of the events concerning the impeachment.
Table 2 and its corresponding graph in Figure 2 show the average sentiment of the tweets per day as detected by the classifier that takes the valences of the words into consideration.
There is a significant increase in the number of tweets sent on February 29 compared to the previous day. This is the day when the prosecution team of the impeachment finished presenting its evidence. It is also the day when Senator Miriam Defensor Santiago had an outburst against the prosecution team, shouting some cuss words at them. Curiously, the classifier got a positive average sentiment for the tweets on that day. Upon inspecting the tweet database, we found that the following tweet from that day got a high sentiment value: "Funny impeachment trials. More fun in the Philippines." (from user FREYAfries). The presence of the words 'funny' and 'fun' led to this tweet getting a high sentiment value, as both words have high sentiment values in the dictionary used. For the other days, the classifier scored the highest positive sentiment on March 4 and the lowest negative sentiment on March 9.
Figure 1. Graph of the number of positive and negative tweets per day
Figure 2. Graph of the average sentiment of the tweets per day
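The extremes mentioned in the discussion can be read off Table 2 directly; the short Python check below does so. The values are copied from Table 2, while the variable names are only illustrative.

```python
from datetime import date, timedelta

# Average sentiment per day, copied from Table 2 (22 Feb 2012 to 19 Mar 2012).
avg = [-0.162292699, 0.126841317, -0.026434855, 0.024590824, -0.200542593,
       0.104274892, 0.103987787, 0.167835483, -0.044559281, 0.089324532,
       0.270397926, 0.442983258, -0.351885237, 0.196401515, 0.191881202,
       0.056838614, -0.56412813, -0.387908213, -0.404268065, 0.063889718,
       -0.094811936, -0.023137363, 0.067877804, -0.173219501, -0.182105797,
       -0.016709144, 0.116675236]

# One consecutive calendar day per value, starting on 22 February 2012.
days = [date(2012, 2, 22) + timedelta(days=i) for i in range(len(avg))]
by_day = dict(zip(days, avg))

highest = max(by_day, key=by_day.get)   # day with the most positive average
lowest = min(by_day, key=by_day.get)    # day with the most negative average
print(highest, by_day[highest])
print(lowest, by_day[lowest])
```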
6. CONCLUSION AND SUGGESTIONS FOR FUTURE WORK

In this work, we performed data mining using Twitter as the source of our data for sentiment analysis. A simple word-counting algorithm and a valence-averaging algorithm were used to classify the sentiment of a tweet. Since sentiment classification of tweets is a crucial part of this project, we suggest experimenting with machine learning models such as Naive Bayes, SVM, and Maximum Entropy to improve the sentiment classifier for tweets containing Filipino or Tagalog words. The syntactic form of the tweets and the grammatical relationships of the words used should also be considered, so collaboration with a Filipino linguist and a psychologist is recommended. The approach presented here may be applied to sentiment analyses of other subjects such as the RH Bill, Freedom of Information, K-12, etc.
7. REFERENCES
[1] Bollen, J., Mao, H., and Pepe, A. (2011). Modeling Public Mood and Emotion: Twitter Sentiment and Socio-Economic Phenomena. Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, pp. 450-453.
[2] Pak, A. and Paroubek, P. Twitter as a Corpus for Sentiment Analysis and Opinion Mining.
[3] Ellis, J. (2011). Community 2.0: A Sociometric Analysis of London-Based Tweeters.
[4] Pennebaker, J.W., Chung, C.K., Ireland, M., Gonzales, A., and Booth, R.J. (2007). The development and psychometric properties of LIWC2007. [Software manual]. Austin, TX (www.liwc.net).
[5] Nielsen, F. Å. (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. Proceedings of the ESWC2011 Workshop on 'Making Sense of Microposts': Big things come in small packages, CEUR Workshop Proceedings 718, pp. 93-98. May 2011. http://arxiv.org/abs/1103.2903.
[6] Nielsen, F. Å. Twitter text mining for sentiment analysis of COP15. http://fnielsen.posterous.com/twittertext-mining-for-sentiment-analysis-of= March 12, 2012.
[7] Hansen, L. K., Arvidsson, A., Nielsen, F. Å., Colleoni, E., and Etter, M. Good Friends, Bad News - Affect and Virality in Twitter. The 2011 International Workshop on Social Computing, Network, and Services (SocialComNet 2011).
