B.Sc. Dissertation
RECIFE
2016
Federal University of Pernambuco
Center for Informatics
B.Sc. in Computer Science
Acknowledgements
Thank you.
He who is not courageous enough to take risks will accomplish nothing in life.
MUHAMMAD ALI
Abstract
With the spread of Web 2.0 usage, it has become popular practice to share content on microblogging websites, where users post their thoughts in short, succinct texts. One of the most popular microblogging websites is Twitter, which limits its users' posts to at most 140 characters. These short texts are highly informal and usually express opinions, which has drawn interest from industry in mining those opinions to better understand how brands and products are perceived in the market. The purpose of this work is the development of a framework that performs sentiment analysis on tweets by leveraging Deep Learning techniques. First, a module for extracting unstructured data from Twitter (through its public API) is developed, together with a pre-processing phase based on natural language processing techniques. Then, a Deep Learning approach for binary text classification and word embedding using convolutional neural networks is presented. Lastly, a statistical analysis of the algorithm's performance is shown, along with a comparative study of how other, more traditional algorithms perform in this short and informal text classification context.
Keywords: Sentiment analysis, text classification, deep learning, natural language processing,
Twitter, microblogging, social networks, opinion mining.
List of Figures
4.1 Distribution of the length of tweets with outliers identified through the interquartile range method.  38
4.2 History of the models' logarithmic loss over 10 epochs of training.  41
4.3 History of the models' training accuracy over 10 epochs of training.  42
List of Abbreviations
AI  Artificial Intelligence  22
ANN  Artificial Neural Networks  24
BOW  Bag-of-words  23
CV  Computer Vision  22
CNN  Convolutional Neural Network  25
DAG  Directed Acyclic Graph  25
IQR  Interquartile Range  38
LR  Lexical Resource  23
LogReg  Logistic Regression  35
LSTM  Long Short-Term Memory  21
NLP  Natural Language Processing  21
OCR  Optical Character Recognition  22
PMI  Pointwise Mutual Information  23
POS  Part-of-Speech Tagging  23
RNN  Recurrent Neural Network  44
SGD  Stochastic Gradient Descent  25
SO  Semantic Orientation  23
SVM  Support Vector Machine  21
TF-IDF  Term Frequency-Inverse Document Frequency  23
Contents
1 Introduction  19
1.1 Objectives  20
1.2 Document Structure  20
2 Basic Concepts  21
2.1 Natural Language Processing  21
2.2 Machine Learning  22
2.3 Sentiment Analysis  22
2.4 Deep Learning  24
2.4.1 Convolutional Neural Networks  25
2.4.2 Word Embedding  26
3 Development  29
3.1 Data Collection  29
3.1.1 Distant Supervision  30
3.2 Data Pre-Processing  32
3.2.1 Tokenization  32
3.2.2 Normalization  32
3.3 Word Embedding  33
3.4 Convolutional Neural Network  33
3.5 Baseline Models  35
4 Results and Analysis  37
5 Conclusion  43
5.1 Future Work  44
References  45
Appendix  49
A TweetCrawler 51
B Preprocessing 55
B.1 Parser  55
B.2 Corpus Reader  57
C SingleLayerCNN 59
1 Introduction
Since Web 2.0 has had its main content generated by its users, data has grown at an exponential rate - it is estimated that over 90% of the world's data was generated over the last two years alone, according to IBM1. One of the most used mechanisms in this new age of information is inter-device operability: content is created not only on notebooks and desktops but on many devices connected to the world wide web - especially smartphones, with people connected to the web through most of the day while consuming and generating user content.
Some of the most popular applications within Web 2.0 have been the social networks, in which users can share and generate large amounts of content in text, photos and videos. The surge of new social networks and platforms in which users can publish their own opinions, texts and thoughts allowed for the creation of a new blogging medium known as microblogging, whose users share their opinions and thoughts in short and usually very informal texts, in a fashion very similar to regular SMS messages. Within microblogging, one particular website has stood out: Twitter, in which users are limited to 140 characters per post. Since many of those tweets contain short opinions on products, brands and other subjects of interest, this community gathered a lot of attention from industry and academia, both interested in mining this massive amount of data generated on a daily basis.
One of the main branches within opinion mining is called sentiment analysis, which focuses on determining the polarity of a text as positive or negative (or also neutral). Many techniques were developed and a lot of research was done with the purpose of analyzing these short microblogging texts, usually by applying a machine learning classifier to label a text as either positive or negative. However, many of the most used vector representations of text data for classifiers are sub-optimal, leading to very sparse representations that are hard to scale at a production level. New approaches have then started to emerge with the intent of reducing the need for sparse representations and scaling classification for the big data era, while still trying to maintain the accuracy of the widespread traditional machine learning techniques.
With the advance of hardware technology - especially RAM and GPU
1 Source: https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
development, models based on deep, multi-layered feed-forward neural networks started to draw attention from the research community. This branch of machine learning algorithms, called deep learning, has since dominated academia: although its main concept has existed for approximately 20 years, only with the recent major advances in computational power did it become possible to develop efficient, scalable implementations for the current big data era.
Deep learning was initially adopted in computer vision and image recognition applications, but has since found its way into natural language processing. Its main contribution to the field has been finding ways to build more representative and informative text representations while also reducing the sparsity of the previously adopted vector representations of text, which led to dense and continuous representations of words in a low-dimensional embedding space (a few hundred dimensions, compared to millions in sparse representations). These new ways to represent text data then led to the development of new classifier architectures - especially deep, feed-forward neural networks - that support those new representations, making it very important to research new algorithms that use these techniques to achieve state-of-the-art performance while also leveraging the greatly improved physical hardware resources.
1.1 Objectives
The main objective of this work is the implementation of a binary classifier with a deep learning-based approach for text classification, specifically text analysis focused on the short and informal texts of the microblogging context.

This work also presents the implementation of a full framework that collects text data from Twitter (through its publicly available APIs) and performs the required data pre-processing steps using natural language processing techniques. The framework developed in this work is also compared with other widely used machine learning classifiers for sentiment analysis, along with a study and interpretation of the results found.
2
Basic Concepts
This chapter presents the main technical concepts used throughout this work. The fields of natural language processing and supervised machine learning are discussed first. These introductions are followed by a deeper study of deep learning, with a strong emphasis on applications to textual datasets. Lastly, the techniques of word embedding and convolutional neural networks are discussed within the context of deep learning.
1 N-grams are sequences of N tokens which appear consecutively across corpora. E.g.: "cat sit walk" is a trigram (3-gram), while "cat sit" is a bigram (2-gram).
a set of documents. The granularity of the sentiment analysis depends on the needs of the problem. For a given task, the polarity might be defined at the corpus level, at the document level, or even at the sentence or word level. The most common targets of sentiment analysis are the sentence and document levels.
In the early developments of sentiment analysis research, the techniques employed often made some use of a Lexical Resource (LR) in order to achieve classification. Those methods relied primarily on word-level polarity defined beforehand, using those LRs as features for classification. Turney [Tur02] defined an unsupervised method to infer a word's Semantic Orientation (SO) based on the difference between the word's Pointwise Mutual Information (PMI) with the word "excellent" (which has a naturally positive SO) and the word's PMI with the word "poor" (which has a naturally negative SO), where PMI is defined as:

PMI(w1, w2) = log2( p(w1, w2) / (p(w1) p(w2)) )

where p(w) is the probability of occurrence of the word w and p(w1, w2) is the probability of the words w1 and w2 occurring together. The semantic orientation of the whole sentence was then computed in a similar fashion, based on the difference between the phrase's PMI with "excellent" and its PMI with "poor".
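Turney's measure can be sketched in a few lines of Python. The probabilities are assumed to have been estimated beforehand (e.g., from co-occurrence counts in a corpus); the function names are illustrative, not taken from any particular library.

```python
import math

def pmi(p_joint, p_w1, p_w2):
    """Pointwise mutual information of two words, given their joint
    probability of occurrence and their marginal probabilities."""
    return math.log2(p_joint / (p_w1 * p_w2))

def semantic_orientation(p_with_excellent, p_phrase, p_excellent,
                         p_with_poor, p_poor):
    """Turney-style SO: PMI with "excellent" minus PMI with "poor"."""
    return (pmi(p_with_excellent, p_phrase, p_excellent)
            - pmi(p_with_poor, p_phrase, p_poor))
```

A phrase that co-occurs with "excellent" more often than chance and with "poor" less often than chance gets a positive orientation.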
Hu and Liu's work [HL04] was also one of the first breakthroughs that awakened academic interest in the field. They found words that were likely to be opinionated based on a seed list, identifying through Part-of-Speech Tagging (POS) adjectives that were either synonyms of a seed word (and therefore should have the same polarity) or antonyms of a seed word (and should have the opposite polarity). The sentence-level classification was then based on a naive rolling count, in which the polarity is given by the difference between the number of likely positive words and the number of likely negative words. The emergence of lexicon-based methods also prompted the creation of polarity-specific LRs, such as the SentiWordNet [ES06] polarity dictionary.
Lexicon-based algorithms were followed by then-new state-of-the-art classification models that used cleverer counting mechanisms - primarily Term Frequency-Inverse Document Frequency (TF-IDF). For a word i in a document j, the TF-IDF score of that word is defined as the product of the term frequency tf(i, j) - the number of occurrences of word i in document j - and the inverse document frequency idf(i) - defined as log2(N / df(i)), where N is the total number of documents and df(i) is the number of documents containing the word i. This method accounts both for the occurrence of the word within the document itself and for the relevance of the word within the whole corpus, given that some words (prepositions, for instance) generally appear more often than others and thus carry no discriminative value.

The document is then defined as a vector in a Bag-of-words (BOW) model, in which the document is represented as an unordered collection of N-grams. This vector is defined in the R^(V^N) space, where V is the cardinality of the vocabulary and N is the size of the N-grams used to represent the tokens within the BOW model. That representation results in very
sparse vectors of length V^N for every document in the corpus, independent of its true length (in number of words). Even with very sparse vector representations of documents, logistic regression classifiers and SVMs [Joa98] showed great performance in supervised environments for text classification, thus becoming the state-of-the-art models.
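The scoring scheme above can be sketched for the unigram (N = 1) case with a base-2 logarithm, as in the definition; the helper names are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of tokenized documents. Returns one {word: score} map
    per document, with score = tf(i, j) * log2(N / df(i))."""
    n_docs = len(docs)
    # document frequency: in how many documents each word appears
    df = Counter(word for doc in docs for word in set(doc))
    return [{word: tf * math.log2(n_docs / df[word])
             for word, tf in Counter(doc).items()}
            for doc in docs]
```

A word appearing in every document (like "movie" below) scores 0, carrying no discriminative value, while document-specific words are weighted up.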
Figure 2.1: Basic multi-layer perceptron - an ANN architecture with fully connected input and hidden layers.
In ANNs, the values passed into the network through the input layer are usually applied to a non-linear function in the neurons, such that the network produces better generalizations and adaptations for more complex scenarios, thus being able to capture non-linear relationships and correlations. Although there are many branches of deep architectures, most of them branch out from their parent ANN architectures. Deep regular neural networks are trained via Stochastic Gradient Descent (SGD) and have their weights adjusted via backpropagation. The networks are modeled either as feed-forward networks or as Directed Acyclic Graphs (DAG), with multiple hidden layers - each containing a very high number of hidden units.
Figure 2.2: Illustrative example of a convolutional filter operation over a local section of
a 3-dimensional input feature vector.
The convolutional layer is often followed by the pooling layer, which performs an aggregation operation over a subsection of the output of the convolutional layer, drastically reducing the size of the vector representation. The most common pooling operation is max-pooling, which takes the maximum value over each consecutive section of an arbitrary size N of the vector, or over the entire vector itself.
The dropout layer is a fast and computationally cheap way to achieve (empirically) good regularization. In order to prevent overfitting, for each connection or weight in the previous layer the dropout operation either resets the weight to 0 with a probability p or keeps the weight "as is" with probability 1 - p.
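Both operations are simple to state in code. A minimal sketch over plain Python lists follows (real layers operate on tensors, and practical dropout implementations also rescale the kept values, which is omitted here to match the description above):

```python
import random

def max_pool(values, size):
    """Max-pooling: the maximum over each consecutive window of `size` values."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]

def dropout(weights, p, rng=random.random):
    """Reset each weight to 0 with probability p, keep it as is otherwise."""
    return [0.0 if rng() < p else w for w in weights]
```

Setting `size` to the full vector length gives the 1-max pooling used later in this work.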
Although CNNs are a popular architecture for many supervised tasks nowadays, they were already being used on some image classification datasets in the beginning of the 2000s. However, the datasets back then were too small and the computational processing power too weak [LeC16], such that generative Bayesian algorithms [FFFP07] still outperformed CNN approaches [LGRN09]. It was not until several years later, with the advance of general-purpose GPU processors combined with very large-scale image datasets of more than 1 million training examples spanning 1000 possible categories [RDS+15], that breakthrough performance was reported on classification tasks using CNNs [KSH12].
lookup function f : V → R^n, where V is a set of words (the vocabulary) and n is the dimensionality of the word vector representations defined by the function designer. The parameter n is usually set such that the dimensional space is fairly high, often with 100 ≤ n ≤ 500, n ∈ N.
Figure 2.4 - Source: [Ron14]
The model depicted in figure 2.4, introduced by Mikolov et al. in 2013 [MSC+13], is known as word2vec (short for "word to vector"), specifically the skip-gram model. In this model, the goal of the network is to predict the context words given a single word. The model is initialized with random Gaussian values for the words' initial representations, and the weights are updated as the network is trained. The learned vector representations have interesting properties, such as directions encoding semantic properties and basic vector operations yielding semantic results - the most used example is that f(king) - f(man) + f(woman) yields a vector very close to f(queen) [MYZ13].
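The vector-offset property can be illustrated with hand-crafted toy vectors in which one axis encodes "royalty" and the other "gender". Real word2vec vectors are learned, high-dimensional, and only approximately satisfy such relations; this is a deterministic toy constructed so the analogy holds exactly.

```python
# Toy 2-d "embeddings": axis 0 ~ royalty, axis 1 ~ gender.
f = {
    "king":  (1.0, 1.0),
    "queen": (1.0, 0.0),
    "man":   (0.0, 1.0),
    "woman": (0.0, 0.0),
}

def analogy(a, b, c, table):
    """Word whose vector is closest (squared Euclidean distance) to
    table[a] - table[b] + table[c]."""
    target = [x - y + z for x, y, z in zip(table[a], table[b], table[c])]
    return min(table, key=lambda w: sum((t - v) ** 2
                                        for t, v in zip(target, table[w])))
```

Here analogy("king", "man", "woman", f) returns "queen", mirroring f(king) - f(man) + f(woman) ≈ f(queen).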
The word vector representations are typically used as input to deep neural networks for different tasks, such as convolutional neural networks for text classification [Kim14]. There are a number of different word embeddings [BDVJ03] [PSM14] [MSC+13], and the literature has shown that although the choice of word embedding algorithm has an impact on the overall performance of the learning algorithm, no single representation is better than the others across all datasets; the best choice is application-specific [ZW15]. In this work, pre-trained word2vec representations are used as inputs and further tuned by the network - i.e., the vector representations of the words are not static.
3 Development
The framework developed in this work has many distinct layers, spanning from data collection through public APIs to the development and subsequent evaluation of a deep learning-based convolutional neural network classifier. This chapter presents the main libraries and development techniques used in the construction of this work. First, there is an introduction to Twitter's Streaming API, which provides the data collected for classification in this work. Then, the steps required for data pre-processing and label extraction based on weakly supervised signals are discussed, followed by a presentation of the libraries used for training the word vectors. Lastly, the general architecture of the model is shown, along with the resources needed for its development.
For the data collection phase, an endpoint provided by Twitter for collecting real-time data, known as the Streaming API1, was used. This endpoint provides low-latency, high-frequency data for developers. This kind of endpoint is different from a REST API, since it requires keeping an open connection that receives and processes data as it arrives from the real-time stream. The data received from the stream can be freely manipulated and stored for later use, and it is possible to filter the stream by a wide range of parameters, such as keywords, language and location. An application that runs with Twitter's Streaming API usually follows the flow depicted in figure 3.2.
1 https://dev.twitter.com/streaming/overview
Figure 3.2 - Source: https://dev.twitter.com/streaming/overview
This API provides a 1% sample of the whole live stream of tweets on the network. Given the microblogging platform's immense popularity (in January 2016, it was estimated that there were on average 303 million tweets per day2), public tweets are very easy to collect given the stream's high throughput, which made the Streaming API a very attractive resource for researchers and practitioners. Although the Firehose Stream API3 provides 100% of the public statuses published on the network, this endpoint has a very high financial cost associated (both for bandwidth and for licensing and permission of use). Although the Streaming API is a highly used resource for data collection (especially of short, informal texts), it should be noted that studies have shown that Twitter's sampling method for selecting the 1% for the streaming connection might contain bias depending on the time period in which the data is collected [MPLC13] [MPL14], such as when big worldwide events are occurring (e.g., during the 2014 World Cup); this concern is not addressed in this work.
Although crowdsourcing platforms allow simple tasks (such as the categorization of short texts) to be done by human workers who are paid on a task-completion basis, this is not cost-effective and should therefore be used only when there are no other viable alternatives.
In order to eliminate the need for annotating the data by hand, a labeling technique based on pseudo-signals called distant supervision [GBH09] was used. In casual conversation on the web, it is common practice to use emojis (also "smiley faces" or "emoticons") as a way to express one's sentiment together with the message. Table 3.1 shows a mapping of some of the most popular smileys to the sentiment associated with them.

Those smileys are used as pseudo-labels that propagate their sentiment over the text. This means that whenever a user includes one of those smileys in the text, it is likely that the user expressed the same sentiment as the smiley used (with the exception of sarcasm). Those smileys were passed as parameters to track tweets from the Streaming API that contained at least one of the emoticons in table 3.1 and were written in English.
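The pseudo-labeling rule can be sketched as follows; the emoticon sets here are a small illustrative subset standing in for the full mapping of table 3.1.

```python
# Illustrative subsets of the emoticon-to-sentiment mapping (table 3.1).
POSITIVE = {":)", ":-)", ":D", "=)"}
NEGATIVE = {":(", ":-(", "=("}

def distant_label(tokens):
    """Pseudo-label a tokenized tweet from its emoticons.
    Returns None when there is no (or a contradictory) signal."""
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos == has_neg:      # neither emoticon, or both at once
        return None
    return "positive" if has_pos else "negative"
```

Discarding tweets with contradictory emoticons keeps the pseudo-labels as clean as this weak signal allows.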
The description above is a high-level abstraction of the implementation of the data collector, which is a Python6 class that leverages an open-source library, python-twitter7, which wraps requests to many of Twitter's endpoints for usage within Python code8. The full code of the crawler is listed in Appendix A.
6 Widely used programming language for scientific computing. Homepage: https://www.python.org.
7 https://github.com/bear/python-twitter
8 Unless explicitly noted, all of the code for this work was written in Python.
3.2.1 Tokenization
Although tweets - or any text-based data - are composed of sequences of characters and words, the Streaming API does not return a tweet as a sequence of tokens, but as a raw string. The tokenization step is necessary for algorithms that work on top of sequences of words.

Twitter has a very specific grammar, and (most of) its users are highly informal in the way they express themselves. Regular tokenization techniques that split the string on punctuation, or algorithms such as maximum matching, are not good choices for tokenizing this particular form of text data. In this work, an open-source implementation9 of a Twitter-specific tokenization technique [GSO+11] [OKA10] was used to generate the tokens.
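A heavily simplified, regex-based sketch of what a Twitter-aware tokenizer does is shown below; the actual twokenize implementation used in this work handles many more cases (elaborate emoticons, abbreviations, HTML entities, and so on).

```python
import re

TOKEN_RE = re.compile(r"""
      https?://\S+          # URLs kept whole
    | @\w+                  # @-mentions
    | \#\w+                 # hashtags
    | [:;=][-o]?[()DpP]     # a few basic emoticons
    | \w+(?:'\w+)?          # words, possibly with an apostrophe
    | [^\w\s]               # any other single symbol
""", re.VERBOSE)

def tokenize(tweet):
    """Split a raw tweet string into Twitter-aware tokens."""
    return TOKEN_RE.findall(tweet)
```

Unlike punctuation-based splitting, this keeps ":)", "#fun", "@user" and URLs as single tokens, which the later pre-processing steps rely on.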
3.2.2 Normalization
Some of the most common words, such as prepositions, articles and some conjunctions, do not carry much semantic value by themselves. These words, which have no significance in determining the meaning of the sentence, are known as stopwords. Each of these words is removed from the sentence - with the exception of the negation "not", which is very important for sentiment classification, given that it inverts the polarity of what comes after it. Punctuation and tokens shorter than three (3) characters are also removed.
On Twitter, it is possible to mention another user in the network by placing an "at" (@) symbol before the other's username. Although it could be useful to know whether a tweet contains a mention, knowing which user is mentioned is not very useful when analyzing the polarity of a single tweet. The same logic applies to URLs: although it could be important to know whether a user tweeted a text together with a URL, it is probably not that useful to store which URL it is. Therefore, every URL and mention found in a tweet is replaced by a special placeholder token, as seen in table 3.2.
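This replacement step can be sketched with two regular expressions; the placeholder token names here are illustrative stand-ins for the ones defined in table 3.2.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def replace_placeholders(text, url_token="<url/>", mention_token="<mention/>"):
    """Replace every URL and @-mention with a special placeholder token."""
    return MENTION_RE.sub(mention_token, URL_RE.sub(url_token, text))
```

The classifier then sees only the fact that a URL or mention was present, not its specific value, keeping the vocabulary small.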
Another common step in preprocessing text data is to apply stemming to each word. Stemming reduces variations, inflectional and derivative forms of a word to its root (or stem) - e.g., the stem for "automate", "automates", "automatic" and "automation" is "automat". It should be noted that no stemming was applied to the tokens in this work, given that the words in this kind of text are usually already abbreviated or shortened versions of the
9 Source code of the implementation at: https://github.com/myleott/ark-twokenize-py/blob/master/twokenize.py.
original word, such that applying stemming does not lead to much improvement in reducing the vocabulary size.
The remaining smiley faces in each tweet are removed. This step is necessary because the label is assigned to the tweet according to the emoticon found in the sentence; keeping it would introduce bias in the classifier - e.g., every time a sentence with a smiley face was to be classified, it would always yield the same prediction as the polarity of the smiley.
Finally, the sequence is padded to a fixed length of tokens. This step is performed only for the convolutional neural network classifier, since it requires a fixed-length input vector. The padding is done by prepending a special token <pad/> until the tweet reaches the expected sequence length, or by removing tokens from the beginning until the tweet is reduced to the expected length (the fixed length and the reason for this choice are discussed in the Results chapter).
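The padding rule described above can be sketched as:

```python
PAD = "<pad/>"

def pad_sequence(tokens, length):
    """Prepend <pad/> tokens up to `length`, or drop tokens from the
    beginning so that only the last `length` tokens remain."""
    if len(tokens) >= length:
        return tokens[len(tokens) - length:]
    return [PAD] * (length - len(tokens)) + tokens
```

Every tweet thus becomes a sequence of exactly `length` tokens, as required by the fixed-size input layer of the CNN.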
The techniques described in this section were used to implement an efficient, memory-friendly reader for the text data collected via Twitter's Streaming API. The code is available for reference in Appendix B.
classification [TWQ+ 14] [TWY+ 14] and for general text classification [Kim14].
There are a number of libraries that provide what is needed to model CPU-heavy, numeric computing based on data flow graphs. The most used libraries are Theano13, Berkeley's Caffe14 and Google's TensorFlow15. However, those libraries are not always easy to use, so there are also other open-source libraries that provide APIs running on top of numeric, matrix computing libraries, such as Torch16 and Keras17. In this work, Keras was used to build the convolutional neural network model, since it provides a Python API and runs on top of Theano.
The model closely follows the baselines reported in [Kim14] [ZW15]. The model implemented in this work uses pre-trained word embedding vectors (from word2vec) as the initial weights of the embedding layer. Although this technique raises the complexity of the model, it has been shown that it improves the model's performance [Kim14].
Figure 3.3: High level abstraction of the implemented convolutional neural network
model.
The model has four convolutional layers at the same depth. Each convolutional layer has
13 http://deeplearning.net/software/theano/
14 http://caffe.berkeleyvision.org
15 http://tensorflow.org
16 http://torch.ch
17 http://keras.io
its own filter length: 4, 5, 6 and 7. Using multiple filters of different sizes allows the network to capture different features at different lengths of each feature vector. 1-max pooling is then performed on the output of each convolutional filter; the outputs of the pooling layers are merged and passed through a dropout layer with 0.5 probability of keeping each feature. The output of the dropout layer is then fully connected to the 2-dimensional softmax output layer for classification. The model is trained with backpropagation, using SGD for optimization and categorical cross-entropy as the loss. The full model architecture is presented in figure 3.3, and its implementation in Appendix C.
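For intuition, the forward pass of this architecture can be sketched in NumPy with a single feature map per filter size. This is a deliberately reduced sketch: the actual Keras model in Appendix C uses many feature maps per size, a trainable embedding layer, and dropout, which applies only at training time and is omitted here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cnn_forward(embedded, filters, dense_w, dense_b):
    """embedded: (seq_len, emb_dim) matrix, one row per token.
    filters:  list of (width, emb_dim) convolutional filters.
    dense_w:  (len(filters), 2) output-layer weights; dense_b: (2,) bias."""
    pooled = []
    for filt in filters:
        width = filt.shape[0]
        # Slide the filter over every window of `width` consecutive tokens.
        feats = [np.tanh(np.sum(embedded[i:i + width] * filt))
                 for i in range(embedded.shape[0] - width + 1)]
        pooled.append(max(feats))        # 1-max pooling per filter
    return softmax(np.array(pooled) @ dense_w + dense_b)
```

With filter widths of 4, 5, 6 and 7 this yields one pooled feature per width, which the softmax layer turns into the two class probabilities.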
18 http://scikit-learn.org/stable/
4 Results and Analysis
In order to validate the effectiveness of the model and techniques described, the model is tested against traditional baseline supervised machine learning classifiers: a logistic regression classifier and a support vector machine. This chapter describes the environment in which the tests were performed, the collected dataset, and the performance of the aforementioned classifiers.
4.1 Environment
Unless explicitly noted otherwise, every test described in this chapter was run in the environment described below.
Hardware Environment
Software Environment
gensim 0.12.4
keras 1.1.0
nltk 3.2.1
numpy 1.11.2
scikit-learn 0.18
tensorflow 0.11.0rc0
theano 0.9.0
Figure 4.1: Distribution of the length of tweets with outliers identified through the
interquartile range method.
A simple technique to detect outliers is the Interquartile Range (IQR) method. The IQR is defined as the difference between the third quartile (Q3) and the first quartile (Q1). A potential outlier is defined as an observation that falls either below Q1 - 1.5 IQR or above Q3 + 1.5 IQR. The box plot visualization in figure 4.1 helps identify observations that fall outside of this valid range.
Through that simple rule, it is possible to determine that tweets with a length of at least 19 tokens should be considered outliers; the sequence padding length was then defined as 20. Every tweet of length L > 20 has its first L - 20 tokens stripped from the sequence, and every tweet of length L < 20 has 20 - L special padding tokens <pad/> prepended to the beginning of the sequence.
4.4 Classifiers
Three main models were compared: the main CNN architecture developed in this work, a supervised logistic regression classifier, and a support vector machine with a linear kernel. This section provides a description of the developed models' complexity and an evaluation of their performance in contrast with the two baseline classifiers.
The increasing number of parameters is mainly due to the fact that the word vectors
are passed as initial weights to the embedding layers. This technique allows the word vectors
to be refined and have their values updated as the model is trained. This approach is not always
necessary: in some experimental settings, just using the static pre-trained word vectors learned
a priori yields almost as good performance as using non-static vectors [ZW15]. Table 4.4
below shows how each layer influences the number of parameters in each model.
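As a rough illustration of why the embedding layer dominates the parameter count (the vocabulary size and dimensionalities below are hypothetical, not the thesis's actual figures):

```python
def embedding_param_count(vocab_size, emb_dim):
    """A non-static embedding layer trains one weight per (word, dimension)."""
    return vocab_size * emb_dim

# Doubling the embedding dimensionality doubles the layer's parameter count.
small = embedding_param_count(50000, 100)   # 100-dimensional vectors
large = embedding_param_count(50000, 200)   # 200-dimensional vectors
```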
Table 4.4: Number of parameters for each layer of the CNN models.
In this section, the model proposed in this work has its performance compared with two
of the most widely used traditional machine learning algorithms in text classification: logistic
regression and support vector machines. Figure 4.2 shows how the logarithmic loss of each model
decreases over the epochs, and Figure 4.3 shows how the training accuracy increases over the
training epochs. It is observed that the higher the dimensionality of the word embedding, the
smaller the model's logarithmic loss and the higher its training accuracy. The results of the
evaluation of the CNN model variations and of the baseline models are listed in Table 4.5. The
models were evaluated on precision, recall, macro F-score and accuracy. For the baseline models,
the mean and standard deviation across 3-fold cross validation are reported as (mean, std).
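The (mean, std) figures for the baselines can be obtained with a scikit-learn pipeline along these lines (the six documents below are a toy stand-in for the preprocessed tweet corpus):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy corpus standing in for the preprocessed tweets and their labels.
docs = ["good happy great", "awful sad bad", "great fun good",
        "bad terrible sad", "happy fun great", "terrible awful bad"]
labels = [1, 0, 1, 0, 1, 0]

# TF-IDF features feeding a logistic regression, scored with 3-fold CV.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression())
scores = cross_val_score(clf, docs, labels, cv=3, scoring="accuracy")
mean, std = scores.mean(), scores.std()
```

The SVM baseline is obtained by swapping `LogisticRegression` for `sklearn.svm.LinearSVC`.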
Table 4.5: Results of the CNN variations developed compared against baseline classifiers.
From the results of the models described in Table 4.5 it is easy to note that the size of
the embedding for the word vectors greatly influences the accuracy of the model: there is an
increase of 3.5 percentage points in accuracy just from increasing the dimensionality. However,
the improvement in classification comes at the cost of an increase in the total number of
parameters and in training time, as shown in Table 4.3.
The CNN200 variation performs very comparably to the logistic regression model, which
is closely aligned with previous work [Kim14], although the SVM with a linear kernel still
reasonably outperforms the other models evaluated. The main advantages of the SVM and
LogReg models in this scenario are the ease of implementation and of data preparation for
classification. However, as the dataset grows larger, it is expected that the CNN model will rival
and eventually beat the SVM benchmark. Also, the need for the traditional models to work with
sparse matrices makes them hard to scale.
5
Conclusion
In this work, a full end-to-end framework was presented to collect short and informal
texts from the web, specifically from the microblogging social network Twitter via a connection
to its real-time Streaming API, and to evaluate the performance of a deep learning-based
classifier for labeling tweets as either positive or negative.
One of the major drawbacks of using a convolutional neural network for classification
is the complexity of the model. The number of parameters to estimate is fairly high even for a
shallow network that does not run in a production environment, leading to a very high training
time that does not scale on CPU, even for such a small dataset. To take full advantage of a
CNN model, it is necessary to leverage GPUs, which can increase the training speed of a CNN
by a factor of up to 100 [SSB10].
Also, this kind of approach does not necessarily seem to be the best choice for small
datasets. On a dataset on the order of thousands of examples, an SVM with a linear kernel
achieved the highest accuracy presented while using a simple representation of the data: a
uni-bigram bag-of-words (BOW) model with TF-IDF vectorization.
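A uni-bigram TF-IDF representation of this kind can be configured as follows (the three documents are purely illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie", "bad movie", "very good"]

# ngram_range=(1, 2) extracts both unigrams and bigrams before TF-IDF weighting.
vec = TfidfVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(docs)   # a sparse (n_docs, n_features) matrix
```

Here the vocabulary holds four unigrams plus three bigrams, so `X` has seven columns; the sparsity of such matrices is exactly the scaling concern raised above.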
The main contribution of this work, however, is the study of the presented models'
performance for classification of this particular kind of text data, building on previous work in
the field both in general text classification [Kim14] and on Twitter itself [TWQ+ 14] [TWY+ 14].
While some of the most accurate classifiers are built on top of extensive feature engineering
[MKZ13], the presented model shifts the focus to letting the model itself learn the best
possible inner representation of the data.
This approach is highly sensitive to the amount of training data, both for the pre-trained
unsupervised word vectors and for the classifier itself. It is expected that a CNN-based
approach will achieve state-of-the-art performance as the dataset grows larger; moreover, it is
highly scalable in production environments that leverage GPUs, and it does not need a very
large sparse matrix for data representation.
References
[BDVJ03] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural
probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–
1155, 2003.
[BF10] Luciano Barbosa and Junlan Feng. Robust sentiment detection on twitter from
biased and noisy data. In Proceedings of the 23rd International Conference on
Computational Linguistics: Posters, pages 36–44. Association for Computational
Linguistics, 2010.
[Dom12] Pedro Domingos. A few useful things to know about machine learning. Commun.
ACM, 55(10):78–87, October 2012.
[ES06] Andrea Esuli and Fabrizio Sebastiani. Sentiwordnet: A publicly available lexical
resource for opinion mining. In Proceedings of LREC, volume 6, pages 417–422.
Citeseer, 2006.
[FFFP07] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from
few training examples: An incremental bayesian approach tested on 101 object
categories. Computer Vision and Image Understanding, 106(1):59–70, 2007.
[GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. Book in
preparation for MIT Press, 2016.
[GBH09] Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using
distant supervision. CS224N Project Report, Stanford, 1:12, 2009.
[GL14] Yoav Goldberg and Omer Levy. word2vec explained: deriving Mikolov et al.'s
negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.
[GSO+ 11] Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills,
Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A
Smith. Part-of-speech tagging for twitter: Annotation, features, and experiments.
In Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies: short papers-Volume 2, pages 42–47.
Association for Computational Linguistics, 2011.
[HL04] Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In Proceed-
ings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, KDD '04, pages 168–177, New York, NY, USA, 2004. ACM.
[Joa98] Thorsten Joachims. Text categorization with support vector machines: Learning
with many relevant features. In European conference on machine learning, pages
137–142. Springer, 1998.
[JSL+ 16] Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng
Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff
Hughes, and Jeffrey Dean. Google's multilingual neural machine translation
system: Enabling zero-shot translation. CoRR, abs/1611.04558, 2016.
[Kim14] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint
arXiv:1408.5882, 2014.
[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification
with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou,
and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems
25, pages 1097–1105. Curran Associates, Inc., 2012.
[LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature,
521(7553):436–444, 2015.
[LGRN09] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional
deep belief networks for scalable unsupervised learning of hierarchical representations.
In Proceedings of the 26th annual international conference on machine
learning, pages 609–616. ACM, 2009.
[MKB+ 11] Tomas Mikolov, S. Kombrink, L. Burget, J. Černocký, and S. Khudanpur. Extensions
of recurrent neural network language model. In 2011 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pages 5528–5531, May
2011.
[MKZ13] Saif M. Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. Nrc-canada: Building
the state-of-the-art in sentiment analysis of tweets. In Proceedings of the seventh
international workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta,
Georgia, USA, June 2013.
[MPL14] Fred Morstatter, Jürgen Pfeffer, and Huan Liu. When is it biased?: assessing the
representativeness of twitter's streaming api. In Proceedings of the 23rd International
Conference on World Wide Web, pages 555–556. ACM, 2014.
[MPLC13] Fred Morstatter, Jürgen Pfeffer, Huan Liu, and Kathleen M Carley. Is the sample
good enough? comparing data from twitter's streaming api with twitter's firehose.
arXiv preprint arXiv:1306.5204, 2013.
[MSC+ 13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Dis-
tributed representations of words and phrases and their compositionality. In Ad-
vances in neural information processing systems, pages 3111–3119, 2013.
[MYZ13] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in
continuous space word representations. In HLT-NAACL, volume 13, pages 746–751,
2013.
[OKA10] Brendan O'Connor, Michel Krieger, and David Ahn. Tweetmotif: Exploratory
search and topic summarization for twitter. In ICWSM, 2010.
[PS10] Peter Prettenhofer and Benno Stein. Cross-language text classification using struc-
tural correspondence learning. In Proceedings of the 48th Annual Meeting of
the Association for Computational Linguistics, pages 1118–1127. Association for
Computational Linguistics, 2010.
[PSM14] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global
vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.
[RDS+ 15] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean
Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexan-
der C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge.
International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[Ron14] Xin Rong. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738,
2014.
[SSB10] Dominik Scherer, Hannes Schulz, and Sven Behnke. Accelerating large-scale con-
volutional neural networks with parallel graphics multiprocessors. In International
Conference on Artificial Neural Networks, pages 82–91. Springer, 2010.
[SVL14] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with
neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and
K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27,
pages 3104–3112. Curran Associates, Inc., 2014.
[TWQ+ 14] Duyu Tang, Furu Wei, Bing Qin, Ting Liu, and Ming Zhou. Coooolll: A deep
learning system for twitter sentiment classification. In Proceedings of the 8th
International Workshop on Semantic Evaluation (SemEval 2014), pages 208–212,
2014.
[TWY+ 14] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. Learning
sentiment-specific word embedding for twitter sentiment classification. In ACL (1),
pages 1555–1565, 2014.
[ZW15] Ye Zhang and Byron C. Wallace. A sensitivity analysis of (and practitioners' guide
to) convolutional neural networks for sentence classification. CoRR, abs/1510.03820,
2015.
Appendix
A
TweetCrawler
This appendix contains the source code used for connecting to Twitter's Streaming API,
using the smiley faces in Table 3.1 as keywords for the track parameter and English as the target
language. To avoid storing every tweet (which comes with embedded metadata) in memory, the
class uses a buffer that, whenever it reaches full capacity (1000 tweets by default), dumps its
contents to disk and is then cleared.
# coding=utf-8
import sys
import codecs
import twitter
import ujson as json
import cPickle as pickle
import time
from datetime import datetime
from random import choice
from string import hexdigits


class TweetCrawler:
    LANG = "lang"
    TEXT = "text"
    ENGLISH = "en"
    NEW_LINE = "\n"

    # (the constructor and the buffer-flushing methods were lost in the
    # extraction of this listing)

    def get_keys(self):
        with open(self.key_file_path, "r") as keyfile:
            keys = [key.strip() for key in keyfile]
        return keys

    def get_api_connection(self):
        keys = self.get_keys()
        return twitter.Api(keys[0], keys[1], keys[2], keys[3])

    def init_stream(self):
        api = self.get_api_connection()
        self.stream = api.GetStreamFilter(track=self.keywords)

    def __random_prefix__(self):
        return "".join([choice(hexdigits) for i in xrange(6)])

    def is_buffer_full(self):
        return len(self.buffer) >= self.buffer_limit

    def get_file_name(self):
        now = time.time()
        dt = datetime.fromtimestamp(now)
        ts = "%d%d%dT%d%d" % (dt.year, dt.month, dt.day, dt.hour, dt.minute)
        fname = self.__random_prefix__() + "_" + ts + "_tweets.json"
        return fname


def main():
    # POS and NEG are the positive and negative smiley lists (Table 3.1).
    keywords = POS + NEG
    limit = sys.argv[-1]
    crawler = TweetCrawler("./crawler/keys.txt",
                           keywords=keywords,
                           buffer_limit=1000)
    crawler.crawl_tweets()


if __name__ == "__main__":
    main()
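The buffer-flushing behaviour described above was partly lost in the extraction of the listing; a minimal, self-contained sketch of that logic (all names here are hypothetical, not the thesis's actual code) is:

```python
class BufferedWriter:
    """Sketch: collect items in memory, dump a batch once the buffer is full."""

    def __init__(self, limit=1000):
        self.limit = limit
        self.buffer = []
        self.flushed = []  # stands in for the JSON files written to disk

    def add(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.limit:
            self.flush()

    def flush(self):
        # In the real crawler this would serialize the batch to a new file.
        if self.buffer:
            self.flushed.append(list(self.buffer))
            self.buffer = []

w = BufferedWriter(limit=3)
for i in range(7):
    w.add(i)
```

After seven additions with a limit of three, two batches have been flushed and one item remains buffered.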
B
Preprocessing
B.1 Parser
The following script is the implementation used in this work for preprocessing a tweet.
It tokenizes the tweet, filters unwanted words (stopwords, punctuation, short tokens), removes
smiley faces, applies placeholders for normalization and pads the sequence.
import ujson as json
import re

from twokenize import tokenizeRawTweetText
from nltk.corpus import stopwords
from string import digits, punctuation

from hyper_params import SEQUENCE_LENGTH

urlrxp = re.compile(
    r"^(?:http|ftp)s?://"                                # http:// or https://
    r"(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+"
    r"(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|"               # domain...
    r"localhost|"                                        # localhost...
    r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})"               # ...or ip
    r"(?::\d+)?"                                         # optional port
    r"(?:/?|[/?]\S+)$", re.IGNORECASE
)

CUT = 3
MENTION = "<mnt/>"
URL = "<url/>"
PAD = "<pad/>"

# EN_STOP and BAD_CHARS reconstructed from the imports above; EMOJIS (the
# smiley faces of Table 3.1) is defined elsewhere in the project.
EN_STOP = set(stopwords.words("english"))
BAD_CHARS = set(digits + punctuation)


def is_url(token):
    return urlrxp.match(token) is not None


def only_digits_or_punctuation(token):
    return all(i in BAD_CHARS for i in token)


def is_too_short(token):
    return len(token) < CUT


def pad(sequence):
    len_seq = len(sequence)
    if len_seq > SEQUENCE_LENGTH:
        return sequence[-SEQUENCE_LENGTH:]
    padded = [PAD] * SEQUENCE_LENGTH
    padded[-len_seq:] = sequence
    return padded


def should_be_kept(token):
    token = token.lower()
    return not (token in EN_STOP or
                token in EMOJIS or
                only_digits_or_punctuation(token) or
                is_too_short(token))


def placeholder_and_lower(token):
    if token[0] == "@":
        return MENTION
    elif is_url(token):
        return URL
    else:
        return token.lower()
import os

import numpy as np

TWEET_SUFFIX = "tweets.json"
TEXT = "text"
LABEL = "label"
PAD = "<pad/>"


class TweetCorpusReader(object):
    """
    Class for reading efficiently the Twitter corpora.
    Reads and parses tweets on the fly, returning
    only those that have at least two tokens.
    """

    def __init__(self, data_path, text_only=True, w2i=None, debug=False,
                 should_pad=True):
        super(TweetCorpusReader, self).__init__()
        self.data_path = data_path
        self.text_only = text_only
        self.embedding = False
        self.w2i = w2i
        self.unk = None if w2i is None else len(self.w2i)
        self.debug = debug
        self.should_pad = should_pad

    # (load_and_process, __pad__ and __get_output__ were lost in the
    # extraction of this listing)

    def __iter__(self):
        json_files = []
        for json_file in os.listdir(self.data_path):
            if json_file.endswith(TWEET_SUFFIX):
                json_files.append("{0}/{1}".format(self.data_path, json_file))
        for json_file in json_files:
            with open(json_file, "r") as f_in:
                tweets_in_file = [self.load_and_process(doc) for doc in f_in]
            if self.w2i is None:
                for tweet, label in tweets_in_file:
                    if len(tweet) > 1:
                        yield tweet if self.text_only else (tweet, label)
            else:
                for tweet, label in tweets_in_file:
                    if len(tweet) > 1:
                        indexes = self.__pad__(tweet)
                        yield (np.array(indexes), self.__get_output__(label))
            if self.debug:
                break
C
SingleLayerCNN
This appendix contains the Python code for the implementation provided for building the
convolutional neural network model for sentiment classification on short texts.
from gensim.models.word2vec import Word2Vec
from keras.models import Model
from keras.layers import (Dense, Dropout, Embedding, Flatten, Input, Merge,
                          Convolution1D, MaxPooling1D)
from types import ListType

import hyper_params
import ujson as json
from time import asctime


class SingleLayerCNN(object):
    """docstring for SingleLayerCNN."""

    def __init__(self, seq_len, emb_dim, filter_len, feature_maps,
                 w2v_model_path):
        super(SingleLayerCNN, self).__init__()
        self.seq_len = seq_len
        self.emb_dim = emb_dim
        self.filter_len = filter_len
        self.feature_maps = feature_maps
        self.w2v = None
        self.w2v_model_path = w2v_model_path
        self.model = self.__build_cnn__(w2v_model_path)
        assert emb_dim == self.w2v.vector_size, \
            "word2vec expects vector_size=%d, actual=%d" % (self.w2v.vector_size,
                                                            emb_dim)

    # (the training method was lost in the extraction; only the tail of its
    # returned dictionary survived:)
    #         "w2v": self.w2v_model_path,
    #         "hist": hist
    #     }

    def __build_conv_layer__(self):
        layer_name = "embedding_input"
        emb_input = Input(shape=(self.seq_len, self.emb_dim), dtype="int32",
                          name=layer_name)
        conv_layers = []
        assert type(self.filter_len) is ListType, \
            "filter_len=%s is not a list" % (str(self.filter_len))
        # (construction of the convolution/pooling branches elided)

    # (only the tail of __build_cnn__ survived the extraction:)
        embedding = Embedding(
            input_dim=vocab_size,
            output_dim=self.emb_dim,
            input_length=self.seq_len,
            weights=[self.w2v.syn0])(main_input)
        conv = conv(embedding)
        conv = Dropout(.5)(conv)
        conv = Dense(2, activation="softmax")(conv)
        model = Model(input=main_input, output=conv)
        model.compile(loss="categorical_crossentropy", optimizer="sgd",
                      metrics=["accuracy"])
        return model

    def word2index(self):
        return {word: i for i, word in enumerate(self.w2v.index2word)}