
Guilherme Palma Peixoto

A DEEP LEARNING-BASED APPROACH FOR SENTIMENT ANALYSIS ON SHORT MICROBLOGGING TEXTS

B.Sc. Dissertation

Federal University of Pernambuco


secgrad@cin.ufpe.br
www.cin.ufpe.br/~graduacao

RECIFE
2016
Federal University of Pernambuco
Center for Informatics
B.Sc in Computer Science

Guilherme Palma Peixoto

A DEEP LEARNING-BASED APPROACH FOR SENTIMENT ANALYSIS ON SHORT MICROBLOGGING TEXTS

A B.Sc. Dissertation presented to the Center for Informatics of Federal University of Pernambuco in partial fulfillment of the requirements for the degree of Bachelor in Computer Science.

Advisor: Tsang Ing-Ren

RECIFE
2016
Acknowledgements

To my adviser Tsang Ing-Ren, who introduced me in my first year of undergraduate studies to what I'm passionate about studying and has since motivated me to always keep learning and developing myself,
to the Center for Informatics at Federal University of Pernambuco and Stevens Institute of Technology, which sharpened me as a man and professional and made me learn the hardest way that you have to fight to achieve your goals,
to my friends and colleagues at the Data Mining and Machine Learning Lab, especially Justin Sampson, Fred Morstatter and Dr. Huan Liu, for accepting and having me in their lab; their knowledge and guidance taught me how to learn and have greatly helped me,
to my friends and coworkers at In Loco Media, especially Alan Gomes, who believed in me and has helped me to develop,
to my friends James, Nick, Sam and Tyler, none of whom needed to do anything for me but who were friends in a tough and lonely time and made me belong there,
to Mike, Barbara, Ryan and Lilly, who took a stranger into their house and gave me a family and a second home,
to my friends at "Lontra da Lombra" for everything we've been through in the college years and for everything we'll still go through, especially Duhan, Lucas, Mateus and Vinícius,
to my cousin and older brother Alexandre, gone way too soon, with whom I grew up, whom I always looked up to, and who made part of the man I am today and made me learn to appreciate every single day of my life,
to my dear parents Mara and Eduardo, who always gave me just the right blend of soft and tough love and supported, educated and prepared me for the outside world,
to my lovely companion, girlfriend and best friend Isadora, who has always been there for me, supported me and never quit believing in me,
and to my brother Gustavo, for whom I will live and die - you're my hero:

Thank you.
He who is not courageous enough to take risks will accomplish nothing in life.
MUHAMMAD ALI
Resumo

A popular way of sharing content within the Web 2.0 context is through microblogging websites, on which users post their thoughts as short and succinct texts. The most popular microblogging website is Twitter, which limits its users to posts of at most 140 characters. These texts are extremely opinionated, which has led to industry interest in analyzing what the public says about brands and products on this network. The purpose of this work is to develop a tool that performs sentiment analysis on tweets using Deep Learning techniques. First, a module for extracting and processing unstructured data from Twitter is developed, using public APIs and natural language processing techniques. Next, a binary classification algorithm for text fragments is developed, using convolutional neural networks for classification and for transforming words into real-valued vectors. Finally, a statistical analysis of the performance of the developed algorithm is carried out, along with a comparative case study against other algorithms used for the classification of short, informal and opinionated texts.

Keywords: Sentiment analysis, text classification, deep learning, natural language processing, Twitter, microblogging, social networks, opinion mining.
Abstract

With the spread of Web 2.0, it has become popular to share content on microblogging websites, where users share their thoughts in short and succinct texts. One of the most popular microblogging websites is Twitter, which limits its users to posts of at most 140 characters. Those short texts are highly informal and usually express opinions, which has led to industry interest in mining those opinions in order to better understand how brands and products are perceived by the market. The purpose of this work is to develop a framework that performs sentiment analysis on tweets using Deep Learning techniques. First, a module for extracting non-structured data from Twitter (through its public API) is developed, followed by a pre-processing phase that uses natural language processing techniques. Then, a Deep Learning approach for binary text classification and word embedding using convolutional neural networks is presented. Lastly, a statistical analysis of the algorithm's performance is shown, along with a comparative study of how other, more traditional algorithms perform in this short and informal text classification context.

Keywords: Sentiment analysis, text classification, deep learning, natural language processing,
Twitter, microblogging, social networks, opinion mining.
List of Figures

2.1 Basic multi-layer perceptron - an Artificial Neural Network (ANN) architecture with fully connected input and hidden layer.
2.2 Illustrative example of a convolutional filter operation over a local section of a 3-dimensional input feature vector.
2.3 Illustrative example of a pooling operation over an intermediate vector.
2.4 Skip-gram model architecture.

3.1 Twitter Stream API flow.
3.2 Example of tweets with smiley faces.
3.3 High level abstraction of the implemented convolutional neural network model.

4.1 Distribution of the length of tweets with outliers identified through the interquartile range method.
4.2 History of models' logarithmic loss over 10 epochs of training.
4.3 History of models' training accuracy over 10 epochs of training.
List of Tables

3.1 Sentiment polarity to emoji mapping.
3.2 Mention and hyperlinks normalization to placeholders.

4.1 word2vec parameters table.
4.2 Common parameters for every Convolutional Neural Network (CNN) implemented model.
4.3 Description of total training time for each CNN.
4.4 Number of parameters for each layer of the CNN models.
4.5 Results of the CNN variations developed compared against baseline classifiers.
List of Acronyms

AI Artificial Intelligence
ANN Artificial Neural Networks
BOW Bag-of-words
CV Computer Vision
CNN Convolutional Neural Network
DAG Directed Acyclic Graph
IQR Interquartile Range
LR Lexical Resource
LogReg Logistic Regression
LSTM Long Short-Term Memory
NLP Natural Language Processing
OCR Optical Character Recognition
PMI Pointwise Mutual Information
POS Part-of-Speech Tagging
RNN Recurrent Neural Network
SGD Stochastic Gradient Descent
SO Semantic Orientation
SVM Support Vector Machine
TF-IDF Term Frequency-Inverse Document Frequency
Contents

1 Introduction
1.1 Objectives
1.2 Document Structure

2 Basic Concepts
2.1 Natural Language Processing
2.2 Machine Learning
2.3 Sentiment Analysis
2.4 Deep Learning
2.4.1 Convolutional Neural Networks
2.4.2 Word Embedding

3 Development
3.1 Data Collection
3.1.1 Distant Supervision
3.2 Data Pre-Processing
3.2.1 Tokenization
3.2.2 Normalization
3.3 Word Embedding
3.4 Convolutional Neural Network
3.5 Baseline models

4 Results and Analysis
4.1 Environment
4.2 Exploratory Analysis
4.3 Word Vectors
4.4 Classifiers
4.4.1 CNNs Model Complexity
4.4.2 Performance Evaluation

5 Conclusion
5.1 Future Work

References

Appendix

A TweetCrawler

B Preprocessing
B.1 Parser
B.2 Corpus Reader

C SingleLayerCNN

1 Introduction

Since Web 2.0 made users the main producers of its content, data has grown at an exponential rate: it is estimated that over 90% of the world's data was generated over the last two years alone, according to IBM¹. One of the most important mechanisms in this new age of information is inter-device operability: content is no longer created only on notebooks and desktops, but on many devices connected to the World Wide Web, especially smartphones, which keep people connected for most of the day while they consume and generate content.
Some of the most popular applications within Web 2.0 have been the social networks, in which users can generate and share large amounts of content as text, photos and videos. The surge of new social networks and platforms on which users can publish their own opinions, texts and thoughts allowed for the creation of a new blogging medium known as microblogging, in which users share their opinions and thoughts in short and usually very informal texts, much like regular SMS messages. Within microblogging, one particular website has stood out: Twitter, in which users are limited to 140 characters per post. Since many of those tweets contain short opinions on products, brands and other subjects of interest, this community has attracted considerable interest from industry and academia in mining this massive amount of data generated on a daily basis.
One of the main branches of opinion mining is sentiment analysis, which focuses on determining the polarity of a text as positive or negative (or, in some settings, neutral). Many techniques were developed and a lot of research was done with the purpose of analyzing these short microblogging texts, usually by applying a machine learning classifier to label a text as either positive or negative. However, many of the most used vector representations of text data for classifiers are sub-optimal, leading to very sparse representations that are hard to scale to production. New approaches have since emerged with the intent of reducing the need for sparse representations and scaling classification for the big data era, while still trying to maintain the accuracy of the widespread traditional machine learning techniques.
With the advance of hardware technology, especially in RAM and GPU development, models based on deep, multi-layered feed-forward neural networks started to draw attention from the research community. This branch of machine learning algorithms, called deep learning, has since dominated academia. Although its main concept has existed for approximately 20 years, it was only with the recent major advances in computational power that it became possible to develop efficient, scalable implementations for the current big data era.

¹ Source: https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html

Deep learning was initially adopted in computer vision and image recognition applications, but has since found its way into natural language processing. Its main contribution to the field has been to find ways to build more representative and informative text representations while also reducing the sparsity of the previously adopted vector representations of text, which led to dense and continuous representations of words in a low-dimensional embedding space (a few hundred dimensions, compared to millions in sparse representations). These new ways to represent text data then led to the development of new classifier architectures, especially deep, feed-forward neural networks, that support those representations. This makes it very important to research new algorithms that use the new techniques to achieve state-of-the-art performance while also leveraging the physical hardware resources that have been greatly improved.

1.1 Objectives
The main objective of this work is the implementation of a binary classifier with a deep learning-based approach for text classification, specifically focused on short and informal texts within the microblogging context.
This work also presents the implementation of a full framework that collects text data from Twitter (through its publicly available APIs) and performs the required data pre-processing steps using natural language processing techniques. The framework developed in this work is also compared with other widely used machine learning classifiers for sentiment analysis, and the results found are studied and interpreted.

1.2 Document Structure


The remainder of this work is organized as follows, in order to present the full picture of its contributions in the most suitable way.
The "Basic Concepts" chapter presents the basic concepts and theoretical background used in the making of this work. The "Development" chapter presents the framework technology and the main model architecture developed. The "Results and Analysis" chapter presents a statistical analysis and a comprehensive study of the collected data and developed model, also comparing its performance against other well-established classifiers. Lastly, the "Conclusion" chapter brings the conclusion and final remarks of the work, as well as intended future work.

2 Basic Concepts

This chapter presents the main technical concepts used throughout this work. It discusses the fields of natural language processing and supervised machine learning. The introductions to those fields are followed by a deeper study of deep learning, with a strong emphasis on applications for textual datasets. Lastly, the techniques of word embedding and convolutional neural networks are discussed within the context of deep learning.

2.1 Natural Language Processing


Natural Language Processing (NLP) is the field of computer science that devotes its research efforts to understanding human-generated content expressed through text and human languages. The main challenge in NLP is understanding, which means the algorithm should be able to recognize many aspects of the text, such as intent, entities mentioned or associated, topic, polarity (opinionated or informative), and question recognition and/or answering.
For many years, NLP focused its main efforts on statistical language models. Those models focused on generating text based on counting the occurrences of n-grams¹. They were widely used in grammar-related NLP tasks, such as probabilistic parsing, named entity recognition and tokenization. However, with the rapid growth of academic interest in deep learning, NLP has taken advantage of the widespread usage and development of neural networks. Many tasks which formerly employed a more traditional probabilistic approach started to pivot towards using recurrent, deep feed-forward and Long Short-Term Memory (LSTM) neural networks [MKB+11][BDVJ03]. Tasks such as machine translation (the problem of translating a text from a source language to a target language) and sentiment analysis have benefited from new state-of-the-art neural network architectures [SVL14][Kim14], surpassing the statistical approaches and commonly used classification models such as Support Vector Machines (SVM) and logistic regression.

¹ N-grams are sequences of N tokens which appear consecutively across corpora. E.g.: "cat sit walk" is a trigram (3-gram), while "cat sit" is a bigram (2-gram).

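As a minimal sketch of the n-gram counting behind such statistical language models, the following Python snippet extracts n-grams from a list of tokens and estimates a bigram probability by maximum likelihood. The toy sentence and the function names `ngrams` and `bigram_probability` are illustrative assumptions and are not part of the framework developed in this work.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams (as tuples) that appear consecutively in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy corpus (made up for illustration), tokenized on whitespace.
tokens = "the cat sat on the mat the cat slept".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(ngrams(tokens, 2))


def bigram_probability(prev, word):
    """Maximum-likelihood estimate of P(word | prev) from raw counts."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]


print(ngrams("cat sit walk".split(), 3))   # the trigram from the footnote
print(bigram_probability("the", "cat"))    # count(the, cat) / count(the)
```

A real language model would also smooth these estimates so that unseen n-grams do not receive zero probability; that step is omitted here.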

2.2 Machine Learning


Machine learning is a branch of computer science which studies algorithms that are capable of learning without being explicitly programmed how to learn. Although the proper definition of learning in this context is still loose, it is usually attributed to many variations of pattern recognition tasks. Instead of the classic, maybe romanticized view of Artificial Intelligence (AI) in which computers are able to perform human-like reasoning, machine learning algorithms are designed to perform, in an automated fashion, tasks that humans learned to do through observation, experience and expertise. For example, it is easy for a human to distinguish between a bird and a chair, but not as easy for an algorithm. However, while it is infeasible for any group of human beings to classify an extremely large number of images of chairs and birds, say 1 billion images, it is becoming increasingly easy for a Computer Vision (CV) algorithm to classify such a dataset in a matter of minutes.
There are two main categories of learning tasks within machine learning: unsupervised and supervised. In unsupervised tasks, there are no annotations for a given set of observations; the algorithm is responsible for uncovering hidden patterns based on the observations only. Common unsupervised tasks are clustering and anomaly detection. It should be noted, however, that the focus of this work is on an instance of a supervised learning problem.
Supervised tasks (or algorithms) are those in which, for each observation, the value of the target variable is known a priori. Supervised learning branches mainly into classification and regression tasks. In classification tasks, for each observation $x$ in a given dataset $D$ there is a categorical (discrete) target class $c_x$ from a set of classes $C$, such that $c_x \in C$. Therefore, given a set of observations $D'$, the algorithm should be able to assign a class $c \in C$ to each observation $x' \in D'$. A very common approach is to have two separate datasets, or to split a single dataset, for training and testing. The algorithm learns its weights and constructs its decision regions on top of the training set, and its performance is evaluated on the previously unseen test dataset. Examples of classification tasks are Optical Character Recognition (OCR), text or image classification, and fraud or spam detection.
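The train/test protocol described above can be sketched as follows. This is only an illustration under stated assumptions: the one-dimensional toy dataset is made up, and the nearest-centroid rule stands in for any classifier; neither is used elsewhere in this work.

```python
import random


def train_test_split(observations, labels, test_fraction=0.25, seed=0):
    """Shuffle the indices and split the dataset into training and test portions."""
    indices = list(range(len(observations)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(idx):
        return [observations[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(test_idx)


def fit_centroids(xs, cs):
    """Learn one centroid (mean value) per class from the training set."""
    sums, counts = {}, {}
    for x, c in zip(xs, cs):
        sums[c] = sums.get(c, 0.0) + x
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda c: abs(x - centroids[c]))


# Hypothetical dataset D with classes C: small values are "negative", large are "positive".
D = [0.1, 0.3, 0.2, 0.4, 0.0, 0.5, 2.1, 2.4, 1.9, 2.2, 2.0, 2.5]
C = ["negative"] * 6 + ["positive"] * 6

(train_x, train_c), (test_x, test_c) = train_test_split(D, C)
centroids = fit_centroids(train_x, train_c)

# Performance is measured only on observations that were not seen during training.
accuracy = sum(predict(centroids, x) == c for x, c in zip(test_x, test_c)) / len(test_x)
print(accuracy)
```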
In regression tasks, the target variable is continuous, such that for each pair $(x, t)$ of observation and target, $t$ is defined in some interval $T$. Therefore, the goal of regression learners is to minimize the difference between the predicted outcome $t_{outcome}$ for a given observation $x$ and the true value $t_{expected}$. Some regression tasks are the prediction of stock market, product and real estate prices, and credit scoring.

2.3 Sentiment Analysis


Sentiment analysis is an instance of a supervised machine learning task. The goal of sentiment analysis is to identify the polarity of a text within given corpora² of documents. The granularity of the sentiment analysis depends on the needs of the problem. For a given task, the polarity might be defined at the corpora or corpus level, at the document level or even at the sentence or word level. The most common targets of sentiment analysis are the sentence and document levels.

² Corpora is the Latin plural form of corpus, which is a set of documents.

In the early developments of sentiment analysis research, the techniques often employed some form of Lexical Resource (LR) in order to achieve classification. Those methods relied primarily on word-level polarity defined beforehand, using those LRs as features for classification. Turney [Tur02] defined an unsupervised method to infer a word's Semantic Orientation (SO) based on the difference between the word's Pointwise Mutual Information (PMI) with the word "excellent" (which has a naturally positive SO) and the word's PMI with the word "poor" (which has a naturally negative SO), where PMI is defined as:

$$\mathrm{PMI}(w_1, w_2) = \log_2 \frac{p(w_1, w_2)}{p(w_1)\,p(w_2)}$$
Where $p(w)$ is the probability of occurrence of the word $w$ and $p(w_1, w_2)$ is the probability of the words $w_1$ and $w_2$ co-occurring. The semantic orientation of the whole sentence was then computed in a similar fashion, based on the difference between the phrase's PMI with the word "excellent" and its PMI with the word "poor".
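To make the PMI definition concrete, the sketch below estimates the probabilities from raw counts and computes a semantic orientation as the difference of two PMI values. All counts are hypothetical numbers chosen for illustration, and the helper names are assumptions; this is not Turney's original implementation.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log2(p(x, y) / (p(x) * p(y))), with probabilities estimated from counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


def semantic_orientation(co_excellent, co_poor, count_w, count_excellent, count_poor, total):
    """SO(w) = PMI(w, "excellent") - PMI(w, "poor")."""
    return (pmi(co_excellent, count_w, count_excellent, total)
            - pmi(co_poor, count_w, count_poor, total))


# Hypothetical counts over 1000 text windows: the word w appears in 50 of them,
# "excellent" and "poor" in 100 each; w co-occurs 20 times with "excellent" and 5 with "poor".
so = semantic_orientation(co_excellent=20, co_poor=5, count_w=50,
                          count_excellent=100, count_poor=100, total=1000)
print(so)  # a positive value suggests a positive orientation
```

Note that a co-occurrence count of zero makes the logarithm undefined, so practical implementations need some form of smoothing, which this sketch leaves out.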
Hu and Liu's work [HL04] was also one of the first breakthroughs that awakened academic interest in the field. They identified words that were likely to be opinionated by starting from a seed list and using Part-of-Speech Tagging (POS) to identify adjectives that were either synonyms of a seed word (and therefore should have the same polarity) or antonyms of a seed word (and should have the opposite polarity). The sentence-level classification was then based on a naive running count: the difference between the number of likely positive words and the number of likely negative words. The emergence of lexicon-based methods also prompted the creation of polarity-specific LRs, such as the SentiWordNet [ES06] polarity dictionary.
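The counting rule described above can be sketched in a few lines. The two word lists below are tiny hypothetical lexicons, and the sketch deliberately skips the seed expansion and POS tagging steps, so it should be read as a simplified illustration rather than a reproduction of the method in [HL04].

```python
# Hypothetical polarity lexicons (in practice these come from a lexical resource).
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def classify(sentence):
    """Label a sentence by the difference between positive and negative word counts."""
    tokens = sentence.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("I love this great phone"))
print(classify("bad battery and poor screen"))
print(classify("good camera but bad battery"))
```

The last sentence shows the main weakness of a plain count: one positive and one negative word cancel out, regardless of which opinion matters more to the writer.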
Lexicon-based algorithms were followed by then state-of-the-art classification models that used cleverer counting mechanisms, primarily Term Frequency-Inverse Document Frequency (TF-IDF). For a word $i$ in a document $j$, the TF-IDF score is defined as the product of the term frequency $tf_{i,j}$, defined as the number of occurrences of word $i$ in document $j$, and the inverse document frequency $idf_i$, defined as $\log_2 \frac{N}{df_i}$, where $N$ is the total number of documents and $df_i$ is the number of documents containing the word $i$. This method accounts both for the occurrence of the word within the document itself and for the relevance of the word within the whole corpus, given that some words (prepositions, for example) generally appear more often than others and thus carry little discriminative value.
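The TF-IDF definition above translates directly into code. The following sketch assumes whitespace tokenization and a made-up four-document corpus; the function name `tf_idf` is illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dict per document, with score = tf * log2(N / df)."""
    n_docs = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    # df[w]: number of documents that contain the word w at least once.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # tf[w]: occurrences of w in this document
        scores.append({w: tf[w] * math.log2(n_docs / df[w]) for w in tf})
    return scores


# Hypothetical corpus.
docs = [
    "the movie was great",
    "the movie was boring",
    "the plot of the movie",
    "the food was great",
]
scores = tf_idf(docs)
print(scores[1]["boring"])  # appears in one document only, so it scores high
print(scores[2]["the"])     # appears in every document, so idf = log2(1) = 0
```

The word "the" receives a score of zero even where it occurs twice, which is the behaviour the text describes for frequent words with no discriminative value.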
The document is then defined as a vector in a Bag-of-words (BOW) model, in which the document is represented as an unordered collection of N-grams. This vector is defined in the $\mathbb{R}^{V^N}$ space, where $V$ is the cardinality of the vocabulary and $N$ is the size of the N-gram used to represent the tokens within the BOW model. That representation results in very sparse vectors of length $V^N$ for every document in the corpus, independent of its true length (in number of words). Even with very sparse vector representations of documents, logistic regression classifiers and SVMs [Joa98] showed great performance in supervised text classification, thus becoming the state-of-the-art models.

2.4 Deep Learning


In the early developments of pattern recognition techniques, researchers usually tackled their problems by hand-crafting the feature vector with domain-specific features, with the aid of problem specialists and experts. During the 1980s that trend started to change towards using multi-layer neural network architectures so that the network itself learns the data representation.
Deep learning has its foundations in Artificial Neural Network (ANN) architectures, which were designed to emulate the concept of the human brain for supervised classification and regression tasks. A basic ANN architecture is depicted in Figure 2.1: the network is built with layers, usually an input layer, a hidden layer and an output layer. In simple models, the input layer is fully connected to the hidden layer (with an arbitrary number of neurons) and the hidden layer is fully connected to the output layer (with N neurons, one for each of the N target variables for classification or estimation).

Figure 2.1: Basic multi-layer perceptron - an ANN architecture with fully connected input and hidden layer.

Source: The author.

In ANNs, the values fed into the network through the input layer are usually passed through a non-linear function in the neurons, so that the network produces better generalizations and adapts to more complex scenarios, thus being able to capture non-linear relationships and correlations. Although there are many branches of deep architectures, most of them derive from their parent ANN architectures. Regular deep neural networks are trained via Stochastic Gradient Descent (SGD), with the gradients used to adjust their weights computed via backpropagation. The networks are modeled either as a feed-forward network or as a Directed Acyclic Graph (DAG), with multiple hidden layers, each containing a very high number of hidden units.
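The forward pass of the basic fully connected architecture described in this section can be sketched as below. The layer sizes, the sigmoid non-linearity and the uniform weight initialization are assumptions chosen for illustration; training via SGD and backpropagation is not shown.

```python
import math
import random


def sigmoid(z):
    """Non-linear activation applied in each neuron."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per neuron, then the non-linearity."""
    return [sigmoid(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
            for neuron_w, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random weights (one row per neuron) and zero biases."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(42)
w_hidden, b_hidden = init_layer(3, 4, rng)   # 3 inputs -> 4 hidden neurons
w_out, b_out = init_layer(4, 2, rng)         # 4 hidden neurons -> 2 output neurons

x = [0.2, -1.0, 0.5]                         # hypothetical input feature vector
hidden = dense(x, w_hidden, b_hidden)
output = dense(hidden, w_out, b_out)
print(output)
```

Without the non-linearity, stacking the two layers would collapse into a single linear map, which is why the activation function is what lets the network capture non-linear relationships.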

2.4.1 Convolutional Neural Networks


The Convolutional Neural Network (CNN), the main architecture used in this work, is widely adopted in deep learning applications (especially in purely supervised environments), such as object detection and localization, scene parsing and labeling, speech recognition, and text, image and video classification.
Those networks are built so that multi-dimensional arrays can be handled as single inputs, such as images (two-dimensional arrays) and videos (three-dimensional arrays). CNNs are built through the concatenation of many different-purpose layers, usually some combination of stacked convolutional, pooling and dropout layers.
The core layer of a CNN is the convolutional layer. This layer is responsible for applying a convolutional filter over a local section of the input feature vector, computing the dot product between the features of the local section and the weights of the current filter; the resulting values form the filter's activation map. A CNN might have many convolutional layers with different filter strides in order to capture local patterns over multiple spans.
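A one-dimensional version of the filter operation is enough to illustrate the idea. In the sketch below the filter slides over the input and a dot product is taken at every position; the input values, the filter weights and the function name `conv1d` are illustrative assumptions.

```python
def conv1d(sequence, kernel, stride=1):
    """Slide `kernel` over `sequence`, taking the dot product at each position."""
    k = len(kernel)
    return [sum(kernel[j] * sequence[i + j] for j in range(k))
            for i in range(0, len(sequence) - k + 1, stride)]


signal = [1, 2, 3, 4, 5]       # hypothetical input feature vector
edge_filter = [-1, 0, 1]       # hypothetical filter weights

print(conv1d(signal, edge_filter))             # one value per local section
print(conv1d(signal, edge_filter, stride=2))   # a larger stride skips positions
```

For text, the same operation is typically applied over a window of consecutive word vectors instead of scalar values, so each filter responds to a short sequence of words.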

Figure 2.2: Illustrative example of a convolutional filter operation over a local section of
a 3-dimensional input feature vector.

Source: The author.

The convolutional layer is often followed by the pooling layer, which performs an aggregation operation over a subsection of the output of the convolutional layer, drastically reducing the size of the vector representation. The most common pooling operation is max-pooling, which takes the maximum value either over consecutive sections of an arbitrary size N of the vector or over the entire vector itself.
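Both variants of max-pooling mentioned above can be sketched with a single helper. The function name `max_pool` and the example vector are assumptions made for illustration.

```python
def max_pool(vector, size=None):
    """Max over consecutive windows of length `size`; size=None pools the entire vector."""
    if size is None:
        return [max(vector)]
    return [max(vector[i:i + size]) for i in range(0, len(vector), size)]


feature_map = [1, 5, 2, 8, 3, 4]   # hypothetical output of a convolutional layer

print(max_pool(feature_map, size=2))   # one maximum per window of two values
print(max_pool(feature_map))           # a single maximum over the whole vector
```

Pooling over the entire vector yields a fixed-size output regardless of the input length, which is convenient when the inputs are texts of varying length.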

Figure 2.3: Illustrative example of a pooling operation over an intermediate vector.

Source: The author.

The dropout layer is a fast and computationally cheap way to achieve (empirically) good regularization. In order to prevent overfitting, for each unit in the previous layer the dropout operation either sets its output to 0 with probability $p$ or keeps it unchanged with probability $1 - p$.
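A minimal sketch of the dropout operation at training time follows, applied to the outputs of the previous layer. The function name and the example values are assumptions; many implementations additionally rescale the kept values by 1 / (1 - p), which is not done here.

```python
import random


def dropout(activations, p, rng):
    """Zero each value with probability p; keep it unchanged with probability 1 - p."""
    return [0.0 if rng.random() < p else a for a in activations]


rng = random.Random(0)
layer_output = [0.7, 1.2, 0.3, 2.5, 0.9, 1.8]   # hypothetical outputs of the previous layer

print(dropout(layer_output, p=0.5, rng=rng))    # a random subset is zeroed on every call
```

At test time the operation is normally disabled, so that predictions are deterministic.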
Although CNNs are a popular architecture used for many supervised tasks nowadays, they were already being used on some image classification datasets in the early 2000s. However, the datasets at the time were too small and the available computational power too weak [LeC16], such that generative Bayesian algorithms [FFFP07] still outperformed CNN approaches [LGRN09]. It was not until several years later, with the advance of general-purpose GPU processors combined with very large-scale image datasets with more than 1 million training examples spanning 1000 possible categories [RDS+15], that breakthrough performance was reported on classification tasks using CNNs [KSH12].

2.4.2 Word Embedding


The main breakthrough in deep learning techniques is the ability to learn data representations instead of relying on manual data crafting (i.e., feature engineering done a priori) for classification or analysis. NLP has seen many interesting applications since deep learning gained traction, which made possible the development of word embedding techniques.
Although the main idea was first published by Bengio in 2003 [BDVJ03], word embedding
has gained a lot of attention in recent years. A word embedding is defined as a parametrized
lookup function $f : V \to \mathbb{R}^n$, where $V$ is a set of words (or vocabulary) and $n$ is the
dimensionality of the word vector representations, chosen by the function designer. The parameter
$n$ is usually set such that the space is fairly high-dimensional, often $100 \le n \le 500$, $n \in \mathbb{N}$.
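As a minimal sketch of this definition, the lookup function can be implemented as a row selection in a matrix with one row per vocabulary word. The toy vocabulary, the dimensionality of 100 and the random initialization below are assumptions made purely for illustration.

```python
import numpy as np

vocabulary = ["good", "bad", "day", "<pad/>"]  # hypothetical toy vocabulary V
n = 100                                         # embedding dimensionality
word_to_index = {word: i for i, word in enumerate(vocabulary)}

rng = np.random.default_rng(42)
# One n-dimensional row per word; these rows are the parameters of f.
embedding_matrix = rng.normal(size=(len(vocabulary), n))


def f(word):
    """Lookup function f: V -> R^n, implemented as a row selection."""
    return embedding_matrix[word_to_index[word]]


vector = f("good")
```

Training a word embedding then amounts to updating the rows of `embedding_matrix`.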

Figure 2.4: Skip-gram model architecture.

Source: [Ron14]

The model depicted in figure 2.4, introduced by Mikolov et al. in 2013 [MSC+ 13], is
known as word2vec (short for "word to vector"), specifically the skip-gram model. In this model,
the goal of the network is to predict the context words given only one word. The model is
initialized with random Gaussian values for the words' initial representations and the weights are
updated as the network is trained. The learned vector representations for words have interesting
properties, such as directions encoding semantic properties and basic vector operations yielding
semantic results - the most used example is that the result of
$f(\text{king}) - f(\text{man}) + f(\text{woman})$ yields a vector very close to $f(\text{queen})$ [MYZ13].
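To make the skip-gram objective concrete, the sketch below enumerates the (input word, context word) pairs that such a network would be trained on. The helper name `skipgram_pairs` and the example sentence are hypothetical; only the idea of predicting neighbours within a window comes from the text.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for every word and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["the", "movie", "was", "great"], window=1)
```

Each pair is one training example: the network receives the first word and is asked to predict the second.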
The word vector representations are typically used as input for deep neural networks for
different tasks, such as inputs for convolutional neural networks for text classification [Kim14].
There are a number of different word embeddings [BDVJ03] [PSM14] [MSC+ 13], and the literature
has shown that although the performance of the word embedding algorithm has an impact on the
overall performance of the learning algorithm, it is not clear whether one representation is better
than another across all datasets; the best choice is rather application-specific [ZW15]. In this work,
pre-trained word2vec representations are used as inputs and are further tuned by the network - i.e.
the vector representations for the words are not static.

3
Development

The framework developed in this work has many distinct layers, spanning from data
collection through public APIs to the development and subsequent evaluation of a
deep learning-based convolutional neural network classifier. This chapter presents
the main libraries and development techniques used in the construction of this work. First,
there is an introduction to Twitter's Streaming API, which provides the data collected for
classification in this work. Then, the steps required for data pre-processing and label extraction
based on weakly supervised signals are discussed, followed by the presentation of the libraries
used for training the word vectors. Lastly, the general architecture of the model is shown,
along with the resources needed for its development.

3.1 Data Collection

For the data collection phase, an endpoint provided by Twitter for collecting
real-time data, known as the Streaming API1, was used. This endpoint provides low-latency,
high-frequency data access for developers. This kind of endpoint is different from a REST API, since
it requires keeping open a connection that receives and processes data as it comes from the real-time
stream. The data received from the stream can be freely manipulated and stored for later use, and
it is possible to filter the stream by a wide range of parameters, such as keywords, language and
location. An application that runs with Twitter's Streaming API usually follows the flow depicted
in figure 3.1.

1 https://dev.twitter.com/streaming/overview

Figure 3.1: Twitter Stream API flow.

Source: https://dev.twitter.com/streaming/overview

This API provides a 1% sample of the whole live stream of tweets on the network.
Given the microblogging platform's immense popularity (in January 2016, it was estimated
that there were 303 million tweets per day on average2), public tweets are very easy
to collect given the stream's high throughput, which made the Streaming API a very attractive
resource for researchers and practitioners to collect this data. Although the Firehose Stream
API3 provides 100% of the public statuses that are published in the network, this endpoint has a
very high financial cost associated (both for bandwidth and for licensing and permission of
use). Although the Streaming API is a highly used resource for data collection (especially for short,
informal texts), it should be noted that studies have shown that Twitter's sampling method to
select the 1% for the streaming connection might contain bias depending on the time period
when the data is being collected [MPLC13] [MPL14], such as when big worldwide events are
occurring (e.g.: during the 2014 World Cup). This concern is not addressed in this work.

3.1.1 Distant Supervision


Since the focus of this work is on supervised learning, every observation
in the dataset must have an associated ground truth. However, it is unfeasible to annotate a large
amount (several thousands to millions) of text data by hand. Alternatives such as Amazon
Mechanical Turk 4 and Clickworker 5, which provide marketplace environments for requesters that
want simple tasks (such as the categorization of short texts) to be done by humans and for workers
that get paid on a task-completion basis, are not cost effective and therefore should be used only in
cases when there are no other viable alternatives.

2 Source: http://uk.businessinsider.com/tweets-on-twitter-is-in-serious-decline-2016-2.
3 https://dev.twitter.com/streaming/firehose
4 https://www.mturk.com
5 https://www.clickworker.com

Table 3.1: Sentiment polarity to emoji mapping.

Sentiment    List of smiley faces
Positive     ":-)", "(-:", ":)", "(:"
Negative     ":-(", ")-:", ":(", "):"

In order to eliminate the need for annotating the data by hand, a labeling
technique based on pseudo signals called distant supervision [GBH09] was used. In casual
conversation on the web, it is common practice to use emojis (also: "smiley faces"
or "emoticons") as a way to express one's sentiment together with the message. Table 3.1 shows
a mapping of some of the most popular smileys to the sentiment associated with them.

Figure 3.2: Example of tweets with smiley faces.

Source: The author.

Those smileys are used as pseudo-labels that propagate their sentiment over the text. This
means that whenever a user includes one of those smileys in the text, it is likely that
the user expressed the same sentiment as the smiley used (except in cases of sarcasm). Those
smileys were passed as parameters to track tweets from the Streaming API that had at least one
of the emoticons in table 3.1 and were in English.
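The labeling rule can be sketched as a small function that maps a raw tweet to a pseudo-label using the smileys of table 3.1. The function name `pseudo_label` is hypothetical, and returning `None` for tweets that contain smileys of both polarities is an assumption of this sketch; the text does not state how such tweets were handled, and the actual crawler is the one listed in Appendix A.

```python
POSITIVE_SMILEYS = (":-)", "(-:", ":)", "(:")
NEGATIVE_SMILEYS = (":-(", ")-:", ":(", "):")


def pseudo_label(text):
    """Return 'positive', 'negative', or None when no unambiguous smiley is found."""
    has_positive = any(smiley in text for smiley in POSITIVE_SMILEYS)
    has_negative = any(smiley in text for smiley in NEGATIVE_SMILEYS)
    if has_positive == has_negative:
        # Either no smiley at all or conflicting smileys (assumed to be discarded).
        return None
    return "positive" if has_positive else "negative"


labels = [pseudo_label(t) for t in ("great day :)", "so tired :(", "no smiley here")]
```

Because the label comes from the smiley itself, the smiley has to be removed from the text before training, as discussed in the normalization step.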
The description above is a high-level abstraction of the implementation of the data
collector, which is a Python6 class that leverages an open-source library,
python-twitter7, that wraps requests to many of Twitter's endpoints for usage within
Python code8. The full code of the crawler is listed in Appendix A.
6 Widely used programming language for scientific computing. Homepage: https://www.python.org.
7 https://github.com/bear/python-twitter
8 Unless explicitly noted, all of the code for this work was written in Python.

3.2 Data Pre-Processing


After data collection, the tweets are very raw data that benefit from pre-processing in order
to reduce their size, complexity and sparsity. This section describes the steps performed to clean
and enrich the collected data.

3.2.1 Tokenization
Although tweets - or any text-based data - are composed of sequences of characters and
words, the Streaming API does not return the tweet as a sequence of tokens, but as a raw string.
The tokenization step is necessary for algorithms to work on top of sequences of words.
Twitter has a very specific grammar and (most of) its users are highly informal in the way
they express themselves. Regular tokenization techniques that split the string on punctuation,
or algorithms such as maximum matching, are not good choices to tokenize this particular form
of text data. In this work, an open source implementation9 of a Twitter-specific
tokenization technique [GSO+ 11][OKA10] was used to generate the tokens.

3.2.2 Normalization
Some of the most common words, such as prepositions, articles and some conjunctions, do not
carry much semantic value by themselves. Those words, which do not determine the meaning of the
sentence itself, are known as stopwords. Each of these words is removed from the sentence - with
the exception of the negation "not", which is very important for sentiment classification given
that it changes the polarity of what comes after the negation. Punctuation and tokens that have
fewer than three (3) characters are also removed.
On Twitter, it is possible to mention another user in the network by placing an "at" (@)
symbol before the other's username. Although it could be useful to know whether a tweet contains
a mention, it is not very useful when analysing the polarity of a single tweet to know which user is
being mentioned. The same logic applies to URLs - although it could be important to know if a user
tweeted a text together with a URL, it is probably not that useful to store which URL it is.
Therefore, every URL and mention found in a tweet is replaced by a special placeholder token as
seen in table 3.2.
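A minimal sketch of these normalization rules, applied to an already tokenized tweet, is given below. The stopword set is a tiny hypothetical subset, the regular expressions are assumptions of this example, and the order in which the rules are applied is not taken from the original reader (listed in Appendix B).

```python
import re
import string

# Hypothetical stopword subset; "not" is deliberately absent so that it is kept.
STOPWORDS = {"the", "and", "for", "with", "this", "that"}
URL_PATTERN = re.compile(r"^(https?|ftp)://\S+$")
MENTION_PATTERN = re.compile(r"^@\w+$")


def normalize(tokens):
    """Replace URLs and mentions by placeholders; drop stopwords, punctuation, short tokens."""
    normalized = []
    for token in tokens:
        if URL_PATTERN.match(token):
            normalized.append("<url/>")
        elif MENTION_PATTERN.match(token):
            normalized.append("<mnt/>")
        elif token.lower() in STOPWORDS:
            continue
        elif all(ch in string.punctuation for ch in token):
            continue
        elif len(token) < 3:
            continue
        else:
            normalized.append(token)
    return normalized


tokens = ["@POTUS", "this", "is", "not", "the", "best", "speech", "!", "https://nytimes.com"]
clean = normalize(tokens)
```

Checking for URLs and mentions first ensures that the placeholders survive the later filters.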
Another common step in preprocessing text data is to apply stemming to each word.
Stemming reduces variations, inflectional and derivative forms of a word to its root (or stem)
- e.g.: the stem for "automate", "automates", "automatic" and "automation" is "automat". It
should be noted that no stemming techniques were applied to the tokens in this work, given
that the words in this kind of text are usually already abbreviated or shortened versions of the
original word, such that applying stemming does not lead to much reduction of the
vocabulary size.

9 Source code of the implementation at: https://github.com/myleott/ark-twokenize-py/blob/master/twokenize.py.

Table 3.2: Mention and hyperlink normalization to placeholders.

Type       Common prefixes       Example                Placeholder
Mention    "@"                   @POTUS                 <mnt/>
URL        "http, https, ftp"    https://nytimes.com    <url/>

The remaining smiley faces in each tweet are removed. This step is necessary because
the label is assigned to the tweet according to the emoticon found in the sentence, so keeping the
emoticons would introduce bias in the classifier - e.g.: every time a sentence with a smiley face
was classified, it would always yield the polarity of the smiley as the prediction.
Finally, the sequence is padded to a fixed number of tokens. This step is performed only
for the convolutional neural network classifier, since it requires a fixed-length vector as input.
This padding is done by prepending a special token <pad/> until the tweet reaches the expected
sequence length, or by removing tokens from the beginning of the tweet until it is reduced to the
expected length (the fixed length and the reason for choosing it are discussed in the
Results chapter).
The techniques described in this section were used to implement an efficient, memory-friendly
reader for the text data collected via Twitter's Streaming API. The code is available for
reference in Appendix B.

3.3 Word Embedding


There are many readily available open source implementations of word embeddings,
including Google's original word2vec implementation in C10 [MSC+ 13] and Stanford's GloVe
(Global Vectors for Word Representation) [PSM14] implementation11. In this work, the
Gensim12 open-source implementation of word2vec was used to compute the word embeddings. This
implementation uses pre-compiled C code for speedup and also provides an easy-to-use Python
API for developers.

3.4 Convolutional Neural Network


The main contribution of this work is a Python implementation of a convolutional neural
network for binary sentiment classification based on prior work on deep neural networks for Twitter
classification [TWQ+ 14] [TWY+ 14] and for general text classification [Kim14].
There are a number of libraries that provide the requirements needed to model CPU-heavy
numeric computing based on data flow graphs. The most used libraries are Theano13, Berkeley's
Caffe14 and Google's TensorFlow15. However, those libraries are not always easy to use, so
there are also other open source libraries that provide APIs that run on top of numeric, matrix
computing libraries, such as Torch16 and Keras17. In this work, Keras was used to build the
convolutional neural network model, since it provides a Python API and runs on top of Theano.
The model closely follows the baseline reported in [Kim14] [ZW15]. The model
implemented in this work uses pre-trained word embedding vectors (from word2vec) as initial
weights of the embedding layer. Although this technique raises the complexity of the model, it has
been shown that it improves the performance of the model [Kim14].

10 Available at: https://code.google.com/archive/p/word2vec/
11 Available at: https://github.com/stanfordnlp/GloVe
12 Available at: https://radimrehurek.com/gensim/

Figure 3.3: High level abstraction of the implemented convolutional neural network
model.

Source: The author.

13 http://deeplearning.net/software/theano/
14 http://caffe.berkeleyvision.org
15 http://tensorflow.org
16 http://torch.ch
17 http://keras.io

The model has four convolutional layers at the same depth. Each convolutional layer has
its own filter length: 4, 5, 6 and 7. Using multiple filters of different sizes allows the network
to capture different features at different lengths of each feature vector. 1-max pooling is then
performed on the output of each convolutional filter; the outputs of the pooling layers are merged
and passed through a dropout layer with 0.5 probability of keeping each feature. The output of the
dropout layer is then fully connected to the 2-dimensional softmax output layer for classification.
The model is trained with backpropagation, using SGD for optimization and categorical cross-entropy
as the loss. The full model architecture is presented in figure 3.3. The implementation of the model
is listed in Appendix C.
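The data flow of this architecture can be illustrated with a forward pass written directly in NumPy. This is only a sketch under stated assumptions, not the Keras implementation of Appendix C: all weights are random, no training takes place, dropout is omitted because it is only active during training, and the helper names (`conv1d_relu`, `forward`) are hypothetical.

```python
import numpy as np


def conv1d_relu(x, filters, bias):
    """Valid 1-D convolution over the token axis followed by ReLU.

    x: (seq_len, emb_dim); filters: (n_maps, width, emb_dim); bias: (n_maps,).
    Returns an array of shape (seq_len - width + 1, n_maps).
    """
    n_maps, width, _ = filters.shape
    steps = x.shape[0] - width + 1
    out = np.empty((steps, n_maps))
    for t in range(steps):
        window = x[t:t + width]  # (width, emb_dim) slice of consecutive tokens
        out[t] = np.tensordot(filters, window, axes=([1, 2], [0, 1])) + bias
    return np.maximum(out, 0.0)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def forward(x, conv_params, w_out, b_out):
    # 1-max pooling: keep the largest activation of each feature map.
    pooled = [conv1d_relu(x, f, b).max(axis=0) for f, b in conv_params]
    features = np.concatenate(pooled)  # merge the four parallel branches
    return softmax(features @ w_out + b_out)


rng = np.random.default_rng(0)
seq_len, emb_dim, n_maps = 20, 20, 100
x = rng.normal(size=(seq_len, emb_dim))  # one padded tweet as embedded tokens
conv_params = [
    (rng.normal(scale=0.1, size=(n_maps, width, emb_dim)), np.zeros(n_maps))
    for width in (4, 5, 6, 7)
]
w_out = rng.normal(scale=0.1, size=(4 * n_maps, 2))
b_out = np.zeros(2)
probs = forward(x, conv_params, w_out, b_out)
```

The four branches each contribute 100 pooled features, so the softmax layer receives a 400-dimensional vector regardless of the filter lengths.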

3.5 Baseline models


The performance of the implemented CNN model is tested against two baseline models,
Logistic Regression (LogReg) and support vector machines (SVM). Feature-engineered methods have
achieved good performance in many different works [BF10] [MKZ13] [GBH09] that used
traditional supervised classifiers. In this work, the implemented model has its performance
compared with a logistic regression classifier and a support vector machine with linear kernel
(linear kernels have been shown to be a good choice of kernel for text classification [Joa98]).
The LogReg and SVM classifiers do not use word embeddings as input vectors for
classification, but sparse TF-IDF vectors built from the corpus of Twitter data. Although the
representation is different, the same preprocessing steps (tokenization, stopword removal,
normalization, smiley face removal) applied before embedding the words are used for the
TF-IDF vectorization as well. For the SVM and LogReg models and for the TF-IDF sentence
vectorization, the widely used Scikit-Learn18 library was used, which is an open source Python
library for machine learning applications development.
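A minimal sketch of such baselines with Scikit-Learn is shown below. The four-sentence corpus is a hypothetical toy example, and all hyperparameters are library defaults except the unigram plus bigram range; the settings actually used for the experiments are not stated in this chapter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus of already pre-processed tweets (1 = positive, 0 = negative).
train_texts = ["love great day", "happy good news", "sad bad day", "hate terrible news"]
train_labels = [1, 1, 0, 0]
test_texts = ["great happy day", "terrible sad news"]

# Each baseline is a sparse TF-IDF vectorizer followed by a linear classifier.
logreg = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
svm = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())

logreg.fit(train_texts, train_labels)
svm.fit(train_texts, train_labels)

logreg_predictions = logreg.predict(test_texts)
svm_predictions = svm.predict(test_texts)
```

Wrapping the vectorizer and the classifier in one pipeline guarantees that the TF-IDF vocabulary is fitted only on the training data.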

18 http://scikit-learn.org/stable/

4
Results and Analysis

In order to validate the effectiveness of the model and techniques described, the model
is tested against some traditional, baseline supervised machine learning classifiers - a logistic
regression classifier and a support vector machine. This chapter describes the environment in
which the tests were performed, the collected dataset and the performance for the aforementioned
classifiers.

4.1 Environment
Unless explicitly noted otherwise, every test described in this chapter was run in the
environment described below.

Hardware Environment

Processor 2.6 GHz Intel Core i5

Memory 8 GB 1600 MHz DDR3

Software Environment

Operating System OS X El Capitan 10.11.5

Python Python 2.7.12 :: Continuum Analytics, Inc

gensim 0.12.4
keras 1.1.0
nltk 3.2.1
numpy 1.11.2
scikit-learn 0.18
tensorflow 0.11.0rc0
theano 0.9.0

4.2 Exploratory Analysis


The Twitter data collector described in chapter 3 was active from October 24, 2016
until November 6, 2016. It stored 1,126,000 tweets with pseudo-labels on disk in 1126 JSON files
(121MB of total data) with 1,000 labeled tweets each. The collected dataset has 785,037 tweets
with a positive label and 340,963 tweets labeled as negative (via the smiley face mapping described
in table 3.1). Since the dataset is imbalanced, only the first 340,963 positive tweets are used;
therefore, only 681,926 tweets were used for training and evaluation.
It is important to determine the length of the sequence padding for the convolutional
neural network model through inspection of the dataset - if the padding length is too small,
a lot of data might be lost; however, if the padding length is too large, it
might introduce unnecessary pad tokens and dilute the relative frequency of the other tokens in
the vocabulary. The length for sequence padding was chosen by observing the distribution of
the lengths and choosing a sequence length that covered the longest sequence that did
not qualify as an outlier.

Figure 4.1: Distribution of the length of tweets with outliers identified through the
interquartile range method.

Source: The author.

A simple technique to detect outliers is the Interquartile Range (IQR) based method.
The IQR is defined as the difference between the third quartile ($Q_3$) and the first quartile
($Q_1$). A potential outlier is defined as an observation that falls either below
$Q_1 - 1.5 \cdot \mathrm{IQR}$ or above $Q_3 + 1.5 \cdot \mathrm{IQR}$. The box plot on figure 4.1
helps visualize observations that fall outside of this valid range.
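The IQR rule can be sketched with NumPy as follows. The list of tweet lengths is a made-up toy sample used only to show the computation, and the quartiles follow NumPy's default linear interpolation, which may differ slightly from the method used to produce figure 4.1.

```python
import numpy as np


def iqr_bounds(values):
    """Return the (lower, upper) limits outside of which a value is a potential outlier."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


lengths = [2, 4, 4, 5, 6, 7, 8, 9, 10, 30]  # hypothetical tweet lengths in tokens
lower, upper = iqr_bounds(lengths)
outliers = [n for n in lengths if n < lower or n > upper]
```

For this sample the quartiles are 4.25 and 8.75, so the valid range is [-2.5, 15.5] and only the tweet with 30 tokens is flagged.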

Through that simple rule it is possible to determine that tweets that have a length of at least
19 tokens should be considered outliers - the length of sequence padding was then defined as
20. Then, every tweet of length $L > 20$ has its first $L - 20$ tokens
stripped from the sequence and every tweet of length $L < 20$ has $20 - L$ special padding tokens
<pad/> prepended to the beginning of the sequence.
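The padding and truncation rule can be written as a short function. The name `pad_or_truncate` is hypothetical; the behaviour follows the description above, with the sequence length of 20 and the <pad/> token taken from the text.

```python
def pad_or_truncate(tokens, length=20, pad_token="<pad/>"):
    """Strip the first L - length tokens of long tweets; prepend padding to short ones."""
    if len(tokens) > length:
        return tokens[len(tokens) - length:]
    return [pad_token] * (length - len(tokens)) + tokens


short = pad_or_truncate(["not", "good", "day"])
long_tweet = ["tok%d" % i for i in range(23)]  # hypothetical tweet with 23 tokens
truncated = pad_or_truncate(long_tweet)
```

Both branches return exactly 20 tokens, so every tweet maps to an input of the same shape.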

4.3 Word Vectors


The embedded word vectors were trained with the word2vec [MSC+ 13] algorithm through
gensim's open-source implementation as described in the previous chapter. Since the focus of
this work is on the CNN for classification and not on the word2vec network itself, the parameters
were not fine-tuned for the task. However, the main hyperparameter that was manipulated is
the dimensionality of the vectors. Each trained CNN model has a different size for the feature
vector - three different models were trained using sizes 20, 100 and 200. The remaining
parameters used in the word2vec models are described in table 4.1.

Table 4.1: word2vec parameters table.

Parameter        Short description                                               Value
Window           Context window between current and predicted word              3
Minimum count    Minimum frequency of a word to be included in the vocabulary   3
Epochs           Number of iterations over the corpus                            5

4.4 Classifiers
Three main models were compared: the main CNN architecture developed in this work,
a logistic regression classifier and a support vector machine classifier with linear kernel. This
section provides a description of the developed model's complexity and an evaluation
of its performance in contrast with the two baseline classifiers.

4.4.1 CNNs Model Complexity


Three different CNN models were trained for evaluation. All three networks
share the architecture described in figure 3.3, but each has its own
embedding size for the word vectors: 20, 100, 200 - in this chapter, the networks are
referred to as CNN20, CNN100 and CNN200, while the baseline classifiers are referred to as
LogReg and SVM. The parameters common to all of the CNN implementations are
described in table 4.2.

Table 4.2: Common parameters for every CNN implemented model.

Parameter                   Short description                                     Value
Sequence length             Length of input sequence of word vectors              20
Length of filters           Extension of convolutional filters                    [4, 5, 6, 7]
Number of feature maps      Dimensionality of convolutional layer output          100
Activation function         Activation function of convolutional units            ReLU
Classification function     Function used for classification at the final layer   Softmax
Model objective function    Loss function for training model                      Categorical Crossentropy
Optimizer                   Optimization for objective function                   SGD

Table 4.3: Description of total training time for each CNN.

CNN model variation    Word Vector Size    Total training time
CNN20                  20                  50 minutes 41 seconds
CNN100                 100                 3 hours 32 minutes 38 seconds
CNN200                 200                 4 hours 29 minutes 03 seconds

Since training each CNN architecture is computationally expensive, a great deal of time is
sometimes required to train each model. Due to hardware constraints, in this
work each CNN was trained only once over 10 epochs (i.e.: 10 iterations over the Twitter corpus)
on a 70-30 split for training and test. This is a sub-optimal approach which does not take
advantage of more statistically sound approaches such as performing K-fold cross validation in
order to extract better metrics for evaluation. The figure ?? shows the distribution of the training
time per epoch of each CNN model according to the size of the word vector hyperparameter;
table 4.3 shows the total training time for each model.

The increasing number of parameters is mainly due to the fact that the word vectors
are passed as initial weights to the embedding layer. This technique allows the word vectors
to be refined and have their values updated as the model is trained. This approach is not always
necessary, and in some experimental settings just using the static pre-trained word vectors learned
a priori yields almost as good performance as using non-static vectors [ZW15]. Table 4.4
below shows how each layer contributes to the number of parameters of each model.

Table 4.4: Number of parameters for each layer of the CNN models.

Model     Embedding Layer    Convolutional Layer    Output Layer    Total
CNN20     1,293,720          44,400                 802             1,338,922
CNN100    6,468,600          220,400                802             6,689,802
CNN200    12,937,200         440,400                802             13,378,402
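The figures in table 4.4 can be reproduced from the architecture with a short calculation. In the sketch below, the vocabulary size of 64,686 words is not stated in the text; it is inferred by dividing the embedding layer size of CNN20 by its embedding dimension (1,293,720 / 20), and the function name is hypothetical.

```python
VOCABULARY_SIZE = 64686         # inferred from table 4.4, not stated in the text
FILTER_LENGTHS = (4, 5, 6, 7)
FEATURE_MAPS = 100
CLASSES = 2


def parameter_counts(embedding_dim):
    """Return (embedding, convolutional, output, total) parameter counts."""
    embedding = VOCABULARY_SIZE * embedding_dim
    # Each filter has length * embedding_dim weights per feature map, plus one bias per map.
    convolutional = sum(
        length * embedding_dim * FEATURE_MAPS + FEATURE_MAPS for length in FILTER_LENGTHS
    )
    # Dense softmax layer over the 4 * 100 pooled features, plus one bias per class.
    output = len(FILTER_LENGTHS) * FEATURE_MAPS * CLASSES + CLASSES
    return embedding, convolutional, output, embedding + convolutional + output


counts = {dim: parameter_counts(dim) for dim in (20, 100, 200)}
```

The embedding layer accounts for more than 96% of the parameters in all three models, which is why the total grows almost linearly with the word vector size.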

4.4.2 Performance Evaluation

In this section, the model proposed in this work has its performance compared with two
of the most used traditional machine learning algorithms in text classification: logistic regression
and support vector machines. Figure 4.2 shows how the logarithmic loss of each model
decreases over the epochs, and figure 4.3 shows how the training accuracy increases over the
training epochs. It is observed that the higher the dimensionality of the word embedding, the smaller
the model's logarithmic loss and the higher the training accuracy. The results of the evaluation of
the CNN model variations and of the baseline models are listed in table 4.5. The models were
evaluated on precision, recall, macro F-score and accuracy. For the baseline models, the mean
and standard deviation across 3-fold cross validation are reported as (mean, std).
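For reference, the four metrics can be computed from a list of true and predicted labels as sketched below. The example labels are made up, and macro-averaging precision and recall in the same way as the F-score is an assumption of this sketch; the text only states that the F-score is macro-averaged.

```python
def macro_scores(y_true, y_pred, labels=(0, 1)):
    """Return macro precision, macro recall, macro F-score and accuracy."""
    precisions, recalls, f_scores = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(f_score)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f_scores) / n, accuracy


y_true = [1, 1, 1, 0, 0, 0]  # hypothetical ground truth
y_pred = [1, 1, 0, 0, 0, 1]  # hypothetical predictions
precision, recall, f_score, accuracy = macro_scores(y_true, y_pred)
```

On a balanced test set such as the one used here, the four values tend to be close to each other, which matches the pattern of table 4.5.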

Figure 4.2: History of the models' logarithmic loss over 10 epochs of training.

Source: The author.



Figure 4.3: History of the models' training accuracy over 10 epochs of training.

Source: The author.

Table 4.5: Results of the CNN variations developed compared against baseline classifiers.

Model                     Precision       Recall          F-Score         Accuracy
CNN20                     0.782           0.773           0.772           0.774
CNN100                    0.802           0.800           0.800           0.800
CNN200                    0.813           0.809           0.808           0.809
Logistic Regression       (.816, .001)    (.816, .001)    (.816, .001)    (.816, .001)
SVM with Linear Kernel    (.837, .001)    (.837, .001)    (.837, .001)    (.837, .001)

From the results in table 4.5 it is easy to note that the size of
the embedding for the word vectors greatly influences the accuracy of the model - there is an
increase of 3.5 percentage points in accuracy (from 0.774 to 0.809) by just increasing the
dimensionality from 20 to 200. However, the improvement in classification comes at the cost of an
increase in the total number of parameters and in training time, as shown in table 4.3.
The CNN200 variation performs comparably to the logistic regression model and
is closely aligned with previous work [Kim14], although the SVM with linear kernel
still outperforms the other models evaluated by a reasonable margin. The main advantages of the SVM
and LogReg models in this scenario are the ease of implementation and of data preparation for
classification. However, as the dataset grows larger, it is expected that the CNN model rivals and
eventually beats the SVM benchmark. Also, the need for traditional models to work with sparse
matrices makes them hard to scale.

5
Conclusion

This work presented a full end-to-end framework to collect short and informal
texts from the web - specifically from the microblogging social network Twitter, via connection to
its real-time Streaming API - and to evaluate the performance of a deep learning-based classifier
for classifying tweets as either positive or negative.

One of the major drawbacks in using a convolutional neural network for classification
is the complexity of the model. The number of parameters to estimate is fairly high even for a
shallow network that does not run in a production environment - leading to a very high training
time that does not scale on CPU, even for such a small dataset. In order to take full
advantage of a CNN model, it is necessary to leverage GPUs, which can increase the training
speed of a CNN by a factor of 100 [SSB10].

Also, this kind of approach does not seem to be necessarily the best choice for small
datasets. On a dataset of the order of hundreds of thousands of tweets, an SVM classifier with a
linear kernel has the highest presented accuracy while using a simple representation of the data -
a uni-bigram bag-of-words (BOW) model with TF-IDF vectorization.

The main contribution of this work, however, is the study of the presented model's
performance for classification on this particular kind of text data, based on previous work in
the field either in general text classification [Kim14] or on Twitter itself [TWQ+ 14] [TWY+ 14].
While some of the most accurate classifiers are built on top of extensive feature engineering
[MKZ13], the presented model shifts the focus to letting the model itself learn the best
possible inner representation for the data.

This approach is highly sensitive to the amount of training data, both for the pre-trained
unsupervised word vectors and for the classifier. Also, it is expected that a CNN-based
approach achieves state-of-the-art performance as the dataset grows larger, and it is highly
scalable in production environments that leverage GPUs, since it does not need a
very large sparse matrix for data representation.

5.1 Future Work


The presented model is reasonably simple and shallow compared to many other works in
the area. A better fine-tuning of the model's hyperparameters and the addition of more stacked
convolutional layers could be investigated to analyze how the performance is impacted. Also,
the use of GPUs not only further speeds up the training of the models but also shortens
the trial-and-error cycle of evaluating a minor tweak or a major change in a deep
network, which speeds up the research process as a whole.
The literature on binary sentiment classification is vast and the problem is nearing
a "solved" status, in which classifiers are fast, scalable and achieve performance very close to
human classification - which is one of the major goals of machine learning as a field. A
closely related but still not fully explored problem is how one could leverage the amount of text
data in English to train a text classifier in another target language.
Although there is a lot of work using Recurrent Neural Network (RNN) architectures
to achieve impressive results on machine translation [JSL+ 16] and some literature on
cross-language text classifiers [PS10] [Wan09], there seems to be no study on how one could combine
deep networks such as CNNs and RNNs to perform both translation and classification to a target
language.

References

[BDVJ03] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A neural
probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137-1155,
2003.

[BF10] Luciano Barbosa and Junlan Feng. Robust sentiment detection on Twitter from
biased and noisy data. In Proceedings of the 23rd International Conference on
Computational Linguistics: Posters, pages 36-44. Association for Computational
Linguistics, 2010.

[Dom12] Pedro Domingos. A few useful things to know about machine learning. Commun.
ACM, 55(10):78-87, October 2012.

[ES06] Andrea Esuli and Fabrizio Sebastiani. Sentiwordnet: A publicly available lexical
resource for opinion mining. In Proceedings of LREC, volume 6, pages 417-422.
Citeseer, 2006.

[FFFP07] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from
few training examples: An incremental bayesian approach tested on 101 object
categories. Computer Vision and Image Understanding, 106(1):59-70, 2007.

[GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. Book in
preparation for MIT Press, 2016.

[GBH09] Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using
distant supervision. CS224N Project Report, Stanford, 1:12, 2009.

[GL14] Yoav Goldberg and Omer Levy. word2vec explained: deriving Mikolov et al.'s
negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.

[GSO+ 11] Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills,
Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A
Smith. Part-of-speech tagging for Twitter: Annotation, features, and experiments.
In Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies: short papers-Volume 2, pages 42-47.
Association for Computational Linguistics, 2011.

[HL04] Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In Proceedings
of the Tenth ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, KDD '04, pages 168-177, New York, NY, USA, 2004. ACM.

[Joa98] Thorsten Joachims. Text categorization with support vector machines: Learning
with many relevant features. In European conference on machine learning, pages
137-142. Springer, 1998.

[JSL+ 16] Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng
Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff
Hughes, and Jeffrey Dean. Google's multilingual neural machine translation
system: Enabling zero-shot translation. CoRR, abs/1611.04558, 2016.

[Kim14] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint
arXiv:1408.5882, 2014.

[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification
with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou,
and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems
25, pages 1097-1105. Curran Associates, Inc., 2012.

[LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature,
521(7553):436-444, 2015.

[LeC16] Yann LeCun. Deep learning.
https://indico.cern.ch/event/510372/attachments/1245509/1840815/lecun-20160324-cern.pdf,
March 2016.

[LGRN09] Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional
deep belief networks for scalable unsupervised learning of hierarchical representations.
In Proceedings of the 26th annual international conference on machine
learning, pages 609-616. ACM, 2009.

[MKB+ 11] Tomas Mikolov, S. Kombrink, L. Burget, J. Černocký, and S. Khudanpur. Extensions
of recurrent neural network language model. In 2011 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pages 5528-5531, May
2011.

[MKZ13] Saif M. Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. Nrc-canada: Building
the state-of-the-art in sentiment analysis of tweets. In Proceedings of the seventh
international workshop on Semantic Evaluation Exercises (SemEval-2013), Atlanta,
Georgia, USA, June 2013.

[MPL14] Fred Morstatter, Jürgen Pfeffer, and Huan Liu. When is it biased?: assessing the
representativeness of twitter's streaming api. In Proceedings of the 23rd International
Conference on World Wide Web, pages 555–556. ACM, 2014.

[MPLC13] Fred Morstatter, Jürgen Pfeffer, Huan Liu, and Kathleen M Carley. Is the sample
good enough? comparing data from twitter's streaming api with twitter's firehose.
arXiv preprint arXiv:1306.5204, 2013.

[MSC+ 13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed
representations of words and phrases and their compositionality. In Advances
in neural information processing systems, pages 3111–3119, 2013.

[MYZ13] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in
continuous space word representations. In HLT-NAACL, volume 13, pages 746–751,
2013.

[OKA10] Brendan O'Connor, Michel Krieger, and David Ahn. Tweetmotif: Exploratory
search and topic summarization for twitter. In ICWSM, 2010.

[Ola14] Christopher Olah. Deep learning, nlp, and representations.
http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/, July
2014.

[PS10] Peter Prettenhofer and Benno Stein. Cross-language text classification using structural
correspondence learning. In Proceedings of the 48th Annual Meeting of
the Association for Computational Linguistics, pages 1118–1127. Association for
Computational Linguistics, 2010.

[PSM14] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global
vectors for word representation. In EMNLP, volume 14, pages 1532–43, 2014.

[PVG+ 11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python.
Journal of Machine Learning Research, 12:2825–2830, 2011.

[RDS+ 15] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean
Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander
C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge.
International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[Ron14] Xin Rong. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738,
2014.

[SSB10] Dominik Scherer, Hannes Schulz, and Sven Behnke. Accelerating large-scale convolutional
neural networks with parallel graphics multiprocessors. In International
Conference on Artificial Neural Networks, pages 82–91. Springer, 2010.

[SVL14] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with
neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and
K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27,
pages 3104–3112. Curran Associates, Inc., 2014.

[Tur02] Peter D Turney. Thumbs up or thumbs down?: semantic orientation applied to
unsupervised classification of reviews. In Proceedings of the 40th annual meeting
on association for computational linguistics, pages 417–424. Association for
Computational Linguistics, 2002.

[TWQ+ 14] Duyu Tang, Furu Wei, Bing Qin, Ting Liu, and Ming Zhou. Coooolll: A deep
learning system for twitter sentiment classification. In Proceedings of the 8th
International Workshop on Semantic Evaluation (SemEval 2014), pages 208–212,
2014.

[TWY+ 14] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. Learning
sentiment-specific word embedding for twitter sentiment classification. In ACL (1),
pages 1555–1565, 2014.

[Wan09] Xiaojun Wan. Co-training for cross-lingual sentiment classification. In Proceedings
of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International
Joint Conference on Natural Language Processing of the AFNLP: Volume
1-Volume 1, pages 235–243. Association for Computational Linguistics, 2009.

[ZW15] Ye Zhang and Byron C. Wallace. A sensitivity analysis of (and practitioners' guide
to) convolutional neural networks for sentence classification. CoRR, abs/1510.03820,
2015.
Appendix

A TweetCrawler

This appendix contains the source code used to connect to Twitter's Streaming API,
using the smiley faces in Table 3.1 as keywords for the track parameter and English as the target
language. To avoid holding every tweet in memory (each one arrives with embedded metadata), the
class keeps a buffer that, whenever it reaches full capacity (1000 tweets in the script below; the
constructor default is 100), dumps its contents to disk and is then cleared.
# coding=utf-8
import sys
import codecs
import twitter
import ujson as json
import cPickle as pickle
import time
from datetime import datetime
from random import choice
from string import hexdigits

# Smiley faces used both as track keywords and as noisy sentiment labels.
POS = [':)', '(:', ':-)', '(-:']
NEG = [':(', '):', ':-(', ')-:']


class TweetCrawler:
    LANG = "lang"
    TEXT = "text"
    ENGLISH = "en"
    NEW_LINE = "\n"

    def __init__(self, key_file_path, keywords=None, buffer_limit=100):
        self.key_file_path = key_file_path
        self.keywords = keywords
        self.stream = None
        self.buffer = []
        self.buffer_limit = buffer_limit

    def trim(self, text):
        # Flatten the tweet to one line and label it by the smiley it contains.
        text = text.replace('\n', ' ').strip()
        label = +1 if any(text.find(i) != -1 for i in POS) else -1
        return json.dumps({"text": text, "label": label})

    def __is_text_valid(self, text):
        return text is not None and text != "" and len(text) > 5

    def __is_tweet_valid__(self, tweet):
        # Keep only English tweets that carry a non-trivial text field.
        if TweetCrawler.LANG in tweet and TweetCrawler.TEXT in tweet:
            lang = tweet[TweetCrawler.LANG]
            text = tweet[TweetCrawler.TEXT]
            return lang == TweetCrawler.ENGLISH and \
                self.__is_text_valid(text)
        return False

    def get_keys(self):
        # The key file holds one API credential per line.
        with open(self.key_file_path, "r") as keyfile:
            keys = [key.strip() for key in keyfile]
        return keys

    def get_api_connection(self):
        keys = self.get_keys()
        return twitter.Api(keys[0], keys[1], keys[2], keys[3])

    def init_stream(self):
        api = self.get_api_connection()
        self.stream = api.GetStreamFilter(track=self.keywords)

    def __random_prefix__(self):
        return ''.join([choice(hexdigits) for i in xrange(6)])

    def is_buffer_full(self):
        return len(self.buffer) >= self.buffer_limit

    def get_file_name(self):
        now = time.time()
        dt = datetime.fromtimestamp(now)
        ts = "%d%d%dT%d%d" % (dt.year, dt.month, dt.day, dt.hour, dt.minute)
        fname = self.__random_prefix__() + "_" + ts + "_tweets.json"
        return fname

    def dump_buffer(self, path):
        # Write one JSON document per line, then clear the buffer.
        fname = path + self.get_file_name()
        with codecs.open(fname, 'w', "utf-8") as f:
            for tweet in self.buffer:
                tweet = self.trim(tweet)
                f.write(tweet + TweetCrawler.NEW_LINE)
        self.buffer = []

    def crawl_tweets(self, keywords=None, path='./data/'):
        if keywords is not None:
            self.keywords = keywords
        self.init_stream()
        for i, tweet in enumerate(self.stream):
            print "Tweet: %d" % i
            if self.__is_tweet_valid__(tweet) and not self.is_buffer_full():
                self.buffer.append(tweet[TweetCrawler.TEXT])
                print 'Valid on buffer: %d' % len(self.buffer)
            elif self.is_buffer_full():
                print "Dumping...\n...\n"
                self.dump_buffer(path)
            else:
                continue


def main():
    keywords = POS + NEG
    limit = sys.argv[-1]  # read from the command line but not used below
    crawler = TweetCrawler('./crawler/keys.txt',
                           keywords=keywords,
                           buffer_limit=1000)
    crawler.crawl_tweets()


if __name__ == '__main__':
    main()
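The buffering strategy described at the start of this appendix can be tried without a Twitter connection. The following is a minimal sketch, not the crawler itself: `BufferedWriter`, its `add` and `flush` methods, and the batch file names are hypothetical names introduced here, the labelling mirrors `trim`, and the code targets Python 3, whereas the listing above is Python 2.

```python
import json
import os
import tempfile

POS = [":)", "(:", ":-)", "(-:"]


def label_of(text):
    """Return +1 if the text contains a positive smiley, else -1 (as in trim)."""
    return 1 if any(s in text for s in POS) else -1


class BufferedWriter(object):
    """Hypothetical helper: collects texts and flushes them to disk in batches."""

    def __init__(self, out_dir, limit=3):
        self.out_dir = out_dir
        self.limit = limit
        self.buffer = []
        self.files_written = 0

    def add(self, text):
        # Append first, then flush as soon as the buffer reaches its limit.
        self.buffer.append(text)
        if len(self.buffer) >= self.limit:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        name = "batch_%03d_tweets.json" % self.files_written
        with open(os.path.join(self.out_dir, name), "w") as f:
            for text in self.buffer:
                record = {"text": text.replace("\n", " ").strip(),
                          "label": label_of(text)}
                f.write(json.dumps(record) + "\n")  # one JSON document per line
        self.files_written += 1
        self.buffer = []


out_dir = tempfile.mkdtemp()
writer = BufferedWriter(out_dir, limit=3)
for t in ["great day :)", "so tired :(", "love it (:", "bad\nnews ):"]:
    writer.add(t)
writer.flush()  # write the partially filled buffer as well
```

With a limit of 3, the four example texts end up in two files. The sketch checks capacity right after appending, so the item that fills the buffer is written with its batch; this is a design choice of the sketch and differs from the control flow in `crawl_tweets`.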

B Preprocessing

B.1 Parser
The following script is the implementation used in this work for preprocessing a tweet.
It tokenizes the tweet, filters out unwanted tokens (stopwords, punctuation, short tokens), removes
smiley faces, replaces mentions and URLs with placeholders, and pads the sentence.
import ujson as json
import re
from twokenize import tokenizeRawTweetText
from nltk.corpus import stopwords
from string import digits, punctuation
from hyper_params import SEQUENCE_LENGTH

# "not" shouldn't be a stop word, given it's a polarity changer
EN_STOP = stopwords.words('english')
EN_STOP.remove('not')

urlrxp = re.compile(
    r'^(?:http|ftp)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE
)

# Should remove the emojis from the text,
# so it won't bias the classifier.
POS = [':)', '(:', ':-)', '(-:']
NEG = [':(', '):', ':-(', ')-:']
EMOJIS = POS + NEG
BAD_CHARS = digits + punctuation

CUT = 3  # tokens shorter than this are dropped
MENTION = "<mnt/>"
URL = "<url/>"
PAD = "<pad/>"


def is_url(token):
    return urlrxp.match(token) is not None


def only_digits_or_punctuation(token):
    return all(i in BAD_CHARS for i in token)


def is_too_short(token):
    return len(token) < CUT


def pad(sequence):
    # Keep the last SEQUENCE_LENGTH tokens; shorter sequences are padded on the left.
    len_seq = len(sequence)
    if len_seq > SEQUENCE_LENGTH:
        return sequence[-SEQUENCE_LENGTH:]
    padded = [PAD] * SEQUENCE_LENGTH
    padded[-len_seq:] = sequence
    return padded


def should_be_kept(token):
    token = token.lower()
    return not (token in EN_STOP or \
                token in EMOJIS or \
                only_digits_or_punctuation(token) or \
                is_too_short(token))


def placeholder_and_lower(token):
    # Replace mentions and URLs with placeholders; lowercase everything else.
    if token[0] == "@":
        return MENTION
    elif is_url(token):
        return URL
    else:
        return token.lower()


def parse(text, should_pad=True):
    tokens = tokenizeRawTweetText(text)
    tokens = filter(should_be_kept, tokens)
    tokens = map(placeholder_and_lower, tokens)
    return pad(tokens) if should_pad else tokens
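To make the padding step concrete, here is a small self-contained sketch of the left-padding and truncation performed by `pad`. It is an illustration under assumptions: `SEQUENCE_LENGTH = 5` is a value chosen for the example (the real one lives in `hyper_params`), and `left_pad` is a hypothetical name, not part of the parser.

```python
PAD = "<pad/>"
SEQUENCE_LENGTH = 5  # assumed for illustration only


def left_pad(tokens, length=SEQUENCE_LENGTH, pad=PAD):
    """Keep the last `length` tokens and fill missing positions on the left."""
    tokens = list(tokens)[-length:]
    return [pad] * (length - len(tokens)) + tokens


short = left_pad(["good", "movie"])   # padded on the left up to 5 tokens
long_ = left_pad(list("abcdefg"))     # truncated to the last 5 tokens
empty = left_pad([])                  # five pad tokens
```

One difference from the listing is worth noting: for an empty sequence, the slice assignment `padded[-len_seq:] = sequence` in `pad` becomes `padded[0:] = []`, so `pad([])` returns an empty list, whereas the sketch returns a sequence made entirely of pad tokens.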

B.2 Corpus Reader


The following implementation, TweetCorpusReader, is a Python class that reads the
raw text data stored on disk and returns normalized, processed text together with the label
associated with each tweet.
import ujson as json
import hyper_params, logging, os, sys, time
from tweet_parser import parse
from keras.preprocessing import sequence
from hyper_params import SEQUENCE_LENGTH
import numpy as np

TWEET_SUFFIX = "tweets.json"
TEXT = "text"
LABEL = "label"
PAD = "<pad/>"


class TweetCorpusReader(object):
    """
    Class for reading efficiently the Twitter corpora.
    Reads and parses tweets on the fly, returning
    only those that have at least two tokens.
    """
    def __init__(self, data_path, text_only=True, w2i=None, debug=False,
                 should_pad=True):
        super(TweetCorpusReader, self).__init__()
        self.data_path = data_path
        self.text_only = text_only
        self.embedding = False
        self.w2i = w2i  # word-to-index mapping; None yields raw tokens
        self.unk = None if w2i is None else len(self.w2i)
        self.debug = debug
        self.should_pad = should_pad

    def load_and_process(self, doc):
        doc = json.loads(doc)
        text = parse(doc[TEXT], should_pad=self.should_pad)
        label = doc[LABEL]
        return (text, label)

    def set_w2i(self, w2i):
        self.w2i = w2i

    def __pad__(self, sequence):
        # Drop out-of-vocabulary tokens, pad on the left, then map to indexes.
        # PAD must be present in w2i for the lookup below.
        sequence = [token for token in sequence if token in self.w2i]
        len_seq = len(sequence)
        if len_seq > SEQUENCE_LENGTH:
            return sequence[-SEQUENCE_LENGTH:]
        padded = [PAD] * SEQUENCE_LENGTH
        padded[-len_seq:] = sequence
        padded = [self.w2i[token] for token in padded]
        return padded

    def __get_output__(self, label):
        # One-hot target: [1, 0] for positive, [0, 1] for negative.
        return np.array([1.0, 0.0]) if label == 1 else np.array([0.0, 1.0])

    def __iter__(self):
        json_files = []
        for json_file in os.listdir(self.data_path):
            if json_file.endswith(TWEET_SUFFIX):
                json_files.append("{0}/{1}".format(self.data_path,
                                                   json_file))
        for json_file in json_files:
            with open(json_file, "r") as f_in:
                tweets_in_file = [self.load_and_process(doc) for doc in
                                  f_in]
            if self.w2i is None:
                for tweet, label in tweets_in_file:
                    if len(tweet) > 1:
                        yield tweet if self.text_only else (tweet,
                                                            label)
            else:
                for tweet, label in tweets_in_file:
                    if len(tweet) > 1:
                        indexes = self.__pad__(tweet)
                        yield (np.array(indexes), self.__get_output__(
                            label))
            if self.debug:
                # In debug mode only the first file is read.
                break
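The conversion from tokens to network inputs done by `__pad__` and `__get_output__` can be followed on a toy example. This is a sketch under assumptions, not the reader itself: `encode` and `one_hot` are hypothetical names, the four-word vocabulary is made up, plain lists stand in for NumPy arrays, and the pad token is assumed to be in the vocabulary, which the index lookup in `__pad__` also requires.

```python
PAD = "<pad/>"


def encode(tokens, w2i, length):
    """Drop out-of-vocabulary tokens, left-pad to `length`, map tokens to indexes."""
    known = [t for t in tokens if t in w2i][-length:]
    padded = [PAD] * (length - len(known)) + known
    return [w2i[t] for t in padded]


def one_hot(label):
    """[1, 0] for the positive class (+1), [0, 1] for the negative class (-1)."""
    return [1.0, 0.0] if label == 1 else [0.0, 1.0]


w2i = {PAD: 0, "good": 1, "movie": 2, "not": 3}  # toy vocabulary, assumed
x = encode(["not", "good", "film"], w2i, length=4)  # "film" is out of vocabulary
y = one_hot(-1)
```

Here `x` is `[0, 0, 3, 1]`: the unknown word is dropped, the two known words keep their order, and the pad index fills the left side. `y` is `[0.0, 1.0]`, the target for a negative tweet.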

C SingleLayerCNN

This appendix contains the Python implementation used to build the
convolutional neural network model for sentiment classification on short texts.
from gensim.models.word2vec import Word2Vec
from keras.models import Model
from keras.layers import (Dense, Dropout, Embedding, Flatten, Input, Merge,
                          Convolution1D, MaxPooling1D)
from types import ListType

import hyper_params
import ujson as json
from time import asctime


class SingleLayerCNN(object):
    """Single-layer CNN over pre-trained word2vec embeddings."""
    def __init__(self, seq_len, emb_dim, filter_len, feature_maps,
                 w2v_model_path):
        super(SingleLayerCNN, self).__init__()
        self.seq_len = seq_len
        self.emb_dim = emb_dim
        self.filter_len = filter_len
        self.feature_maps = feature_maps
        self.w2v = None
        self.w2v_model_path = w2v_model_path
        self.model = self.__build_cnn__(w2v_model_path)
        assert emb_dim == self.w2v.vector_size, \
            'word2vec expects vector_size=%d, actual=%d' % (
                self.w2v.vector_size, emb_dim)

    def describe_params(self, hist):
        return {
            "seq_len": self.seq_len,
            "emb_dim": self.emb_dim,
            "filter_len": self.filter_len,
            "feature_maps": self.feature_maps,
            "w2v": self.w2v_model_path,
            "hist": hist
        }

    def save_model(self, hist):
        # Save the weights and a JSON file describing the hyperparameters.
        now = asctime().replace(' ', '_')
        prefix = "cnn_{0}".format(now)
        self.model.save('../models/cnn/{0}.hd5'.format(prefix))
        with open('../models/cnn/{0}.json'.format(prefix), 'w') as f_config:
            metadata = self.describe_params(hist)
            f_config.write(json.dumps(metadata))

    def __build_conv_layer__(self):
        layer_name = "embedding_input"
        emb_input = Input(shape=(self.seq_len, self.emb_dim), dtype='int32',
                          name=layer_name)
        conv_layers = []
        assert type(self.filter_len) is ListType, \
            "filter_len=%s is not a list" % (str(self.filter_len))

        # One convolution + pooling branch per filter length.
        for filter_length in self.filter_len:
            conv = Convolution1D(
                nb_filter=self.feature_maps,
                filter_length=filter_length,
                activation='relu')(emb_input)
            conv = MaxPooling1D(self.seq_len - filter_length)(conv)
            conv = Flatten()(conv)
            conv_layers.append(conv)

        conv = Merge(mode='concat')(conv_layers) if len(conv_layers) > 1 \
            else conv_layers[0]
        conv = Model(input=emb_input, output=conv, name='conv_layer')
        return conv

    def __build_cnn__(self, w2v_model_path):
        conv = self.__build_conv_layer__()
        main_input = Input(shape=(self.seq_len,), dtype='int32',
                           name='main_input')
        self.w2v = Word2Vec.load(w2v_model_path)
        vocab_size = len(self.w2v.vocab)

        # Embedding layer initialised with the word2vec weight matrix.
        embedding = Embedding(
            input_dim=vocab_size,
            output_dim=self.emb_dim,
            input_length=self.seq_len,
            weights=[self.w2v.syn0])(main_input)

        conv = conv(embedding)
        conv = Dropout(.5)(conv)
        conv = Dense(2, activation="softmax")(conv)
        model = Model(input=main_input, output=conv)
        model.compile(loss='categorical_crossentropy', optimizer='sgd',
                      metrics=['accuracy'])

        return model

    def word2index(self):
        return {word: i for i, word in enumerate(self.w2v.index2word)}
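The core of the model, convolving filters of several widths over the embedded sentence and keeping the strongest response of each, can be traced by hand. The following is a pure-Python sketch of that idea, not the Keras code above: `conv1d_relu` and `max_over_time` are hypothetical names, the embeddings and filter weights are made-up numbers, and only one filter per width is used, where the listing uses `feature_maps` filters per width.

```python
def conv1d_relu(seq, kernel, bias=0.0):
    """Valid 1D convolution of one filter over a sequence of vectors, then ReLU.

    seq:    list of seq_len vectors, each of size emb_dim
    kernel: list of filter_length vectors, each of size emb_dim
    Returns seq_len - filter_length + 1 activations.
    """
    n, k = len(seq), len(kernel)
    out = []
    for start in range(n - k + 1):
        total = bias
        for offset in range(k):
            total += sum(w * x for w, x in zip(kernel[offset], seq[start + offset]))
        out.append(max(0.0, total))
    return out


def max_over_time(activations):
    """Reduce one feature map to its single strongest response."""
    return max(activations)


# Toy input: 5 tokens with 2-dimensional embeddings (made-up numbers).
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, -1.0]]
# Two filters of different widths, comparable to filter_len=[2, 3].
filters = [
    [[1.0, 0.0], [0.0, 1.0]],               # width 2
    [[1.0, 1.0], [1.0, 1.0], [-1.0, 0.0]],  # width 3
]
feature_maps = [conv1d_relu(seq, f) for f in filters]
features = [max_over_time(m) for m in feature_maps]  # concatenated feature vector
```

The two feature maps have 4 and 3 positions (5 - 2 + 1 and 5 - 3 + 1), and the pooled feature vector has one entry per filter regardless of filter width, which is what allows branches of different widths to be concatenated before the softmax layer. The sketch pools over every position of a feature map; the listing passes `seq_len - filter_length` as the pooling window, so the two may differ at the last position.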
