
Tweets sentiment and their impact on stock market movements

Author: Abbes, Hakim
Promoter(s): Ittoo, Ashwin
Faculty: HEC-Ecole de gestion de l'ULg
Degree: Master in Business Engineering, with a specialized focus in Supply Chain Management and Business Analytics
Academic year: 2015-2016
URI/URL: http://hdl.handle.net/2268.2/1323

Notice to users:

All documents made available in open access on the MatheO site are protected by copyright. In accordance with the principles stated by the "Budapest Open Access Initiative" (BOAI, 2002), users of the site may read, download, copy, transmit, print, search, or link to the full text of these documents, crawl them for indexing, use them as data for software, or use them for any other lawful purpose (or any purpose provided for by copyright regulations). Any commercial use of the document is strictly prohibited.

Users also undertake to respect the author's moral rights, in particular the right to the integrity of the work and the right of attribution, in any use they make of it. For example, when reproducing a document in part or in full, users must cite the sources completely as indicated above. Any use not explicitly authorized above (such as, for example, modifying or summarizing the document) requires the prior and express permission of the authors or their successors in title.
TWEETS SENTIMENT AND THEIR IMPACT
ON STOCK MARKET MOVEMENTS

Jury:
Promoter: Ashwin ITTOO
Readers: Michael SCHYNS, Anne-Sophie HOFFAIT

Dissertation by Hakim ABBES
with a view to obtaining a Master degree in Business Engineering
specialized in Supply Chain Management and Business Analytics

Academic year 2015/2016
Acknowledgements

I would like to thank my thesis committee members and especially Dr. Ashwin Ittoo
for his availability, his advice and his ideas, which have been an invaluable resource during the
entirety of the research process.

I am very thankful to Mr. Alessandro Beretta for his expertise in statistical and
financial subjects and for the fruitful discussions we had.

I would also like to express my gratitude to my fellow students with whom I could
exchange ideas and opinions; they spurred my motivation and my creativity.

Finally, I would like to thank my family and relatives who have been supporting me
since the beginning and helped me in numerous ways for this thesis.
Table of contents
1. Introduction .......................................................................................................................... 3

1.1 Structure ............................................................................................................................ 4


1.2 Contributions .................................................................................................................... 5
2 Literature Review .................................................................................................................. 7
2.1 Market Covered ................................................................................................................ 7
2.2 Variables ........................................................................................................................... 7
2.2.1 Dependent Variables .................................................................................................. 7
2.2.2 Independent Variable ................................................................................................. 9
2.3 Machine learning ............................................................................................................ 11
2.3.1 Unsupervised learning ............................................................................................. 11
2.3.2 Supervised learning.................................................................................................. 12
2.4 Classification .................................................................................................................. 12
2.4.1 Support Vector Machine .......................................................................................... 14
2.4.2 Naive Bayes ............................................................................................................. 15
2.5 Correlation and prediction .............................................................................................. 16
2.5.1 Granger causality analysis ....................................................................................... 16
2.5.2 Vector Autoregression ............................................................................. 17
2.5.3 Self-Organizing Fuzzy Neural Network (SOFNN) .................................. 17
2.6 Forecasting performance evaluation .............................................................................. 18
2.7 Limitations ...................................................................................................................... 18
3 Methodology ........................................................................................................................ 21
3.1 Approach......................................................................................................................... 21
3.2 Data gathering ................................................................................................................. 22
3.2.1 Stock data ................................................................................................................. 22
3.2.2 Twitter data .............................................................................................................. 23
3.3 Sentiment analysis .......................................................................................................... 24
3.3.1 Tweets cleaning ...................................................................................................... 25
3.3.2 Tweets classification ................................................................................................ 25
3.4 Data normalization.......................................................................................................... 29
3.5 Time series ...................................................................................................................... 30
3.6 Modeling (FTSE100 index level) ................................................................................... 31
3.6.1 Plot visualization ..................................................................................................... 31

3.6.2 Linear regression ...................................................................................................... 33
3.6.3 Granger causality analysis ....................................................................................... 38
3.6.4 Logistic regression ................................................................................................... 39
3.7 Results (FTSE100 index level) ....................................................................................... 42
3.8 Modeling (company stock level) .................................................................................... 45
3.8.1 Plot visualization ..................................................................................... 46
3.8.2 Linear regression ...................................................................................... 48
3.8.3 Granger causality analysis ....................................................................... 51
3.8.4 Logistic regression ................................................................................... 51
3.9 Results (company stock level) ........................................................................ 52
4 Conclusion and future work ............................................................................................... 54

Table of appendices
Appendices ................................................................................................................................. I
Terms and acronyms .............................................................................................................. IV
References ................................................................................................................................ V

1. Introduction

In the past decade, Twitter1 has experienced tremendous growth worldwide in line with the success of social networks. It is no surprise that many researchers started to investigate ways of using this microblogging platform to find new applications in all kinds of domains. Online sentiment tracking has notably become very popular in social and psychological analysis, but the range of applications is actually wider. In particular, some authors took an interest in the possibility of predicting financial markets.

To determine a tweet's sentiment, one has to rely on text classification techniques. This field has expanded considerably over the last years as the number of documents available to companies has multiplied. Their need to understand their customers has grown tenfold, so as to provide products and services that meet and induce demand. Many machine learning techniques can be used to assign classes to text documents. Among them, the most famous are Support Vector Machine (SVM) and Naive Bayes classification. Unlike the clustering used for unsupervised learning, these techniques allow for classification by training a computer on annotated datasets so that texts can be automatically labeled.

Text mining involves a corpus of documents. In this context, authors mainly focus on
tweets because of their characteristics: they are short and represent live situations better. On
top of that, it is easier to crawl them than documents from other social networks. An online
Application Programming Interface (API) is made available for any developer willing to use
Twitter data. Moreover, tweets usually contain hashtags and symbols that facilitate the search
for relevant posts. Once a significant sample has been collected, authors apply different machine learning techniques, which they can validate through procedures like cross-validation and test with accuracy metrics such as the F1-measure. Even though no classifier is perfect, they currently achieve satisfactory results, allowing for reliable further analysis.

The other data of interest are of course stock data, which are widely available online. However, authors nuance their analyses by selecting different levels in their methodology. The
most usual one is taking an index composed of several company stocks that can also be

1 http://twitter.com

predicted individually. Besides, the market can also be broken down into sectors like agriculture or telecommunications, as the related tweets are more focused than tweets about the whole index. The type of data can vary as well: depending on the regression performed in the next step, it is sometimes better to try to predict closing prices, returns, or solely whether the market goes up or down.

The goal of tweets classification is to use them as a time series supposedly correlated
to another time series from stock data. The time span can vary but authors usually consider
one to six months of data. The function linking both time series is unknown, but it is possible
to approach it by applying regression techniques. A very common tool is Granger causality analysis, which helps identify whether one time series has predictive information about the other or not. Other techniques include Vector Autoregression (VAR) and neural networks. After estimating the parameters, it is possible to validate and test the model against previously annotated data.

The literature on these matters is still developing at the moment but many articles
already addressed the problem of using sentiment analysis to predict financial markets.
However, these have some important limitations.

Firstly, they focus on American stock markets, which are highly volatile in nature, while this is less true for other geographical areas. The US is also the most represented country on Twitter; the platform might not be as useful elsewhere. Secondly, the time span considered is
usually quite short because of limitations in Twitter's API. Gathering historical data for longer
than a few months can be seen as challenging and this leads to the question of generalizing
results for a longer period. Thirdly, authors make many assumptions about the time series
which are not usually tested. They dive into advanced regression and machine learning
techniques without systematic formal tests of the hypotheses.

1.1 Structure

The report is structured in three main parts. We start by reviewing the literature on the
subject. In this section, we identify the market covered and the different exogenous and

endogenous variables used by the authors, briefly describing the most important techniques
used in both text classification and prediction. We also mention the limitations identified.

The next part concerns our methodology and implementation where we start by
describing the approach chosen. This part can be cut in two main steps: text classification and
prediction. After explaining how we collect and clean both Twitter and stock data, we develop
how we implement a rule-based approach to annotate tweets with a sentiment score and how
we build relevant time series. The prediction sections give details about the linear and logistic models and the different tests performed, with the help of graphs. Finally, we present the results for our models and compare them using accuracy metrics.

As an additional experiment, we apply the same methodology to a different dataset which considers only an individual company stock. The conclusion summarizes the results
and the problems encountered. It eventually proposes prospects for improvements and ideas
for future work.

1.2 Contributions

In this work, the problem of predicting financial markets is addressed for a different segment. Instead of focusing on American indices, we bring the analysis to European markets by examining the Financial Times Stock Exchange 100 Index (FTSE100), which is based in London. It is a first expansion of former studies, to try and see whether results found on American markets can be generalized to a market of another origin. Obviously, it is only a first step
before being able to validate a model or a methodology globally.

Then, we also consider a longer time span than former studies to obtain more significant results. We actually collect one whole year of tweets and stock data. This also allows for more flexibility in terms of validating the results, as we have more data points for both training and testing a model. Another advantage is that we consider all seasonal trends and diminish the impact of unexpected events. We get historical data with crawling techniques that bypass Twitter's API limitations. Indeed, with the API one can only get one-week-old tweets, implying the researcher has to wait at least three weeks to build one month of data.

In this work, statistical tests are performed to check the assumptions behind the regressions. We check the normality of each of our time series with well-known tests such as Shapiro-Wilk and Kolmogorov-Smirnov. We also draw many graphs, including normal quantile plots, that allow us to ascertain or nuance the results of the normality tests. Homoscedasticity is also checked with a Breusch-Pagan test. These assumptions are key in linear regression and other kinds of regressions based on linear dependencies, but they are usually assumed to hold so that heuristic models and machine learning techniques can be applied directly.

We then propose a new prediction model. Despite being a good model choice for answering dichotomous questions such as whether the market will go up or down, logistic regression is never used in the literature reviewed. We describe the reasons for trying this kind of model instead of linear regression. One of them is the lower number of assumptions on the data. Even if it is usually assumed, it is hard to believe that time series like closing prices or Twitter sentiment always follow a normal distribution. Therefore, we should expect better results with models where normality is not required.

Our last contribution is that we include specific accuracy metrics for evaluating our models. Most authors just compute the plain accuracy and lose information by reporting a single percentage. With the F1-measure we implement, we are able to tell whether a model interprets the inputs in an unexpected or unwanted way, and possibly fix its parameters later to improve it.

2. Literature review

For almost ten years, microblogging has been spreading through the web, and many researchers have taken an interest in this way of communicating. Because of its capacity to transmit ideas across people, one study identifies it as online word-of-mouth branding (Chowdury, Jansen, Sobel & Zhang, 2009). Twitter has very often been considered the most straightforward choice by researchers for sentiment analysis and opinion mining on microblogging data. Indeed, it provides a huge volume of opinionated data on a very broad range of subjects and has a free API for crawling tweets and users.

The first quantitative study on Twitter and its related information diffusion aimed at
stock prediction dates back to 2010 with authors such as Kwak, Lee, Moon and Park. Since
then, articles have multiplied and some authors have summarized and grouped disjointed
research (Fuehres, Gloor & Zhang, 2011; Brown, 2012). In recent years, many companies have found applications for global social media analysis, but the research is still focused on
particular markets and conditions. In this paper, we focus on authors using opinion mining
and sentiment analysis for stock prediction applications or correlation analysis.

2.1 Markets covered

Most authors choose to take American indices as their market of interest. The most
common are the Standard & Poor 100 (S&P100) and Standard & Poor 500 (S&P500) which
can be used to get relevant time series to predict with data mined from microblogs (Bar-Haim,
Dinur, Feldman, Fresko & Goldstein, 2011). Others like Bollen, Counts and Mao (2011) opt
for the Dow Jones Industrial Average (DJIA). Previous articles also considered many individual company stocks in their analyses (Rao & Srivastava, 2012). Again, all the stocks considered are from American firms such as Apple or Google.

2.2 Variables
2.2.1 Dependent Variables

Stock data have several interesting values to investigate. The variables selected in
previous works reflect the diversity of possible applications. In most cases, several dependent

variables are observed, such as the stock's closing price or its adjusted closing price, which is the closing price amended to include dividends and other distributions that occurred before. Some
authors tried to use different market levels, considering either individual companies or sectors
(Liu, Mao, Wang & Wei, 2012).

The adjusted price is thus the raw price corrected for the aforementioned dividends and distributions. Among others, Sandner, Sprenger, Tumasjan and Welpe (2010) also have a look
at traded volumes, which are more likely to grow with the number of tweets mentioning the
stock. Then, some authors try to predict returns and volatility.

Returns are actually natural-logarithm returns R of the stock price S(t) over a time interval of one day. This transformation has several advantages, such as normalizing the variations.

$$R_t = \ln S(t) - \ln S(t-1)$$

Volatility is then calculated over a defined period of time, usually ranging from 10 to 50 days. It takes the standard deviation (sd) of the last n days, weighted by the square root of the total number of days N in the time series. For each data point $x_i$ and the average $\bar{x}$, the standard deviation is computed as follows.

$$sd = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

Then, volatility at time t is obtained as below for the whole time series at the researcher's disposal.

$$\sigma_t = sd_t \cdot \sqrt{N}$$

The meaning behind volatility is the uncertainty or risk about the scale of change in a
stock's value. A high volatility refers to the likelihood for this stock to change significantly in
a short period of time in either direction.
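To make these computations concrete, here is a minimal R sketch of the two formulas above; the input vector close, the 20-day window and the use of R's sample standard deviation are illustrative assumptions rather than choices made by the authors cited.

    returns <- diff(log(close))             # R_t = ln S(t) - ln S(t-1)

    n <- 20                                 # rolling window length (10 to 50 days)
    N <- length(returns)                    # total number of days in the series
    volatility <- sapply(n:N, function(t) {
      sd(returns[(t - n + 1):t]) * sqrt(N)  # sd of the last n days times sqrt(N)
    })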

Incidentally, the binary variable representing the movements of the stock is rarely compared with Twitter data in articles. Earlier studies strictly focused on establishing a relationship between different time series, but prediction models are harder to set up for non-binary variables as they can take a very wide range of values. Krauss, Nann and Schoder
(2013) address these issues by describing returns on investment based on transactions
predicted by several sources. However, they mention several limitations which prove that the
application to prediction is still experimental in this field.

Indeed, stock data is influenced by information which can be retrieved from different
sources and interpreted in various ways by both the market and the researcher.

2.2.2 Independent variables

If the predominance of Twitter can be explained by its research-friendly attributes,


some authors still mention other web sources as their text data (Krauss & Nann, 2013; Bollen, Counts & Mao, 2011). However, using microblogs and regular finance news together raises questions about the difference in prediction power between the sources. The texts are put in relation with subsequent stock data. Many authors find it reasonable to consider a delay of one day between Twitter and stock data, but Sandner, Sprenger, Tumasjan and Welpe (2014) also tried a longer delay of two days.

From a methodological standpoint, studies involving Twitter data usually begin by omitting classification through opinion or sentiment and only count the number of tweets. Liu, Mao, Wang and Wei (2012) investigated the daily number of tweets related to the S&P500 stocks. The advantage of such a choice is that it simplifies operations by removing the machine learning or manual annotation step for the different tweets at hand. Message volume
is obtained after taking the natural logarithm of the number of tweets. A slight variant of the
message volume is the number of users that tweeted and is used notably by Castillo, Gionis,
Hristidis, Jaimes and Ruiz (2012).

Nevertheless, such variables are a bit too simple and are not expected to yield robust results for stock movement prediction. Moreover, they are not suited for establishing a relation with bullish or bearish markets, as they do not provide any opinionated information. Therefore, authors tend to implement specific public mood states in their correlation analyses. The approach can differ in diverse ways but the basic principle remains the same: the point is to use an indicator which depicts reality with a certain precision.

The most classical time series in this regard is the overall sentiment of tweets. It consists of point data with two or three classes to which tweets are assigned. For instance, a tweet can be positive, negative or possibly neutral regarding the stock studied. Bollen, Mao and Zeng (2011) even enrich this simple reduction of the human mood structure by adding other mood dimensions2. Another article by Bollen, Mao and Pepe (2011) tries other mood states3. But deriving bullishness or bearishness from overall sentiment might be too simplistic for some authors.

Actually, while positive or negative sentiment is linked with past or present transactions, bullishness and bearishness represent beliefs about the future (Bar-Haim, Dinur, Feldman, Fresko & Goldstein, 2011). This characteristic is far more elusive to determine because the significance of tweets relating to the future can change tremendously, but if tweets can be assigned correctly, this method gives a more realistic view of the relevant variables. Rao and Srivastava (2012) define bullishness $B_t$ at time t with the following equation, which continues a previous work by Antweiler and Frank (2004). $M_t^{pos}$ is the number of positive tweets on day t and $M_t^{neg}$ the number of negative tweets.

$$B_t = \ln\left(\frac{1 + M_t^{pos}}{1 + M_t^{neg}}\right)$$

Another interesting indicator, which they also consider, is the level of agreement $A_t$ among positive and negative tweets. It takes the value 1 when the tweets are either all bullish or all bearish for a time unit t and is computed as follows.

$$A_t = 1 - \sqrt{1 - \left(\frac{M_t^{pos} - M_t^{neg}}{M_t^{pos} + M_t^{neg}}\right)^2}$$
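Both indicators are straightforward to compute once the daily counts exist. In the R sketch below, m_pos and m_neg are assumed numeric vectors of daily positive and negative tweet counts; the names are illustrative.

    bullishness <- log((1 + m_pos) / (1 + m_neg))
    agreement   <- 1 - sqrt(1 - ((m_pos - m_neg) / (m_pos + m_neg))^2)
    # note: agreement is undefined (NaN) on days with no classified tweets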

Finally, Bar-Haim, Dinur, Feldman, Fresko and Goldstein (2011) also speak of the influence of the user's expertise. They assume that similar information carries a different weight depending on the emitter's expertise in the financial field.

2 Calm, Alert, Sure, Vital, Kind, Happy.
3 Tension, Depression, Anger, Vigor, Fatigue, Confusion.

A different approach is to measure properties of an induced interaction graph such as
the number of connected components or statistics on the degree distribution (Castillo, Gionis,
Hristidis, Jaimes & Ruiz, 2012).

Figure 2.1 - Graph Schema4.

2.3 Machine Learning

One of the main issues when trying to infer a function useful for prediction is to build classes from huge text datasets. A human can relatively easily label a tweet as positive or negative without much uncertainty; it is a far more complex task for a computer. However, machine learning techniques aim to solve problems too intricate for classical algorithms by proposing automatic learning from raw data. This approach is dynamic, in contrast to static programming. Here we list and briefly describe the main techniques that authors use to assign Twitter data to different types of classes.

2.3.1 Unsupervised learning

This method is a machine learning task on unlabeled data and hence is harder to evaluate. Since the desired output and the effect of the variables are unknown, the data is clustered into classes based on relationships among the data points in order to derive a structure. In Bar-Haim, Dinur, Feldman, Fresko and Goldstein (2011), tweets are labeled with the stock movements thanks to an SVM package from Joachims (1999), namely SVM-light. In unsupervised learning, the output results in latent variables which may or may not be the target variables: it only exposes similar behaviors.

4 Source: Castillo, Gionis, Hristidis, Jaimes & Ruiz, 2012.

2.3.2 Supervised learning

With this other machine learning task, the computer now has annotated data at its disposal. The goal is thus to establish rules and correlations between inputs and outputs. One of the assumptions of such a technique lies in the existence of a model linking input x and output y. The program's task is to find the best parameters $\phi$ for this model.

$$y = f(x \mid \phi)$$

Unlike unsupervised learning, which usually amounts to clustering, supervised learning can take many forms, but only a few are usually considered for text mining. Therefore, in this section we solely present the classification techniques used in previous works in our field of interest.

2.4 Classification

In a general framework, classification is about giving a text a label or a discrete value thanks to a discriminating function built on data already classified manually or coming from past experience. This function is then used to predict the class of new data. It works when we try to predict a discrete value for our data points. An example of binary classification is found in the banking sector, where it can tell whether an individual can receive a loan or not, but classification can also predict multiple classes to answer different kinds of problems.

In text classification, the principle described by Sebastiani (2002) is to assign a Boolean value to each pair $\langle d_j, c_i \rangle \in D \times C$ where D is a domain of documents and C is a set of predefined categories or classes. The task of the classifier is to find the function that best approximates the actual function giving a true or false value to each pair $\langle d_j, c_i \rangle$, meaning that the document is assigned to the class or not. A classifier is thus a rule or a model which approaches the true unknown function. The coincidence between the two functions is called effectiveness and can be measured through several metrics such as accuracy or the F1-measure.

Conversion to vectors and feature identification

To be able to apply such a rule, a corpus of text data has to be converted into vectors. To achieve this, authors usually make the assumption of a bag-of-words approach where each word is considered independent of the others present in the document. For instance, Grčar, Lavrač, Smailović and Žnidaršič (2013) avoid using part-of-speech taggers as they are not useful for further analysis with text classifiers like SVM (Lee, Pang & Vaithyanathan, 2002). In Chalothorn and Ellman (2014), each word is given a term frequency-inverse document frequency (tf-idf) score, which is the relative frequency of a term in a set of documents. This score weighs the relative importance of a word in a document from the collection (Manning, Raghavan & Schütze, 2008).

$$tf\text{-}idf_{t,d} = tf_{t,d} \times idf_t$$

$$tf_{t,d} = \frac{n_{t,d}}{\sum_k n_{k,d}}$$

where $n_{t,d}$ is the number of occurrences of term t in document d, and $\sum_k n_{k,d}$ is the total number of terms k in document d.

$$idf_t = \log \frac{N}{df_t}$$

where N is the total number of documents and $df_t$ is the number of documents in which term t appears.
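As an illustration, the following R sketch computes these scores on a toy corpus; the three example documents are invented, and a real pipeline would work on the cleaned tweet texts.

    docs  <- c("ftse rises on oil", "oil falls", "ftse falls on weak oil")
    words <- lapply(strsplit(docs, " "), table)       # term counts per document
    vocab <- unique(unlist(lapply(words, names)))

    tf <- t(sapply(words, function(w) {
      counts <- as.numeric(w[vocab])                  # n_{t,d}, NA when absent
      counts[is.na(counts)] <- 0
      counts / sum(counts)                            # tf_{t,d}
    }))
    idf   <- log(length(docs) / colSums(tf > 0))      # idf_t = log(N / df_t)
    tfidf <- sweep(tf, 2, idf, `*`)                   # tf-idf_{t,d}
    colnames(tfidf) <- vocab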

However, the matrix obtained can be huge with text data. To avoid the curse of dimensionality and overfitting, a feature selection step is performed, keeping only the best tf-idf scores. Other metrics can be used and have been compared in earlier studies, such as chi-squared, mutual information or information gain (Forman, 2003). Feature extraction techniques such as latent semantic indexing are used in Bollen, Mao and Zeng (2011). The difference between selection and extraction is that in the former, a set of features is selected without transformation, while in the latter the features are transformed to fit in a lower-dimensional space. Both can be used, but transforming data is highly subjective in nature and feature selection tends to be preferred for text classification.

Natural language processing techniques such as lemmatization, removing sparse
entries and stop words can be used to get simpler vectors on which classifiers can be applied
(Rao & Srivastava, 2012; Chalothorn & Ellman, 2014; Bollen, Mao & Zeng, 2011).

2.4.1 Support Vector Machine

In a binary classification problem, the main idea behind SVMs is to find the hyperplane which maximizes the margins:

$$w \cdot x + b = 0$$

The margins are the distances between this hyperplane and the closest points from each class. Figure 2.2 shows how a linear kernel can represent non-linear functions. Even with thousands of data points, only the closest points are used to determine the optimal hyperplane; these points form the support vectors. New data points are assigned to a class based on their position relative to the continuous line.

Figure 2.2 - Optimal linear separator for classification5.

SVMs are among the most used classifiers for text classification. Joachims (1998) gives four arguments in that regard. First, text data have a high-dimensional input space, and SVMs can handle this dimensionality because their overfitting protections do not depend on the number of features, as only a few data points compose the support vectors. Secondly, even though

5 Source: http://tuicool.com

feature selection techniques can reduce the dimensionality, it has been shown that even relatively bad features still hold a lot of relevance. Thirdly, document vectors are sparse, and SVMs are well suited to this kind of problem, with sparse instances and dense concepts with many features (Kivinen & Warmuth, 1995). The fourth and last argument states that most text categorization problems are linearly separable, and linear SVMs' idea is precisely to find linear separators.
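For illustration, a linear SVM can be trained on a tf-idf matrix in a few lines of R. The sketch below uses the e1071 package (an interface to libsvm) rather than the SVM-light package cited above; the tfidf matrix from the previous sketch and a factor of manual labels are assumed to exist.

    library(e1071)

    model     <- svm(x = tfidf, y = labels, kernel = "linear")  # linear kernel
    predicted <- predict(model, tfidf)                          # class of each document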

Even though SVMs with non-linear kernels exist, in Twitter investigations authors mainly choose linear ones for the sake of speed and their overall better suitability to text classification. It is the case in Grčar, Lavrač, Smailović and Žnidaršič (2013), where the authors construct 1,254,163 features used for classification training. Other works mention the utilization of linear SVM packages with good results (Bar-Haim, Dinur, Feldman, Fresko & Goldstein, 2011; Chalothorn & Ellman, 2014).

2.4.2 Naive Bayes

Another very popular text classification technique is the probabilistic classifier Naive Bayes. It works under two strong assumptions: firstly, conditional independence, which says that the terms are independent of each other given the class; secondly, positional independence, which is the same as considering a bag-of-words model. Despite these naive assumptions, Naive Bayes has proved to be very efficient in both theoretical and empirical studies (Zhang, 2004), while being quite easy to interpret.

We only consider multinomial Naive Bayes (mNB), in contrast to Bernoulli Naive Bayes, because the former takes the term frequency in a document into account. The frequency can be crucial information in this domain, especially under the bag-of-words assumption where the order of the words holds no significance.

Consequently, the probability of a document or tweet d being assigned to a class c is:

$$P(c \mid d) \propto P(c) \prod_{1 \le k \le n_d} P(t_k \mid c)$$

$P(t_k \mid c)$ is the probability of the term at position k in d occurring in documents of class c. $P(c)$ is the probability of a document being of class c based on the training data. $n_d$ is the number of terms in d. The class assigned is thus the one which yields the highest probability under the same estimated parameters. Accordingly, we call it the maximum a posteriori (map) class:

$$c_{map} = \underset{c \in \mathbb{C}}{\arg\max} \, \frac{P(d \mid c) \, P(c)}{P(d)}$$

The fraction is simply $P(c \mid d)$ rewritten with Bayes' rule. The denominator can even be removed because it remains the same for all classes. It is worth keeping in mind that the probabilities considered are not the exact true probabilities, as the parameters used are estimated from training data. However, this point is negligible when looking for the best class.
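In code, finding the map class boils down to summing logged estimates. The R sketch below hand-rolls mNB scoring with add-one smoothing (our assumption, to avoid zero probabilities); counts (a documents-by-vocabulary term-count matrix) and classes (a factor of training labels) are assumed given.

    log_prior <- log(table(classes) / length(classes))          # log P(c)

    log_condprob <- sapply(levels(classes), function(cl) {
      tc <- colSums(counts[classes == cl, , drop = FALSE]) + 1  # add-one smoothing
      log(tc / sum(tc))                                         # log P(t | c)
    })

    classify <- function(doc_counts) {            # doc_counts: term counts of one document
      scores <- log_prior + colSums(doc_counts * log_condprob)
      names(which.max(scores))                    # the map class
    }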

Rao and Srivastava (2012) state that, as an online classifier, Naive Bayes works better than other methods in the context of tweets, even with huge datasets. An accuracy of 82.7% was reached on a dataset of 1,600,000 tweets. Nonetheless, some other works go in the other direction (Chalothorn & Ellman, 2014). Krauss, Nann and Schoder (2013) used mNB with an adapted bag-of-words model, implementing part-of-speech tagging to find negations and keyword-based spam filtering, and stress the importance of adapting the classifier to the data analyzed and its context. Moreover, Pak and Paroubek (2010) used it in a similar way with part-of-speech and N-gram tags as features.

2.5 Correlation and prediction


2.5.1 Granger causality analysis

The next step is to establish correlation and to try prediction methods. A popular method to search for correlation is the econometric technique of Granger causality analysis, which is based on linear regression. Its main principle lies in the assumption that if a variable X causes Y, then X will undergo changes before anything occurs in Y. Lagged values of X should then be significantly correlated with Y. However, despite its name, Granger causality analysis only exhibits correlation, which does not automatically mean causation. Nevertheless, it remains a great tool to indicate whether one time series has predictive information about the other.

Authors draw tables with statistical significance values (p-values) for different lags ranging from 1 to 6 or 7 days and for different time series. For instance, they test it only on positive tweets or on tweets labeled with a mood dimension such as Calm6 (Bollen, Mao & Zeng, 2011; Rao & Srivastava, 2012; Grčar, Lavrač, Smailović & Žnidaršič, 2013).

In statistical hypothesis testing, the p-value allows one to decide whether or not to reject a null hypothesis. Here the null hypothesis is that there is no correlation between the lagged time series, and it is rejected when the p-value is under the significance level (usually 5%). Straightforwardly, the lowest p-values indicate the time series with the most predictive information for the stock studied.
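Such a table of p-values can be produced with the lmtest package in R, as sketched below; x and y stand for aligned daily sentiment and stock series (assumed vectors), and the 1-to-7 lag range mirrors the tables described above.

    library(lmtest)

    for (lag in 1:7) {
      test <- grangertest(y ~ x, order = lag)   # H0: x does not Granger-cause y
      cat("lag", lag, "p-value:", test$`Pr(>F)`[2], "\n")
    }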

2.5.2 Vector Autoregression

In 2013, Deng et al. apply VAR to build a prediction model. This model assumes there is an interdependence between two time series $\{x_t\}$ and $\{y_t\}$. The goal is to fit parameters $\{\alpha\}$ to predict both of them. The formulas for a one-day-ahead prediction are written as follows:

$$x_t = \alpha_{11} x_{t-1} + \alpha_{12} y_{t-1} + \epsilon_{x,t}$$
$$y_t = \alpha_{21} x_{t-1} + \alpha_{22} y_{t-1} + \epsilon_{y,t}$$

Here, the $\{\epsilon\}$ are white noise terms. The training data move with time instead of training on one static period and predicting a disjoint one: the training is made on a sliding time window and the prediction is made for the day following this window. This scheme takes into consideration the dynamic and random nature of both time series and allows for more training and testing points than static training.
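A sketch of this sliding-window scheme using the vars package follows; series is an assumed two-column matrix with columns named "x" and "y", and the 30-day window length is an illustrative choice.

    library(vars)

    window <- 30
    preds <- sapply(window:(nrow(series) - 1), function(t) {
      fit <- VAR(series[(t - window + 1):t, ], p = 1)   # fit on the sliding window
      predict(fit, n.ahead = 1)$fcst$y[1, "fcst"]       # one-day-ahead forecast of y
    })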

2.5.3 Self-Organizing Fuzzy Neural Network (SOFNN)

Neural networks are often compared to black boxes because it is very complex to understand their inner structure. But this problem can be avoided by combining them with fuzzy rules, which are simple to interpret. A fuzzy neural network finds the parameters of a fuzzy system thanks to neural network techniques. Finally, the rules are created by a clustering algorithm using a reward and penalty mechanism to adapt the system to each new dataset fed to it. Hence, it can be initialized with no prior knowledge of the input data distribution (Gao, Huang, Lin & Song, 2007).

6 See section 2.2.2, page 9.

Even though the previous models seem usable for predicting stock markets with Twitter data inputs, they are based on linear relationships. Yet the relation between mood state and stock values is probably non-linear. Therefore, Bollen, Mao and Zeng (2011) propose to use fuzzy neural networks to be able to fit non-linear links. The choice of SOFNN is supported by many arguments. First, it combines the learning ability of neural networks and the simple interpretability of fuzzy systems. It is also a finely tuned algorithm for regressions and time series analysis. In practice, SOFNNs do seem to work efficiently with the lagged public mood inputs and the stock index.

2.6 Forecasting performance evaluation

To evaluate the performance of their prediction tools, authors compute different metrics. Deng et al. (2013) plainly compare the accuracies of stock movement predictions from their different VAR models. Many authors also opt for a simple accuracy measure (Krauss, Nann & Schoder, 2013; Liu, Mao, Wang & Wei, 2012; Rao & Srivastava, 2012), while others use it in combination with the mean absolute percentage error
(MAPE) (Mao, Counts & Bollen, 2013):
$$MAPE = \frac{1}{n} \sum_{t} \left| \frac{y_t - \hat{y}_t}{y_t} \right| \times 100$$

where $\hat{y}_t$ is the predicted value of the stock and $y_t$ is the actual one.

Rao and Srivastava (2012) show a confusion matrix but still use the directional accuracy in their results.
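Plain accuracy and the F1-measure can both be derived from the cells of a confusion matrix, as the short R sketch below shows; actual and predicted are assumed binary vectors (1 for up, 0 for down).

    tp <- sum(predicted == 1 & actual == 1)   # true positives
    fp <- sum(predicted == 1 & actual == 0)   # false positives
    fn <- sum(predicted == 0 & actual == 1)   # false negatives

    accuracy  <- mean(predicted == actual)
    precision <- tp / (tp + fp)
    recall    <- tp / (tp + fn)
    f1        <- 2 * precision * recall / (precision + recall)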

2.7 Limitations

The field of exploiting microblogging for stock prediction is still quite recent and experimental. As a result, we identified some weaknesses among the various articles treating the subject.

To start, all of the studies reviewed are focused on American stock markets. These
stocks are highly volatile in nature and it is easier to crawl relevant tweets for them as they
can be affected by many actors across the world. This might not hold true for stocks of another origin, especially since there are far more U.S. users than users from the rest of the world. American users are expected to be more active as well. In other words, the representation of the public mood might be far less accurate for other geographical areas such as Europe.

Another limitation of some articles is the short time span. The longest period considered is 9 months in Grčar, Lavrač, Smailović and Žnidaršič (2013), but most authors limit their analysis to no more than 3 months of data. However, some results of a Granger causality analysis with individual companies' stocks are indeed better with a longer time span. This could indicate that results obtained with shorter periods are actually pessimistic.

The next limitation concerns the selection of the time span. The correlation between stock data and Twitter data is clearer when the stocks are subject to strong variations (Grčar, Lavrač, Smailović & Žnidaršič, 2013). It is thus tedious to use predictions when the market is not subject to changes. To use tweets relatively safely as an input for stock movement prediction, the investor must be sure that the stocks will move dramatically enough. This raises the question of the significance of such predictions.

Finally, there is no systematic testing of the correlation between variables, and the accuracy measures can be too simplistic. The subject is very experimental in nature, and authors usually just check the accuracy of their models without formally validating their assumptions about the data. Plain accuracy can be misleading, as a bad model can still provide good results in some cases. As a matter of fact, this can lead to biased predictions. Moreover, a model applied to a particular dataset might give good results but be inefficient in other conditions, as variables such as public mood or stock movements are derived from very complex phenomena that are difficult to translate into mathematical models.

3. Methodology
3.1 Approach

This paper is about identifying tweet sentiment and analyzing whether or not it has an impact on stock markets. Many studies already point out that it is possible to predict American stocks' closing prices with such inputs, but these largely overlook European markets. Therefore, we try to generalize the results by analyzing the FTSE100. This index is composed of the hundred companies listed on the London Stock Exchange (LSE) with the highest market capitalization.

Beyond the fact that it is European, other arguments make us choose this index over other stock indices, from France or Italy for instance. Firstly, the second most represented country on Twitter after the
U.S. is the United Kingdom. Even though we want to ensure that predictions are still
reasonable with a lower feed of tweets, it still makes sense to secure a relatively large corpus
to be able to apply relevant techniques.

Obviously, many other European users can also add their opinions in English. Taking
them into account in the sentiment analysis seems legitimate as the FTSE100 companies are
affected by international stakeholders. Another important consideration is the availability of
sentiment lexicons online. As expected, the most detailed resources are provided in English and have the advantage of being cited in many works.

Then, we collect tweets and stock data for the whole of 2015. By doing so, we are able to cover each trend over the year and avoid any possible bias due to seasonal volatility. A long period also allows for validating the results with a higher degree of confidence.

The implementation follows the general framework schematized in Figure 3.1. We


study a daily time series of tweets in parallel to a daily stock market time series. After
identifying the overall sentiment for each day based on tweets' content, we can try to fit a
model to search for correlation. We put emphasis on assumption checking, as most studies overlook this step and simply accept the hypotheses. However, we still perform regressions before checking the assumptions, because we use the checks as a way to validate positive results obtained with a certain model rather than abandoning a model that could still provide heuristically good predictions.

Figure 3.1 - Approach for using Twitter data in stock movement prediction.

3.2 Data gathering


3.2.1 Stock data

First, we download a Microsoft Excel spreadsheet of various stock market data for the FTSE100 index from Yahoo! Finance7 for the whole of 2015. This results in 253 occurrences, corresponding to the number of days when the London stock market is open. In this paper, we mostly consider the movements of the FTSE100 daily closing price as our variable of interest. This translates into a binary variable which takes the value 1 when the price goes up and 0 when it goes down.

Occasionally, we also need the FTSE100 log returns, which are computed by taking the natural logarithm of the ratio between the daily closing prices of the current and previous day.

$$LogReturn_t = \ln\left(\frac{ClosePrice_t}{ClosePrice_{t-1}}\right)$$

7 http://finance.yahoo.com

The log returns are used to calculate volatility, which is the degree of variation measured by their standard deviation over a certain period of time. One argument for using this kind of formula rather than arithmetic returns is that it works better for modeling the stock market. Indeed, if we assume that the closing prices are distributed log-normally, we cannot assume the same for the arithmetic returns. Logarithms preserve the normality properties of the time series and allow safely computing statistics such as compound returns.
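A minimal R sketch of this preparation step is given below; the file name and the column names (Date, Close, Adj.Close) are assumptions about the exported spreadsheet and may differ in practice.

    stock <- read.csv("ftse100_2015.csv")
    stock <- stock[order(as.Date(stock$Date)), ]       # oldest day first

    log_return <- diff(log(stock$Adj.Close))           # ln(ClosePrice_t / ClosePrice_{t-1})
    movement   <- as.integer(diff(stock$Close) > 0)    # 1 when the closing price goes up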

3.2.2 Twitter data

Then, we need to get tweets which are likely to have an impact on the FTSE100 index. On Twitter, stock-related tweets usually come with a $ symbol followed by the stock name. This convention is very similar to hashtags and helps ensure that a tweet refers to financial subjects. For example, $VOD stands for the Vodafone stock. We choose to use $ keywords exclusively in order to avoid irrelevant tweets, as some acronyms might be used in a different context.

Since we want historical data for the whole of 2015, we cannot use Twitter's API8, which is limited to only one week of history. Instead, we exploit a web scraping tool programmed in Java which allows us to bypass some of its limitations. This tool is an open-source project9 which can be used through the command line on Microsoft Windows, as shown in Figure 3.2.

Figure 3.2 - Command line to get tweets.

The choice of keywords is made with several considerations in mind. One of the major concerns is the number of tweets to extract; thus we search for several stocks in order to boost the number of tweets and get closer to the real overall sentiment around the index. However, apart from the FTSE100 itself, we select only the six strongest
8 https://apps.twitter.com/
9 https://github.com/Jefferson-Henrique/GetOldTweets-java

representatives of five different sectors: Lloyds and HSBC for banking, BP for oil and gas, GlaxoSmithKline (GSK) for pharmaceuticals, Vodafone for telecommunications and Unilever for consumer goods. The codes for each company are shown in Table 3.1. This choice satisfies requirements in terms of lower dimensionality but also prevents redundancy and irrelevancy of information. These stocks should be the most prominent in their sectors, and their tweets should be as representative as possible of those related to the FTSE100 index.

Company                       Code
BP                            BP
GlaxoSmithKline               GSK
HSBC                          HSBA
Lloyds Banking Group          LLOY
London Stock Exchange Group   LSE
Vodafone Group                VOD
Unilever                      ULVR

Table 3.1 - Seven companies from the FTSE100 and their codes.

In the end, we get 72,194 tweets for the whole year, or 198 per day on average. The output of our previous Java request is a spreadsheet with many attributes for each tweet retrieved10. Among them we have the geographical area, which could be used to limit the scope of our analysis. But as explained beforehand, it is reasonable to keep a broader area for international companies. One more argument is that only 1% of the tweets indicate a localization.

3.3 Sentiment analysis

Even though the number of tweets also constitutes an interesting time series to
compare with our stock data, sentiment analysis allows for the consideration of the tweets'
polarity and should intuitively improve the link with stock movements. Indeed, positive tweets should not have the same impact as negative ones. Therefore, it is reasonable to create a time series which translates this concept.

10 Username, date, number of retweets, number of likes, text, geographical area, mentions, hashtags, user id and permalink.
3.3.1 Tweets cleaning

Twitter data provide many different attributes that are not relevant in our case and hinder further analyses. For sentiment analysis, we only want to keep the text information; we do not need the username of the poster or the number of retweets.

Furthermore, to reduce dimensionality and simplify the inputs, text information has
to be cleaned of links, images and any insignificant terms for sentiment analysis. To achieve
this, we use the free software environment R with the tm (text mining) package.

With thousands of contributors and more than two million users worldwide, the R language is widely used for analytical and statistical purposes. Unlike some of its competitors, R is an open-source project, which brings a multitude of libraries and makes the software even more versatile and flexible as a statistical analysis toolkit. It is also capable of producing graphics and data visualizations, a major advantage that many languages lack.

With the tm package function tm_map, we remove numbers and punctuation but opt not to remove stop words, to avoid losing information about negations in sentences.
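A sketch of this cleaning step with the tm package follows; the tweets$text column and the link-stripping regular expression are illustrative assumptions, while keeping stop words matches the choice described above.

    library(tm)

    corpus <- VCorpus(VectorSource(tweets$text))
    corpus <- tm_map(corpus, content_transformer(function(x)
      gsub("http\\S+", "", x)))                  # strip links
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removePunctuation)
    # stop words are deliberately kept so that negations such as "not" survive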

3.3.2 Tweets classification

Choice of classifier

As we acknowledged in the first section, classifiers like SVM and mNB, based on supervised machine learning, are widely used for text classification. In this paper though, we decide to set them aside and take a rule-based approach instead.

Some tools developed for other purposes already provide good results. For example, Linguistic Inquiry and Word Count (LIWC)11 is commonly used in social research. It provides many language dimensions, which might not all be relevant in the financial context. The classification is an output in itself and this leads to a complex basis for further

11 http://liwc.wpengine.com/
analysis. For this reason and the lack of a publicly available license, we decide to code our
own simpler program.

As we want to label only historical data, training a model on tweets annotated with this very same rule would only result in more misclassifications. Along with this argument, a rule-based approach is also easier to implement and faster to run. The results are also easier to interpret, as we do not need to compare them with other labeled datasets. Of course, we still randomly select a manually annotated sample to compare the efficiency of our rule with the expected results from a human perspective.

Classification implementation

First of all, we decide to sort tweets into three classes with a Java program. The use of
Java is supported by several arguments, the most important being the vast array of third-party libraries and the amount of documentation available. It is a very common choice due to its performance and its design as an object-oriented language.

On Twitter, some accounts' sole purpose is to reflect actual facts without adding any new information. Such tweets, which merely repost the latest closing prices already available online, should not have any significant impact on future movements; we therefore add a neutral class to the positive and negative ones in our classification.

In this context, the tweets are only historical data that are compared with a dictionary
mixing two different sources. The first one is the Hu and Liu (2004) lexicon, which has been
cited in similar works (Deng et al., 2013; Grčar, Lavrač, Smailović & Žnidaršič, 2013) to
perform an automated opinion analysis (Cheng, Hu & Liu, 2005). This lexicon is already
quite complete as it even takes care of misspelled words, which are likely to happen in social
media content.

However, this lexicon contains an unequal number of positive and negative words.
Indeed, it contains 4783 negative words but only 2006 positive words. In pure statistical logic,
this would result in an overestimation of the number of negative tweets. To mitigate this
effect, we added positive words from the Harvard IV-4 dictionary (Bullard, Crossing & Dunphy,

1974). The final list of positive words consists of 2816 entries. Adding more lexicons does
not increase this number by much, which is why we disregard mixing more sources.

With a Java program, we compute a polarity score that is ultimately converted into one of three values: 1, 0 or -1, one for each class. As the code runs, each word of each tweet is compared with the positive and negative lexicons. A positive word adds 1 to the tweet's polarity score and a negative word subtracts 1. Let us not forget that the presence of a positive or negative word does not necessarily mean that the tweet expresses such a sentiment (Liu, 2010; Liu, 2012). Finally, thanks to conditional statements, the score is changed to 1 if it is positive, -1 if it is negative and 0 otherwise.

Polarity Inverters (PI)

Additionally, we want to consider that a tweet containing a positive word may not necessarily be positive: when a positive word is preceded by a negation, the sense is inverted. For example, in the following tweet: "FTSE100 Resistance 6100 6200 to 6450. Support 5874. I wonder if Range between 5900-6400 - probably not good for 2016", the term good suggests a positive feeling, even though the tweet is not positive.

Therefore, we also compare the words with a personalized negation lexicon of 55 terms; the presence of one of them in a tweet changes the sign of the polarity score. Obviously, terms like not, never, don't or dont can be found in our negation lexicon and allow the previous example to be classified accordingly as negative. This is indeed a rather strong assumption which can lead to other misclassifications. But financial tweets are usually very short and straightforward, which makes the assumption quite reasonable. We can assume no long-range grammatical dependencies thanks to Twitter's 140-character limit.

Sentiment variable

Once all tweets have been assigned to one of the three classes, the next step is to compute a single value characterizing the overall sentiment before each occurrence of the stock market time series. This value is simply the sum of the class values of the tweets posted between two open days of the London stock market. This sum can be either positive or negative. For example, if three positive tweets and one negative tweet have been posted, the sentiment for that day will be 2.
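The whole classification pipeline fits in a few lines. The thesis implements it in Java; the R sketch below is only an illustration, in which pos_words, neg_words and negation_words are assumed to hold the merged lexicon and the 55-term negation list, and tweets$trading_day is an assumed mapping of each tweet to the next open day.

    score_tweet <- function(text) {
      words <- strsplit(tolower(text), "\\s+")[[1]]
      score <- sum(words %in% pos_words) - sum(words %in% neg_words)
      if (any(words %in% negation_words)) score <- -score  # polarity inverter
      sign(score)                                          # collapse to 1, 0 or -1
    }

    # daily sentiment: sum of class values between two open days
    daily_sentiment <- tapply(sapply(tweets$text, score_tweet),
                              tweets$trading_day, sum)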

Table 3.2 shows 10 tweets featuring the $FTSE100 keyword with the polarity assigned by the rule-based Java program. As can be seen, tweets featuring many hashtags and little information are correctly labeled as neutral, while other more personalized tweets are given an opinion.

Polarity Text
-1 Indices worldwide will fall into 40 day trough next week, the shape of which will define
the longer term outlook #dax #es_f #djia #ftse100

-1 EUROPEAN MARKETS: #FTSE100 down 0.14%, #DAX down 0.32%


pic.twitter.com/Zxrtn5K29a
-1 U.S Stocks hit this week, the DOW is down about 400 points, but the FTSE100 is stable so
far, but for how long? https://twitter.com/FerroTV/status/629534956780408832 …

0 Today's #Technical #Analysis of #Currencies #ftse100 #Gold & #Oil from FxNet
http://fxnet.com/news/news-category-technical-analysis-26 …
pic.twitter.com/wfk04ahvBJ

0 Today's Most Discussed #FTSE100 Stocks http://bit.ly/1Itopmu $RIO $AV $RSA $OML
$MNDI $RRS $ISAT $AZN $FTSE pic.twitter.com/NiAtOtPsdG
1 How the aim #market managed to outperform the ftse 100 #news #market #business
http://j.mp/1UgkpuR pic.twitter.com/QqHcu0RCqL
-1 FTSE 100 Technical Correction Should Not Deter Bullish Trend http://goo.gl/6emePA

-1 European shares pushed down by weak commodity prices - * FTSEurofirst 300 down 0.5
pct at close, FTSE 100 down ... http://ow.ly/39zj1f
1 U.K. Stocks Advance as FTSE 100 Resumes Trading After Holiday - Investors Europe
Offshore http://sco.lt/5Y3lib
-1 London #markets: ftse 100 closes lower as oil falls; 4-day win streak ends #news #market
http://j.mp/1MGq9IK pic.twitter.com/hQKluceBgd

Table 3.2 - Sample of classified tweets.

To validate the lexicon and polarity inverters implementation, we randomly select
200 tweets among the non-neutral ones and manually annotate them as either positive or
negative to check how well our program is doing. We get 85% accuracy on this sample.
In comparison with machine learning techniques, the SOFNN from Bollen, Mao and
Zeng (2011) yields 87.6% accuracy and the SVM implementation from Grčar, Lavrač,
Smailović and Žnidaršič (2013) yields 81.06% accuracy. Additionally, without considering
the negation lexicon, we get a slightly lower 84%.

These results are in line with expectations, but confirming them would require a
bigger sample. Unfortunately, manually annotating tweets is a very long process which is
also prone to errors. At the very least, we can say that this is a reasonable method to
classify our individual tweets.

Also, neutral tweets are excluded from this validation phase because it is harder for the
human mind to distinguish between a neutral and an opinionated tweet. But with more
than 6,000 opinion terms in our personalized lexicon, it seems safe to assume that the
tweets assigned to the neutral class do lack sentiment information.

3.4 Data normalization

Except for the stock movements, all variables, both dependent and independent, are
normalized so that we can compare them in further analyses. We thus get z-scores for the
three exogenous time series: the number of tweets, the sentiments and the FTSE100 log
returns. Z-scores are calculated by subtracting the local mean and dividing the difference by
the standard deviation. For a dataset X, the z-score of x_i, denoted z(x_i), is computed as:

$$z(x_i) = \frac{x_i - \mu_X}{\sigma_X}$$
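
A one-line R equivalent, shown here as a sketch on hypothetical raw vectors:

    zscore <- function(x) (x - mean(x)) / sd(x)  # same as scale(x)[, 1]
    zNbTweet   <- zscore(nb_tweets)
    zSumsPI    <- zscore(sums_pi)
    zLogReturn <- zscore(log_returns)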

3.5 Time series

We get 4 time series, gathered in a spreadsheet and loaded into R as list variables. Table
3.3 displays how they are referred to in this paper.

Time series   Description of data points

Binary        Stock movement of the day: takes the value 1 if the market goes up and 0 if it goes down
zLogReturn    Normalized natural logarithm of the return of the day
zNbTweet      Normalized number of tweets per day
zSumsPI       Normalized Twitter sentiment score of the day

Table 3.3 - Names and meaning of time series.

Note that the returns in zLogReturn are computed on adjusted closing prices. The z at
the start of each variable name refers to z-scores and indicates that we deal with
normalized variables.

Moreover, the name zSumsPI comes from the computation performed beforehand:
data points are calculated as the sum of the classes of all tweets for a particular day. The PI
suffix stands for polarity inverters, indicating that this time series uses the list of classes
that takes PI into account.

Finally, zNbTweet is based on the number of tweets between two dates when the
stock market is open. Out of the 365 days in the year, the LSE was only open on 253. This
means that days preceded by a longer idle period are presumably influenced by more tweets.
This phenomenon is mitigated by the normalization but also by two assumptions. First, fewer
tweets are sent per day when the market is closed. Second, even if tweets are posted two days
or more before the opening, they should still have an impact on both the latest tweets and the
stock movement.
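
As an illustration of how the price-based series could be derived, assuming a hypothetical vector adj_close of adjusted closing prices ordered by open day:

    log_returns <- diff(log(adj_close))         # log return between consecutive open days
    Binary      <- as.integer(log_returns > 0)  # 1 if the market goes up, 0 otherwise
    zLogReturn  <- (log_returns - mean(log_returns)) / sd(log_returns)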

3.6 Modeling (FTSE100 index level)

In the next sections, we try to derive a model which correctly approximates the links
between the observations. If the model is validated, it might be used for predictions of stock
movements based on Twitter data.

3.6.1 Plot visualization

To get a first idea about the relationship between the variables, a good start is to
plot them. We begin by comparing one-day-ahead [12] exogenous variables with Binary. Each
circle represents a data point. If the point is green, it means that the market goes in the
expected direction (e.g. there were a lot of negative tweets yesterday and the market goes
down today). If it is red, it goes against the hypothesis that the tweets have this kind of
influence on stock movements.

Figure 3.3 - Stock movements in function of the number of tweets the day before.

As we can observe in Figure 3.3, it is hard to see a clear cut-off between low and
high numbers of tweets in terms of predictive information. Even if the density of points when
the market goes down tends to favor the intuition, the cloud seems equally dense when the
market goes up, indicating that the number of tweets has little chance of being correlated in
any way with the stock movements. We also draw a graph comparing zNbTweet with
absolute values of zLogReturn (Figure A1 in the appendix), as we could assume that more
tweets relate to a stronger variation, regardless of the direction. However, the graph does not
show such a relation.

[12] We implicitly take into account the days when the market is not open, as explained in section 3.5.

Figure 3.4 - Stock movements in function of the Twitter sentiment score the day before.

The same pattern is observed in Figure 3.4. A past negative Twitter sentiment score
does not seem to translate systematically into a declining market. Similarly, a positive score
does not systematically translate into a rise of the next closing price either.

Figure 3.5 - Stock movements in function of the returns of the day before.

Again, this last graph does not point towards a favorable conclusion. Yet before
rejecting the possibility of a link, we try to apply regression models to our time series.

Figure 3.6 - Returns in function of the Twitter sentiment score the day before.

Figure 3.6 shows a scatter plot with returns as the ordinate, to investigate a potential
linear correlation between them and Twitter sentiment. Once more, the data points are very
sparse and do not show a particular trend. Other graphs with returns as the ordinate can be
seen in the appendices (Figures A2 and A3) but they do not bring much information either,
as the points are highly dispersed. For this reason, we do not try to fit models to predict
returns and focus on stock movements instead.

3.6.2 Linear regression

We still try to estimate parameters a and b for a linear regression with only one
independent variable, of the classical form:

$$Y = a + bX + \varepsilon$$

where Y is the time series we try to predict (i.e., Binary) and X is the time series which
supposedly holds the most predictive information about Y. The term ε represents the errors,
also called residuals.

The advantage of fitting a linear model is that its predicted values have a
straightforward interpretation. Parameters can easily be adjusted over time when the market is
expected to change more drastically because of a recent crisis. However, the model relies on
several strong assumptions and is less appropriate if the probabilities of both possible binary
outputs are extreme.

We formally check two of the said assumptions: homoscedasticity of the residuals and
normality of the error distribution. Statistical independence is assumed and linearity can be
checked by looking at plots [13].

[13] See Figure 3.6 and Figures A2 and A3.

We fit our data by Ordinary Least Squares (OLS) with the lm function and use the
lmtest package to apply different tests on the obtained model. The fitted model is stored in a
variable p. We then print the estimated slope and intercept with the coef function. The
parameters can be tested with the coeftest function to see if they are statistically significant
according to a Student t-test.
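
A minimal sketch of one such fit, regressing Binary on the one-day-lagged sentiment score (variable names as defined in Table 3.3):

    library(lmtest)
    n <- length(Binary)
    p <- lm(Binary[2:n] ~ zSumsPI[1:(n - 1)])  # one-day lag
    coef(p)      # estimated intercept and slope
    coeftest(p)  # t-tests on the estimates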

For each of the three fitted models, we report the parameters and their significance
values in Table 3.4.

Variable used in model   Slope       Slope p-value   Intercept   Intercept p-value
zLogReturn               0.036214    0.2735          0.517866    <2e-16
zNbTweet                 0.0049675   0.8774          0.5158330   <2e-16
zSumsPI                  -0.024324   0.4428          0.515881    <2e-16

Table 3.4 - Estimated parameters of linear regression and their p-values.

In all cases, the slope parameters are very low, indicating a rather flat line. The slope
p-values do not allow us to validate them, as none is below 5%. The conclusion is different
for the intercepts, which are very close to 0.5 for the three models and of which we can be
confident, as their p-values are extremely low.

Breusch-Pagan test

One of the assumptions of a linear regression is the homoscedasticity of the residuals.
This means that we assume all residuals have the same variance. This can be tested with a
chi-squared test called the Breusch-Pagan test (Breusch & Pagan, 1979). If the p-value
obtained is below the significance level, we reject the homoscedasticity assumption. Other
tests such as the White test exist, but we stick with Breusch-Pagan for its easier
interpretation and its straightforward implementation in the lmtest package.
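
Applied to the fitted model p from the sketch above, the test is a single call:

    bptest(p)  # a p-value below 5% rejects homoscedasticity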

Variable used in model   Breusch-Pagan p-value
zLogReturn               0.005814
zNbTweet                 0.6724
zSumsPI                  7.143e-05

Table 3.5 - Breusch-Pagan test p-values.

We have to reject homoscedasticity, except for the number of tweets, where the p-value
is largely above the significance level. This does not mean we truly have homogeneity of
variances, but at least we cannot reject it formally.

Shapiro-Wilk and Kolmogorov-Smirnov test

When performing linear regression, one common assumption is the normality of the
residuals' distribution. The Shapiro-Wilk test is an analysis of variance test whose
interpretation is quite straightforward (Shapiro & Wilk, 1965). However, as a rule of thumb,
with more than 50 cases it is preferable to use a Kolmogorov-Smirnov test. Both of these
tests have limitations and might give false indications of non-normality. Therefore, we also
check plots to analyze the kurtosis and skewness.

Built-in R functions allow for such tests. They all provide p-values for the null
hypothesis of normality, which is rejected if the p-value is below 5%.
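
As a sketch on the residuals of the model p fitted earlier:

    res <- residuals(p)
    shapiro.test(res)                          # Shapiro-Wilk
    ks.test(as.numeric(scale(res)), "pnorm")   # Kolmogorov-Smirnov vs. standard normal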

Variable used in model   Kolmogorov-Smirnov   Shapiro-Wilk
zLogReturn               0.2931               0.0002221
zNbTweet                 9.367e-05            2.279e-11
zSumsPI                  0.008729             3.236e-09

Table 3.6 - P-values of tests for normality.

According to the Kolmogorov-Smirnov tests, the only plausible time series with errors
distributed according to a normal distribution is zLogReturn. Shapiro-Wilk p-value says
otherwise, though. But as we stated earlier, Shapiro-Wilk is usually better suited for small
datasets.

Again, we should not rely only on statistical tests but also assess normality by looking
at graphs. We draw the three normal quantile plots to check whether the points follow the
diagonal correctly. A bow-shaped deviation indicates a high skewness, while an S shape is
rather symptomatic of a high kurtosis.

According to Figure 3.7, the plot closest to the diagonal is the one for zLogReturn,
which resonates with the Kolmogorov-Smirnov results. The shape is more debatable for
Twitter sentiment scores, but we can clearly identify a deviation in the normal quantile plot
for zNbTweet. If we combine this analysis with the statistical tests, we should only assume
normality for the returns.
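
The quantile plots themselves can be drawn with the built-in functions, e.g. for the model p:

    qqnorm(residuals(p))  # normal quantile plot
    qqline(residuals(p))  # reference diagonal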

Figure 3.7 - Normal quantile plots.

However, non-normality does not imply that the estimated parameters cannot yield a
low mean squared error when predicting. But it does indicate a problem and suggests that a
better model exists, since some unusual data points are overlooked by a classical linear
regression.

Since many assumptions, especially linearity, are violated, we do not think the
linear model is trustworthy for predictions.

3.6.3 Granger causality analysis

As we have seen before [14], despite not implying causation per se, Granger causality
analysis is a great tool to determine whether one time series holds predictive information
about another. When performing this test, we obtain a p-value. If this value is lower than the
significance level, the independent time series Granger-causes the dependent one. Here, we
arbitrarily fix the significance level at 5%.

[14] See section 2.5.1.

We try to improve our linear model by incorporating the lagged stock movements.
Mathematically and statistically, Granger causality does not exist if and only if no lagged
value of X is retained in the univariate autoregression of Y. The values of X are retained if
they are individually significant according to a Student t-test and if they collectively add
explanatory power according to an F-test. Hiemstra and Jones (1994) define it in detail and
confirm its performance in the financial context.

We test three null hypotheses: for each of our three independent time series, we want
to know whether we can reject that it does not predict stock movements. When we reject this
hypothesis, it means that the independent time series Granger-causes the stock movements.
Similarly to Grčar, Lavrač, Smailović and Žnidaršič (2013), we consider different lags for
each pair of time series.

To implement it on our data, we use the vars package in R. A loop lets us choose the
lag, and the package provides a direct interpretation among other outputs. The p-values of the
Granger causality tests are displayed in Table 3.7.
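
A sketch of this loop for one pair of series, using the variable names defined earlier:

    library(vars)
    for (lag in 1:4) {
      v <- VAR(data.frame(Binary, zSumsPI), p = lag, type = "const")
      print(causality(v, cause = "zSumsPI")$Granger)  # Granger test p-value
    }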

Lag zLogReturn zNbTweet zSumsPI
1 day 0.4988 0.9054 0.4032
2 days 0.6464 0.9116 0.2756
3 days 0.5369 0.9268 0.1442
4 days 0.2258 0.3929 0.2518

Table 3.7 - Granger causality correlation p-values between stock movements and independent variables.

Unfortunately, no p-value is under 5%. This means that these variables do not
provide predictive information about stock movements. The closest is attained with a lag of
3 days for the Twitter sentiment score. Knowing how fast microblogging and the market can
change, this result can be a bit surprising, but we cannot interpret it rigorously as the p-value
is not significant.

However, while the stock movements are represented by a binary variable, the
Granger tests performed are based on linear relationships. Even though extensions of this
framework can support non-linear causal relationships, it remains less appropriate here.

3.6.4 Logistic regression

A logistic regression might prove more useful in this context. Indeed, this
generalized linear model is known to be suited for dichotomous problems because it gives a
clear answer via probabilities. It is therefore reasonable to consider it for predicting a binary
variable. As a side note, using chi-squared tests might be a simpler way to describe the
strength of the relationship, but they do not consider a dependent variable and are therefore
not the best choice for predictions.

The logistic regression model computes the probability of a binary variable Y being
equal to 1. It then seeks to fit the β parameters for the different variables X_i considered:

$$P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \sum_i \beta_i X_i)}}$$

As opposed to linear regression, logistic regression has fewer assumptions to check.
For instance, it does not require normality of the error distribution. Homoscedasticity can be
disregarded, as well as linearity. To fit the model with a binary response, our software uses
Maximum Likelihood Estimation (MLE), while linear regressions require OLS, from which
residuals emanate. We can think of the logistic function as a set of Bernoulli distributions
with probabilities which depend on the data for the independent variables.

Figure 3.8 - Logistic function.

Figure 3.8 shows the expected S-shaped curve of a logistic function representing the
probabilities of having 1 as the response. If the model is correct, the visual representations of
our actual data points should be denser at the tails. We have seen earlier that this is not
obvious in our case [15].

[15] See Figures 3.3, 3.4 and 3.5.

One of the downsides of logistic regression is its interpretability. It usually makes
reference to the Odds Ratio (OR), which is a ratio of odds. The odds for the market going up
the next day are computed as follows:

$$odds(x_i) = \frac{P(Y = 1)}{P(Y = 0)} = \frac{P(Y = 1)}{1 - P(Y = 1)}$$

Then, if we add 1 to x_i, we can compute the OR as the impact on the odds:

$$OR = \frac{odds(x_i + 1)}{odds(x_i)} = e^{\beta_i}$$

We see that the exponentials of the parameters quantify the impact of incrementing the
independent variables. ORs can also be used to compare two models using different variables
and to assess in which data an event is more likely to occur. They are increasingly used
because of the widespread use of logistic regression in medical and social science research.

Yet they can be confusing when trying to understand the overall impact on
probabilities. Taking the natural logarithm of the OR helps mitigate the sensitivity to
relative positions when comparing odds across different groups of data.
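
As an illustration with the zSumsPI slope of -0.13525 reported in Table 3.8 below, the corresponding OR would be $e^{-0.13525} \approx 0.87$: a one-unit increase in the normalized sentiment score would multiply the odds of an upward movement by about 0.87, all else being equal.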

We implement the logistic regression with the built-in R function glm, using the
binomial family with a logit link as its family parameter. The variables are the same: Binary
as the dependent variable and the three others, with a one-day delay, as the independent ones.
Since logistic regression can also handle binary exogenous variables, we additionally consider
lagged values of the stock movements themselves.
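
A minimal sketch of one such bivariate fit, mirroring the linear case above:

    n <- length(Binary)
    g <- glm(Binary[2:n] ~ zSumsPI[1:(n - 1)],
             family = binomial(link = "logit"))
    summary(g)    # parameters and p-values (cf. Table 3.8)
    exp(coef(g))  # odds ratios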

Our first implementation fits the four simplest bivariate models with lagged variables
as the exogenous inputs.

Variable     Intercept   Intercept p-value   Variable parameter   Variable p-value
zSumsPI      0.09115     0.556               -0.13525             0.336
zNbTweet     0.06992     0.661               0.04571              0.757
zLogReturn   0.09725     0.529               0.07934              0.623
Binary       -0.1236     0.579               0.3981               0.199

Table 3.8 - Logistic regression parameters for bivariate models.

Unfortunately, the p-values for all of the parameters are above the significance level of
5%. We can still make some observations: the lowest p-value belongs to the lagged Binary
variable. Also, the parameter for zSumsPI is negative. This is quite singular, because it would
mean that a more positive Twitter sentiment has a negative impact on the market.

Now, we report the results for pairs of independent variables. We voluntarily omit the
pair Binary - zLogReturn, as they carry the same meaning.

Variable               Intercept   Intercept   1st variable   1st variable   2nd variable   2nd variable
                                   p-value     parameter      p-value        parameter      p-value
zSumsPI-zNbTweet       0.07243     0.651       -0.14643       0.309          0.07024        0.642
zSumsPI-Binary         -0.1088     0.627       -0.1270        0.369          0.3845         0.216
zSumsPI-zLogReturn     0.10837     0.486       -0.15457       0.279          0.07461        0.644
zNbTweet-Binary        -0.13366    0.554       0.04001        0.787          0.39553        0.202
zNbTweet-zLogReturn    0.08574     0.593       0.04017        0.786          0.07619        0.637

Table 3.9 - Logistic regression parameters for multivariate models.

Again, the results appear statistically insignificant, with a Twitter sentiment score
whose estimated effect on stock movements goes in the opposite direction to intuition.

3.7 Results (FTSE100 index level)

We could not validate the parameters fitted for the linear and logistic models
theoretically, but we can still measure their performance heuristically. We have 253 days
when the market is open for the year 2015. Out of these, we take a holdout-set approach and
train our models on 170 days, which corresponds to two thirds of our dataset. The remaining
third is the test dataset, used for testing predictions and computing accuracy metrics.

In the literature, plain accuracy is usually reported. However, to avoid judging a model
as good only because it yields a deceptively high accuracy, we also look at other metrics and
tools such as the confusion matrix and, ultimately, F1-measures. Some authors like Bar-Haim,
Dinur, Feldman, Fresko and Goldstein (2011) also use F1-measures, but only in the context
of text classification and not for their predictions on the market. It is indeed a widespread
metric in information retrieval.

In this section, we compare the metrics of our linear and logistic regression models
based on the different variables. For logistic regression, we first look at bivariate models,
given that additional variables do not seem statistically relevant.

We look at a confusion matrix, which displays the false positives (FP), false negatives
(FN), true positives (TP) and true negatives (TN). Positive here refers to a value of 1 for our
Binary variable and negative to a value of 0. Values are true when the predictions are
accurate; false positives correspond to type I errors and false negatives to type II errors.

1 (predicted) 0 (predicted)
1 (actual) TP FN
0 (actual) FP TN

Table 3.10 - Confusion matrix.

The first value computed is the misclassification rate, obtained by the following
equation:

$$\text{Misclassification rate} = \frac{FP + FN}{P + N}$$

where P + N is the total population of the test dataset. Computing accuracy is
straightforward:

$$\text{Accuracy} = 1 - \text{Misclassification rate}$$

Thanks to the confusion matrix, we can also calculate precision and recall, which can
ultimately be combined into the F1-measure:

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F1-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
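
A sketch of the evaluation, assuming hypothetical 0/1 vectors pred (rounded model outputs on the test set) and actual (observed Binary values):

    TP <- sum(pred == 1 & actual == 1)
    FP <- sum(pred == 1 & actual == 0)
    TN <- sum(pred == 0 & actual == 0)
    FN <- sum(pred == 0 & actual == 1)
    accuracy  <- (TP + TN) / (TP + TN + FP + FN)
    precision <- TP / (TP + FP)  # NaN when the model never predicts 1
    recall    <- TP / (TP + FN)
    f1        <- 2 * precision * recall / (precision + recall)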

The reported results for our bivariate models can be found in Table 3.11 below.

Model        Accuracy   Precision   Recall    F1-measure
Linear
zSumsPI      0.5        0.50685     0.88095   0.64348
zNbTweet     0.51220    0.51220     1         0.67742
zLogReturn   0.54878    0.53333     0.95238   0.68376
Logistic
zSumsPI      0.5        0.50685     0.88095   0.64348
zNbTweet     0.51220    0.51220     1         0.67742
zLogReturn   0.54878    0.53333     0.95238   0.68376
Binary       0.60976    0.61905     0.61905   0.61905

Table 3.11 - Results for bivariate models.

As we can see, the results are not very interesting. The bivariate models share the same
accuracy metrics for linear and logistic regressions because of the extreme nature of the
binary variable to predict: we round the values computed by the models to compare them
directly with our actual data, and after rounding the differences are negligible.

Other results are also surprising: models using the number of tweets as the variable
yield maximum recall. Analyzing the precision, we understand that these models compute
many false positives. Looking at the predicted values, we discover that the model never
predicts that the market will go down on the days considered, which is very unlikely. Overall,
the models seem to overestimate the positive effect of the variables on stock movements.

According to the F1-measure, the most satisfying model is based on the lagged log
returns, with an F-score of 68% and 55% accuracy. Lagged stock movements provide more
consistent results between precision and recall. This suggests that Twitter data alone are not
useful for predictions. Nevertheless, they might still hold predictive information, so we try
multivariate models and analyze the results of Table 3.12.

Model Accuracy Precision Recall F1-measure
Logistic
zSumsPI-zNbTweet 0.5 0.50847 0.71429 0.59406
zSumsPI-Binary 0.60976 0.60417 0.69048 0.64444
zSumsPI-zLogReturn 0.53659 0.52857 0.88095 0.66071
zNbTweet-Binary 0.60976 0.61905 0.61905 0.61905
zNbTweet-zLogReturn 0.52439 0.52174 0.85714 0.64865

Table 3.12 - Results for multivariate models.

The results are again disappointing. Compared with the bivariate models, the overall
accuracy is never improved by incorporating Twitter data. A very slight increase in the F1-
measure is observed when combining Twitter sentiment and lagged stock movements:
without Twitter data, it stands at 62% and rises to 64% with it.

In conclusion, the heuristic approach confirms our theoretical analysis. Our Twitter
sentiment scores are not suited for predictions of the FTSE100 index movements.

3.8 Modeling (company stock level)

In the first part, we used a mix of tweets relative to several keywords. Some of them
are representative of individual companies' stocks, since the FTSE100 index is influenced by
these stocks. However, this also implies that our previous collection of tweets is not
comprehensive, since it does not include tweets about all the stocks composing the FTSE100
index. Since we lose information with this wider view, it might be better to focus on a single
company's stock. By doing so, we should improve the comprehensiveness of our Twitter data
and, supposedly, its correlation with the related stock.

Consequently, we choose to apply our methodology to the BP stock, with tweets
containing the keyword $BP. It is the most tweeted stock in our sample of companies [16],
which makes it the most legitimate candidate for investigating whether tweets have an impact
on stock movements.

[16] See Table 3.1.
We do not need to go through the classification step again, since we are only
considering a subset of our bigger dataset. We have 16,867 tweets talking about the BP stock,
and we get the stock market data from Yahoo! Finance as before. Again, we get 253 closing
prices on which we compute the log returns.

Finally, we obtain our time series after summing the tweets' classes for each day,
converting the returns into a binary variable and normalizing the data. The variables are
labeled with the same names as previously described [17].

[17] See Table 3.3.

3.8.1 Plot visualization

We begin by drawing plots. As a reminder: if the point is green, it means that the
market goes in the expected direction (e.g. there were a lot of negative tweets yesterday and
the market goes down today). If it is red, it goes against the hypothesis that the tweets have
this kind of influence on stock movements.

Figure 3.9 - BP stock movements in function of the number of tweets the day before.

Figure 3.10 - BP stock movements in function of the Twitter sentiment score the day before.

Figure 3.11 - BP stock movements in function of the returns of the day before.

As in the first analysis, the graphs do not point towards a favorable conclusion. We
still try to apply regressions before rejecting the possibility of a link.

Figure 3.12 - BP Returns in function of the Twitter sentiment score the day before.

At the company stock level too, the scatter plots with returns, shown in the
appendices (Figures A4 and A5), are very sparse: the points do not fall along a line. We
therefore focus on stock movements.

3.8.2 Linear regression

We again fit our data by OLS and use the lmtest package to apply the different tests to
the obtained models. For each of the three models fitted to predict BP stock movements, we
report the parameters and their significance values in Table 3.13.

Variable used in model   Slope       Slope p-value   Intercept   Intercept p-value
zLogReturn               0.040994    0.2293          0.464719    <2e-16
zNbTweet                 0.02633     0.673           -0.01389    0.825
zSumsPI                  -0.051023   0.04696         0.464366    <2e-16

Table 3.13 - Estimated parameters of linear regression and their p-values for BP.

In all cases, the slope parameters are very low, indicating a quite flat line. Only
zSumsPI is associated with parameters we can validate, their p-values being lower than the
significance level of 5%. Even so, we should remain careful, since the slope p-value is very
close to 5%.

Breusch-Pagan test

The first assumption we check is the homoscedasticity of the residuals, with a
Breusch-Pagan test.

Variable used in model   Breusch-Pagan p-value
zLogReturn               0.1842
zNbTweet                 0.05259
zSumsPI                  2.691e-05

Table 3.14 - Breusch-Pagan test p-values for BP.

We have to reject homoscedasticity when the p-value is lower than 5%. This only
happens for the model with Twitter sentiment, indicating that, with 95% confidence, the
variances are not equal.

Shapiro-Wilk and Kolmogorov-Smirnov test

We also have to check the normality of the residuals' distribution. The results for both
tests are listed in Table 3.15. The null hypothesis of normality is rejected if the p-value is
below 5%.

Variable used in model   Kolmogorov-Smirnov   Shapiro-Wilk
zLogReturn               0.1685               1.508e-05
zNbTweet                 1.09e-08             < 2.2e-16
zSumsPI                  0.00221              < 2.2e-16

Table 3.15 - P-values of tests for normality for BP.

We can draw the same conclusions as for the FTSE100 analysis. According to the
Kolmogorov-Smirnov tests, the only time series whose errors are plausibly normally
distributed is zLogReturn. The Shapiro-Wilk p-value says otherwise, though; but as we stated
earlier, Shapiro-Wilk is usually better suited for small datasets.

According to Figure 3.13, the plot closest to the diagonal line of normality is the one
for zLogReturn, in line with the Kolmogorov-Smirnov results. We can clearly identify a
deviation in the normal quantile plots for zSumsPI and zNbTweet.

If we combine this analysis with the statistical tests, we should assume normality for
the returns only.

Figure 3.13 - Normal quantile plots for BP.

3.8.3 Granger causality analysis

The p-values of the Granger causality tests are displayed in Table 3.16.

Lag      zLogReturn   zNbTweet   zSumsPI
1 day    0.6345       0.4408     0.1241
2 days   0.7923       0.4893     0.2754
3 days   0.3758       0.2096     0.2877
4 days   0.4464       0.16       0.2065

Table 3.16 - Granger causality correlation p-values between stock movements and independent variables for BP.

Unfortunately, the results are the same as at the FTSE100 index level: no p-value is
under 5%, which means that these variables do not provide predictive information about
stock movements. The closest is attained with a lag of 4 days for the number of tweets.
Knowing how fast microblogging and the market can change, this result can be a bit
surprising, but we cannot interpret it rigorously as the p-value is not significant.

3.8.4 Logistic regression

In this next step, we try to fit parameters for a logistic model, as our dependent data
are still binary. Our first implementation in R fits the four simplest bivariate models with
lagged variables as the exogenous inputs.

Variable     Intercept   Intercept p-value   Variable parameter   Variable p-value
zSumsPI      -0.2150     0.1700              -0.2763              0.0969
zNbTweet     -0.2212     0.158               -0.2365              0.177
zLogReturn   -0.20285    0.190               -0.02707             0.877
Binary       -0.21825    0.298               0.03593              0.908

Table 3.17 - Logistic regression parameters for bivariate models for BP.

The conclusion is the same as for the FTSE100: the p-values for all of the parameters
are above the significance level of 5%. The lowest p-value belongs to the lagged zSumsPI
variable. The parameters for both Twitter variables are negative; this would mean that
positive tweets and more tweets actually make the market go down. But let us recall that these
observations are not statistically significant.

As before, we report the results for pairs of independent variables. We voluntarily
omit the pair Binary - zLogReturn, as they carry the same meaning.

Variable               Intercept   Intercept   1st variable   1st variable   2nd variable   2nd variable
                                   p-value     parameter      p-value        parameter      p-value
zSumsPI-zNbTweet       -0.2236     0.157       -0.2267        0.229          -0.1162        0.567
zSumsPI-Binary         -0.20732    0.3275      -0.27726       0.0979         -0.01693       0.216
zSumsPI-zLogReturn     -0.21668    0.1674      -0.27854       0.0968         -0.03536       0.8433
zNbTweet-Binary        -0.23090    0.275       -0.23613       0.178          0.02137        0.945
zNbTweet-zLogReturn    -0.2212     0.158       -0.23655       0.180          0.0008971      0.996

Table 3.18 - Logistic regression parameters for multivariate models for BP.

Similarly, the results appear statistically insignificant, with unexpected signs for the
parameters related to Twitter data.

3.9 Results (company stock level)

The reported results for our bivariate models can be found in Table 3.19 below.

Model        Accuracy   Precision   Recall   F1-measure
Linear
zSumsPI      0.52439    0.555556    0.125    0.20408
zNbTweet     0.512195   NaN         0        NaN
zLogReturn   0.512195   NaN         0        NaN
Logistic
zSumsPI      0.52439    0.53846     0.175    0.26415
zNbTweet     0.512195   NaN         0        NaN
zLogReturn   0.512195   NaN         0        NaN
Binary       0.512195   NaN         0        NaN

Table 3.19 - Results for bivariate models for BP.

Table 3.19 shows very particular results. Most of our models actually predict only 0 on
our test dataset, which is why we have no true positives and no false positives either. When
computing the precision, the outcome is the undetermined form 0/0; in this case, R prints
"Not a Number" (NaN). Of course, the F1-measure, which is essentially the harmonic mean
of precision and recall, gets the same label. Here, the models underestimate the positive
impact of the number of tweets or the past returns on stock movements.

The only models that actually predict a rise in the closing price are the ones based on
Twitter sentiment. But they are not very interesting: the accuracy is only 52% and the F1-
measures are even worse, at 20% and 26% for the linear and logistic regressions respectively.
We continue our analysis by checking the results of the multivariate logistic models.

Model                  Accuracy   Precision   Recall   F1-measure
Logistic
zSumsPI-zNbTweet       0.54878    0.63636     0.175    0.2745
zSumsPI-Binary         0.52439    0.53846     0.175    0.26415
zSumsPI-zLogReturn     0.52439    0.54545     0.15     0.23529
zNbTweet-Binary        0.5        0           0        NaN
zNbTweet-zLogReturn    0.51219    NaN         0        NaN

Table 3.20 - Results for multivariate models for BP.

We observe a slight improvement in both accuracy and F1-measure when combining
the Twitter sentiment and the number of tweets. It is the best model we obtained, but it is still
useless for predictions. The zNbTweet-zLogReturn model again only predicts downward
movements; the interpretation is slightly different for zNbTweet-Binary. This model actually
predicts an upward movement for one day, but that prediction is wrong, making the precision
0. Since both precision and recall are 0, the F1-measure is undetermined.

As in the FTSE100 analysis, the heuristic approach does not bring interesting
results. Our Twitter data are not suited for predicting BP stock movements.

4. Conclusion and future work

This paper's aim was to extend earlier studies in order to investigate further the impact
of tweet sentiment on stock market movements and, eventually, to make trustworthy
predictions usable by investors. To do so, we developed a methodology spanning from data
collection to the analysis of predictions. This work was therefore based on two main
concepts: text classification and regression.

Unfortunately, the results are inconclusive. We could not statistically validate any of
our prediction models, and even the heuristic approach led to low accuracies and F1-
measures. We can hypothesize several reasons for this.

First of all, even though we considered the second most active base of Twitter users
with an English stock, this social network is still used far less by Europeans than by
Americans. This implies that our collection of tweets might not be representative enough of
the market. Even though we took care to gather only tweets related to finance, thanks to the
$ symbol used in tweets, investors might not check Twitter for European stocks such as the
FTSE100. Since the European Twitter scene is smaller than the American one, even investors
who are Twitter users probably do not check it as much as Americans, because they do not
regard Twitter as a very reliable source for financial matters.

Then, we presented many articles that are very enthusiastic about how Twitter
sentiment could be used for predictions. Despite using valid models, the accuracies reported
were all for American stocks, which are highly volatile in nature. The FTSE100 is not as
volatile, and its weaker variations should be harder to explain, especially by analyzing tweets.
Tweets are supposed to be a representation of the sentiment, but even if we could summarize
the true sentiment about the FTSE100, it is only a part of what influences the index. A higher
volatility is symptomatic of a stock easily swayed by events and is expected to show clearer
correlations, since the moves are more salient.

Another of our limitations is the text classification of our data. Indeed, we only used a
rule-based approach to label each tweet with a positive, negative or neutral value, and the
validation of this technique was based on a random sample of 200 opinionated tweets which
were manually annotated. Against several thousand entries, our sample size was rather small.
The true accuracy could very well be lower than our current estimation.
Additionally, since we excluded neutral tweets from the validation step, the program
might have overestimated the number of non-opinionated tweets. One possible problem is the
inadequacy of our lexicon in the financial context.

Another important point is that we made many assumptions, such as the bag-of-words
representation and polarity inverters, whereas automated text classification is a very complex
problem in reality. Even if our program perfectly understood the sentiment, with advanced
techniques incorporating grammar, syntax and semantics, it would not necessarily result in an
accurate signal. This is because we overlook many effects: investors might interpret the same
tweet differently, and some users might have a bigger influence based on their reputation or
internal Twitter features. Some tweets could also be ironic or irrelevant, and such features are
hard to identify, even for a human.

Moreover, the classes chosen might not be the most useful for predictions. The goal of
including a neutral class was to avoid giving too much importance to either the positive or the
negative class. However, despite their lack of opinion, neutral tweets might still carry
underlying information about the market. This paper focuses on Twitter sentiment, but a
different classification could be more efficient. As a simplified example, we could have
classified tweets that only refer to transactions and assigned each of them to either "buy" or
"sell".

As just implied, there is no particular evidence that a positive or negative sentiment is
the best attribute to give to a tweet when aiming to predict stock data. This is not the only
assumption: based on previous works in different contexts, we tried to generalize the idea
that tweets do have an impact on stock markets. Yet even if this has been demonstrated by
previous authors on American stocks, it might not hold true in other contexts, because of the
different characteristics of the data considered.

Our regression modeling was also quite simplistic, with only linear and logistic
regressions. Stock market movements have many different root causes, and the link between
them might be more complex than what these kinds of models suggest. Once a link is
established formally, it might be better to use more advanced models such as neural networks
or other market-driven models from the econometrics and financial literature.

Last of the limitations: despite the time span being relatively long, it is still fixed in
time, and the results might differ from one year to the next. In our case, this is notably
related to Twitter's expansion and market penetration. However, current events also play an
important role: years with unexpected crises, such as the economic crisis of 2008, or smaller
events have an impact on both stocks' volatility and Twitter activity. Agitated periods are
expected to create a better correlation between Twitter sentiment and stock market
movements than calm periods.

Yet this last reasoning is a double-edged sword: even if the data found during
periods of crisis are well correlated, the models derived cannot be used on a daily basis.
Ideally, the model should adapt itself to changes identified on Twitter; it would thus work
differently over time because of Twitter's growth and identified events. If no correlation can
be found formally, predictions should not be reported simply because a heuristic approach
yielded good results over a particular period of time.

In conclusion, we did not find any evidence of an impact of Twitter sentiment on
stock market movements, for either the FTSE100 index or the BP stock. These stocks are
related to the LSE, but as future work, other European indices should be tested. Tweet
classification should also be made more elaborate, in order to improve accuracy overall and
to get closer to the real causes of stock market movements. Finally, other advanced financial
models should be investigated as better tools to help traders make decisions.

APPENDICES

Figure A1. Absolute values of logarithm of returns in function of number of tweets for
FTSE100.

Figure A2. Returns in function of the returns the day before.


Figure A3. Returns in function of the number of tweets the day before.

Figure A4. BP Returns in function of the number of tweets the day before.

Figure A5. BP Returns in function of the number of the BP returns the day before.

Terms and acronyms

API Application Programming Interface

DJIA Dow Jones Industrial Average

FN False Negative

FP False Positive

LIWC Linguistic Inquiry and Word Count

MLE Maximum Likelihood Estimation

mNB multinomial Naive-Bayes

OLS Ordinary Least Squares

OR Odds Ratio

PI Polarity Inverters

S&P100 Standard & Poor 100

S&P500 Standard & Poor 500

sd standard deviation

SOFNN Self-Organizing Fuzzy Neural Network

SVM Support Vector Machine

tf-idf term frequency-inverse document frequency

tm text mining

TN True Negative

TP True Positive

VAR Vector Autoregression

References
• Antweiler, W., & Frank, M. Z. (2004). Is all that talk just noise? The information content of
internet stock message boards. The Journal of Finance, 59(3), 1259-1294.

• Bar-Haim, R., Dinur, E., Feldman, R., Fresko, M., & Goldstein, G. (2011). Identifying and
following expert investors in stock microblogs. In Proceedings of the Conference on
Empirical Methods in Natural Language Processing (pp. 1310-1319). Association for
Computational Linguistics.

• Bollen, J., Counts, S. & Mao, H. (2011). Predicting financial markets: Comparing survey,
news, twitter and search engine data. arXiv preprint arXiv:1112.1051.

• Bollen, J., Mao, H., & Pepe, A. (2011). Modeling public mood and emotion: Twitter
sentiment and socio-economic phenomena. ICWSM, 11, 450-453.

• Bollen, J., Mao, H., & Zeng, X. (2011). Twitter mood predicts the stock market. Journal of
Computational Science, 2(1), 1-8.

• Breusch, T. S., & Pagan, A. R. (1979). A simple test for heteroscedasticity and random
coefficient variation. Econometrica: Journal of the Econometric Society, 1287-1294.

• Brown, E. D. (2012). Will twitter make you a better investor? a look at sentiment, user
reputation and their effect on the stock market. Proc. of SAIS.

• Bullard, C. G., Crossing, E. E. & Dunphy, D. C. (1974). Validation of the general inquirer
Harvard IV dictionary. Harvard University Library.

• Castillo, C., Gionis, A., Hristidis, V., Jaimes, A. & Ruiz, E. J. (2012). Correlating financial
time series with micro-blogging activity. In Proceedings of the fifth ACM international
conference on Web search and data mining (pp. 513-522). ACM.

• Chalothorn, T., Ellman, J. (2014). TJP: Identifying the Polarity of Tweets from Context.
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014),
657–662.

• Cheng, J., Hu, M. & Liu, B. (2005). Opinion observer: analyzing and comparing opinions on
the web. In Proceedings of the 14th international conference on World Wide Web (pp. 342-
351). ACM.

• Chowdury, A., Jansen, B. J., Sobel, K. & Zhang, M. (2009). Twitter power: Tweets as
electronic word of mouth. Journal of the American society for information science and
technology, 60(11), 2169-2188.

• Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine learning, 20(3), 273-
297.

• Deng, X., Li, H., Li, Q., Liu, B., Mukherjee, A. & Si, J. (2013). Exploiting Topic based
Twitter Sentiment for Stock Prediction. ACL (2), 2013, 24-29.

• Forman, G. (2003). An extensive empirical study of feature selection metrics for text
classification. The Journal of machine learning research, 3, 1289-1305.

• Fuehres, H., Gloor, P. A. & Zhang, X. (2011). Predicting stock market indicators through
twitter “I hope it is not as bad as I fear”. Procedia-Social and Behavioral Sciences, 26, 55-62.

• Gao, X. Z., Huang, X., Lin, H. & Song, Z. (2007). A Self-organizing Fuzzy Neural
Networks. In Soft Computing in Industrial Applications (pp. 200-210). Springer Berlin
Heidelberg.

• Grčar, M., Lavrač, N., Smailović, J. & Žnidaršič, M. (2013). Predictive sentiment analysis
of tweets: A stock market application. In Human-Computer Interaction and Knowledge
Discovery in Complex, Unstructured, Big Data (pp. 77-88). Springer Berlin Heidelberg.

• Hiemstra, C., & Jones, J. D. (1994). Testing for linear and nonlinear Granger causality in the
stock price-volume relation. The Journal of Finance, 49(5), 1639-1664.

• Hu, M., & Liu, B. (2004). Mining and summarizing customer reviews. In Proceedings of the
tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp.
168-177). ACM.

• Joachims, T. (1998). Text categorization with support vector machines: Learning with many
relevant features (pp. 137-142). Springer Berlin Heidelberg.

• Joachims, T. (1999). Making large-scale support vector machine learning practical, 1999.
Advances in Kernel Methods: Support Vector Machines [9].

• Kivinen, J., & Warmuth, M. K. (1995). The perceptron algorithm vs. winnow: linear vs.
logarithmic mistake bounds when few input variables are relevant. In Proceedings of the
eighth annual conference on Computational learning theory (pp. 289-296). ACM.

• Krauss, J., Nann, S. & Schoder, D. (2013). Predictive Analytics On Public Data-The Case
Of Stock Markets. In ECIS (p. 102).

• Kwak, H., Lee, C., Moon, S. & Park, H. (2010). What is Twitter, a social network or a news
media?. In Proceedings of the 19th international conference on World wide web (pp. 591-
600). ACM.

• Lee, L., Pang, B. & Vaithyanathan, S. (2002). Thumbs up?: sentiment classification using
machine learning techniques. In Proceedings of the ACL-02 conference on Empirical methods
in natural language processing-Volume 10 (pp. 79-86). Association for Computational
Linguistics.

• Liu, B. (2010). Sentiment Analysis and Subjectivity. Handbook of natural language
processing, 2, 627-666.

• Liu, B. (2012). Sentiment analysis and opinion mining. Synthesis lectures on human
language technologies, 5(1), 1-167.

• Liu, B., Mao, Y., Wang, B. & Wei, W. (2012). Correlating S&P 500 stocks with Twitter
data. In Proceedings of the first ACM international workshop on hot topics on
interdisciplinary social networks research (pp. 69-72). ACM.

• Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval
(Vol. 1, No. 1, p. 496). Cambridge: Cambridge university press.

• Pak, A., & Paroubek, P. (2010). Twitter as a Corpus for Sentiment Analysis and Opinion
Mining. In LREc (Vol. 10, pp. 1320-1326).

• Rao, T., & Srivastava, S. (2012). Twitter sentiment analysis: How to hedge your bets in the
stock markets. In State of the Art Applications of Social Network Analysis (pp. 227-247).
Springer International Publishing.

• Sandner, P. G., Sprenger, T. O., Tumasjan, A. & Welpe, I. M. (2014). Tweets and trades:
The information content of stock microblogs. European Financial Management, 20(5), 926-
957.

• Sandner, P. G., Sprenger, T. O., Tumasjan, A. & Welpe, I. M. (2010). Predicting elections
with twitter: What 140 characters reveal about political sentiment. ICWSM, 10, 178-185.

• Sebastiani, F. (2002). Machine learning in automated text categorization. ACM computing
surveys (CSUR), 34(1), 1-47.

• Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete
samples). Biometrika, 52(3/4), 591-611.

• Zhang, H. (2004). The optimality of naive Bayes. AA, 1(2), 3.

Executive summary

Evaluating the impact of tweets on the stock market is a major challenge which has
been addressed several times for American indices. The problem calls for
multidisciplinary techniques, such as data mining, text mining and statistical methods,
in order to establish a link between tweet sentiment and stock market movements. Yet
the literature has mainly overlooked less volatile markets. In this paper, we focus on the
London stock market at two levels: the FTSE100 index and one of its individual stocks.
We explain how we collect data and use sentiment analysis to automatically annotate
tweets with a positive, negative or neutral sentiment. Then we perform statistical tests to
ensure we can safely apply regression techniques. Eventually, we try to build accurate
predictions of stock market movements with various Twitter variables, such as the
number of tweets per day or the overall sentiment. The results at both the index and the
company stock level are inconclusive, because of low accuracies and F1-measures.
