Bachelor Thesis
(i) the thesis comprises only my original work toward the Bachelor Degree
(ii) due acknowledgement has been made in the text to all other material used
Ahmed Soliman
XX July, 20XX
Acknowledgments
Text
Abstract
Abstract
Contents
Acknowledgments
1 Introduction
1.1 Motivation
1.2 Aim
1.3 Outline
2 Background
3 Location Detection Approaches
4 Conclusion
5 Future Work
Appendix
A Lists
List of Abbreviations
List of Figures
References
Chapter 1
Introduction
1.1 Motivation
Micro-blogging services such as Twitter, Facebook and Tumblr have grown rapidly in
recent years. As of March 2013, 400 million tweets were being posted every day [1].
This has prompted considerable research effort to mine this data and use it in various
applications, such as event detection [Sakaki et al. 2010; Agarwal et al. 2012] and news
recommendation [Phelan et al. 2009]. Many applications could make use of information
about users' locations, but unfortunately this information is very sparse. The research
firm Sysomos studied Twitter usage between mid-October and mid-December 2009 and
found that only 0.23% of tweets in that period were geo-tagged, which indicates how
sparse this information is. Although blogging services allow users to specify their
location in their profiles, the profile location field is not reliable. Cheng et al. found
that only 26% of a random sample of over 1 million Twitter users revealed their
city-level location in their profiles, and only 0.42% of the tweets in this dataset were
geo-tagged [Cheng et al. 2010]. Moreover, these profile locations are not always valid:
only 42% of Twitter users in a random dataset reported a valid city-level location in
their profiles [Hecht et al. 2011].
1.2 Aim
In this paper, approaches for predicting users' locations are discussed in order to
overcome the location sparseness problem mentioned above. These approaches are based
purely on tweet content and tweeting behaviour, in the absence of any other location
information. The goal is to develop approaches that can predict the location of a tweet.
The key step towards this goal is to predict the home location of the user, as the home
location can give important clues to the possible actual location of the tweet. The
intuition is that the content of a tweet may contain words, entity names or phrases that
are more likely to be used in particular places than in others, which could give
indicators of the actual location. Approaches that can predict the possible locations of
a tweet will be very beneficial in tracking applications such as news verification, in
which we want to distinguish tweets reported by users who are likely to be at the actual
location of an event from tweets reported by users who are likely to be far away.
1.3 Outline
The remainder of this paper covers related work, the data set, a formalization of the
location prediction problem, location classification approaches, and an evaluation of
the discussed algorithms and approaches. It closes with the conclusion and a discussion
of future work.
Chapter 2
Background
Background
Chapter 3
Location Detection Approaches
In this chapter we introduce and describe several approaches for location detection over
social media.
We created this classifier for city-level location, for which we have ground truth. Each
user in our training dataset corresponds to a training example, where the features are
extracted from the content of the user's tweets and the corresponding output is the
geolocation provided with those tweets. The number of classes in the trained model
equals the total number of locations (cities) in our training dataset.
First, we tokenize all tweets in our training dataset and filter them by removing URLs,
mentions and hashtags; then we remove every word that is identified as a stop word. Stop
words are defined by the word list in the nltk stopwords corpus. Once the stop words are
removed, we perform lemmatization, which reduces the inflected forms of a word to a
common base form. Once the tokens have been extracted, we apply a simple heuristic
algorithm called CALGARI[?]. This algorithm is based on the intuition that a model will
perform better if it is trained on terms that are more likely to be used by users from
particular regions than by users from the general population. The algorithm defines a
score for each term, which indicates how much more likely the term is to occur in one
particular location than in our dataset as a whole. The score is calculated as follows.
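As a brief aside before the score itself is defined, the filtering steps described above
could be sketched in Python roughly as follows. This is a minimal sketch under stated
assumptions, not the thesis implementation: the names `preprocess`, `STOP_WORDS`, `NOISE`
and `TOKEN` are hypothetical, the small stop-word set is a stand-in for the nltk
stopwords corpus, and the default `lemmatize` argument is an identity placeholder where a
real lemmatizer (for example nltk's `WordNetLemmatizer().lemmatize`) would be passed in.

```python
import re

# Stand-in stop-word list (assumption); the text uses the nltk stopwords corpus.
STOP_WORDS = {"a", "an", "and", "at", "i", "in", "is", "of", "on", "the", "to"}

# URLs, @mentions and #hashtags are removed before tokenizing.
NOISE = re.compile(r"https?://\S+|www\.\S+|[@#]\w+")
# A token is a run of letters, optionally with one internal apostrophe.
TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def preprocess(tweet, stop_words=STOP_WORDS, lemmatize=lambda word: word):
    """Return the filtered tokens of one tweet.

    `lemmatize` defaults to the identity function (assumption); pass a real
    lemmatizer to reduce word forms to a common base form.
    """
    text = NOISE.sub(" ", tweet.lower())
    tokens = TOKEN.findall(text)
    # Stop words are dropped first, then the remaining tokens are lemmatized.
    return [lemmatize(token) for token in tokens if token not in stop_words]
```

For example, `preprocess("Stuck in traffic at the Nile bridge @ahmed #cairo http://t.co/abc")`
keeps only the content words `stuck`, `traffic`, `nile` and `bridge`.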
Let $s(T)$ be a function that takes a term $T$ and calculates its score, $F(T)$ be the
frequency of the term $T$ in our dataset, $n(T, c)$ be a function that counts how many
times the term $T$ is used with class $c$, $N$ be the total number of different terms in
our dataset, and $C$ be the set of classes (locations) in our dataset. We need to
evaluate this equation for each term:
$$s(T) = \frac{\max_{c \in C} P(T \mid c)}{P(T)}$$
The term $P(T) = \frac{F(T)}{N}$, so we need to know how to evaluate the numerator:
$$\max_{c \in C} P(T \mid c) = \max_{i} \frac{n(T, c_i)}{\sum_{j} n(t_j, c_i)}$$
After calculating a score for each term, the algorithm sorts the terms by this score in
non-increasing order and chooses the top 10,000 terms as features for our model.
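To make the scoring and selection step concrete, here is a minimal Python sketch. It
assumes each training example is a `(tokens, city)` pair and follows the definitions
given above, including dividing the term frequency by the number of distinct terms;
because that divisor is the same for every term, it does not affect the ranking. The
function names `calgari_scores` and `select_features` are illustrative and are not taken
from the CALGARI authors' code.

```python
from collections import Counter, defaultdict


def calgari_scores(examples):
    """Score every term in `examples`, an iterable of (tokens, city) pairs."""
    per_class = defaultdict(Counter)  # how often each term occurs in each city
    overall = Counter()               # frequency of each term in the whole dataset
    for tokens, city in examples:
        per_class[city].update(tokens)
        overall.update(tokens)

    distinct_terms = len(overall)     # total number of different terms
    class_totals = {city: sum(counts.values()) for city, counts in per_class.items()}

    scores = {}
    for term, frequency in overall.items():
        p_term = frequency / distinct_terms
        # Largest per-city probability of the term: the numerator of the score.
        p_max = max(per_class[city][term] / class_totals[city] for city in per_class)
        scores[term] = p_max / p_term
    return scores


def select_features(scores, k=10_000):
    """Return the k highest-scoring terms, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A term that appears in only one city receives a higher score than a term spread evenly
across cities, which is the behaviour the selection step relies on.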
Once the features (the terms chosen in the previous step) are extracted, we build a
probabilistic classifier based on the Multinomial Naive Bayes algorithm from the
scikit-learn library, which assumes conditional independence of the features.
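A classifier of this kind could be assembled with scikit-learn roughly as shown below.
This is a sketch under assumptions rather than the exact setup used here: the helper
`train_city_classifier` is hypothetical, each user is represented by one space-joined
string of preprocessed tokens, and `CountVectorizer` is restricted to the selected terms
through its `vocabulary` parameter; `MultinomialNB` is left at its default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def train_city_classifier(user_docs, cities, selected_terms):
    """Fit a Multinomial Naive Bayes model on term counts.

    user_docs: one string of space-joined tokens per user (assumption).
    cities: the city label of each user, in the same order.
    selected_terms: the feature terms kept by the selection step (no duplicates).
    """
    model = make_pipeline(
        # Count only the selected terms; all other tokens are ignored.
        # Note: the default tokenizer skips single-character tokens.
        CountVectorizer(vocabulary=list(selected_terms)),
        MultinomialNB(),
    )
    model.fit(user_docs, cities)
    return model
```

Calling `model.predict([...])` on new token strings then returns one predicted city per
input string.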
Chapter 4
Conclusion
Conclusion
Chapter 5
Future Work
Text
Appendix
Appendix A
Lists
List of Figures
Bibliography
[1] W.G. Campbell. Form and style in thesis writing. Houghton Mifflin, 1954.
[2] S. Wenkang. An analysis of the current state of English majors' BA thesis writing. Foreign Language World, 3, 2004.