
DMML – Arizona State University
Guilherme Peixoto
Summary – Week 1
The goal was to perform a simple sentiment analysis classification task.
 
1. Data Collection
1.1 Data collected from multiple sources:
- http://help.sentiment140.com/for-students/ provides a large labeled
  Twitter dataset for training, containing roughly 1.6M tweets.
  - Data format:
    "4","2193574852","Tue Jun 16 08:38:37 PDT 2009","NO_QUERY","marshymiffy","Loved the USA hockey team "
    - 4: positive polarity
    - 0: negative polarity
    - Roughly 1:1 positive-to-negative ratio
- Data queried from TweetTracker's DB
  - Roughly 100k instances
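The six-field CSV line shown above can be parsed with Python's standard csv module. A minimal sketch; the field names are assumptions based on the layout shown, not documented names:

```python
import csv
from io import StringIO

# One raw line in the Sentiment140 CSV format shown above.
raw = '"4","2193574852","Tue Jun 16 08:38:37 PDT 2009","NO_QUERY","marshymiffy","Loved the USA hockey team "'

polarity, tweet_id, date, query, user, text = next(csv.reader(StringIO(raw)))

# Map the 0/4 polarity codes to labels.
label = {"0": "negative", "4": "positive"}[polarity]
print(label, text)
```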
 
2. Data Processing
2.1 – Duplicate removal
- Especially in TweetTracker's dataset, many tweets are duplicates that
  differ only slightly (e.g., a retweet marker or a different hyperlink)
- Reduced the 100k collected tweets to ~66k distinct instances
2.2 – Text cleansing
- Removal of stop words
- Removal of punctuation
- Removal of mentions ("@user") and hyperlinks
- Stemming of remaining words
 
2.3 – Preparation for the model's input
- Data needs to be prepared for the algorithm's use
- Model of choice: bag of words with bigrams
  - Text goes from:
    emilys parents are taking her to Cambodia as a treat x emilys parents baffle me
  - To:
    {u'Cambodia treat': 1, u'x emili': 1, u'parent': 1, u'parent baffl': 1,
     u'Cambodia': 1, u'emili': 1, u'take Cambodia': 1, u'baffl': 1,
     u'treat': 1, u'x': 1, u'emili parent': 1, u'treat x': 1,
     u'parent take': 1, u'take': 1}
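A feature dict like the one above can be built from the cleansed tokens by counting unigrams and adjacent bigrams together. A sketch of the idea, not the report's exact implementation:

```python
from collections import Counter

def bag_of_words_bigrams(tokens):
    """Unigram + bigram counts, mirroring the feature dict shown above."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return dict(Counter(tokens + bigrams))

# Cleansed/stemmed tokens of the example tweet above.
tokens = ["emili", "parent", "take", "cambodia", "treat", "x", "emili", "parent", "baffl"]
features = bag_of_words_bigrams(tokens)
print(features["emili"], features["emili parent"])  # → 2 2
```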

3. Algorithm/Framework Selection
3.1 – Algorithm (starter): Naïve Bayes
- Good baseline to measure accuracy against
- Easy to use and implement

3.2 – Framework selection
- Open-source libraries for Python
  - NLTK and Scikit-Learn
    - Naïve Bayes implementation
    - Stop-words dictionary
    - Stemming algorithm (Porter stemmer)
  - PyMongo
    - Python driver for MongoDB
  - NumPy
    - Numerical processing
  - cPickle
    - Easy object serialization to disk
    - No need to deal with file/parsing/IO complications
- IPython's interactive terminal
 
4. Training
4.1 Choosing training data
- From Sentiment140's database:
  - The full bag-of-words model for all 1.6M tweets uses over 500MB of RAM
  - Extremely slow
  - No need to use all 1.6M tweets – the model gets too complex
  - → Used only one tenth of the data
    - Of the remaining 160k tweets, the model was trained on 70% and
      tested on the other 30%
    - Accuracy: 75%
    - "A study from the University of Pittsburgh shows that humans can
      only agree on whether or not a sentence has the correct sentiment
      80% of the time" [taken from https://semantria.com/sentiment-analysis]
    - 75% is good enough for a simple, straightforward model, but it can
      be improved
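The report trains Naïve Bayes through NLTK/Scikit-Learn. As a self-contained illustration of the computation those libraries perform, here is a hand-rolled multinomial Naïve Bayes with Laplace smoothing on toy data (the tiny training set is invented for the sketch):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (token_list, label). Returns counts needed for prediction."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(model, tokens):
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label, c in class_counts.items():
        lp = math.log(c / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokens:
            # Laplace (add-one) smoothing over the vocabulary
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

train = [
    (["love", "great", "team"], "pos"),
    (["happy", "love", "win"], "pos"),
    (["hate", "awful", "loss"], "neg"),
    (["sad", "hate", "fire"], "neg"),
]
model = train_nb(train)
print(predict(model, ["love", "win"]))  # → pos
```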
 
5. Testing
- Testing phase over TweetTracker's actual data
- 1/10th of Sentiment140's data (160k tweets) was used as training data
- The ~66k tweets collected from TweetTracker were labeled
  - 62% of the tweets were labeled with negative polarity (-1)
  - Therefore 38% with positive polarity (+1)
    - This seems reasonable, since all of the ~66k tweets fell into one
      of the following categories:
      u'Ukrainian Protests', u'Myanmar', u'Humanity Road',
      u'Suicide Bombings', u'Yosemite Wildfire', u'Cambodia',
      u'Libya 2014', u'Philippine Manila Flooding(Archived)', u'Test',
      u'Syria', u'Public Safety Event', u'Silver Fire'
    - (don't expect many positive tweets…)

6. Problems
- A lot of tweets are "report"-like but, due to the high polarity of
  their words, are labeled as negative when they should be labeled
  neutral
  - E.g.: (Myanmar Sentences Muslim Man to 7 Years in Prison via
    @GlobeNewsFeed) → -1
    - "Sentences", "prison", and "Myanmar" have negative polarity
    - The tweet should be labeled neutral
    - Solution proposal → add a third category, "neutral", instead of
      treating this as a binary problem
      - Pick positive/negative polarity only if the probability of the
        label is above a certain threshold; otherwise label as neutral
n Different  languages
o Algorithm  doesn’t  perform  as  well  for  different  languages
§ No  stopwords  filtering,  no  stemming…
§ Labeling  is  random
§ NLTK  does  not  support  Unicode,  therefore  most  characters  in  
languages  that  do  not  use  roman  alphabet  are  simply  ignored
§ Solution  proposal  à  translation  to  English  plaintext  real  time  
before  performing  classification  task
n Naïve  Bayes  accuracy
o Too  simple,  too  straightforward
o Simply  a  baseline  to  guide  from
o Try  different  algorithms
§ KNN  can’t  be  used  due  to  its  slow  nature  and  it  would  be  
unfeasible  to  deploy  a  system  using  KNN  with  a  large  training  
dataset
§  Support  Vector  Machines,  Naïve  Bayes  based  on  
Multinomial/Gaussian  distribution,  Maximum  Entropy
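The proposed "neutral" category could be implemented by thresholding the classifier's posterior probability. A minimal sketch; the 0.7 threshold is an assumed value, not one from the report:

```python
def classify_with_neutral(prob_positive, threshold=0.7):
    """Map a posterior P(positive) to {+1, 0, -1}; threshold is tunable."""
    if prob_positive >= threshold:
        return +1  # confidently positive
    if prob_positive <= 1 - threshold:
        return -1  # confidently negative
    return 0       # too close to call: neutral

print(classify_with_neutral(0.92), classify_with_neutral(0.55), classify_with_neutral(0.10))
# → 1 0 -1
```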
