3. Algorithm/Framework Selection
3.1 – Algorithm (Starter): Naïve Bayes
  ■ Good baseline to measure accuracy
  ■ Ease of use/implementation
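To illustrate why Naïve Bayes is easy to implement, here is a minimal multinomial Naïve Bayes sketch in plain Python with add-one smoothing. It is not the NLTK or scikit-learn implementation used in the project; the function names (`train_nb`, `classify_nb`) and the toy tweets are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """Count label and word frequencies from (tokens, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def classify_nb(tokens, model):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data; +1 = positive, -1 = negative.
toy_train = [
    ("i love this sunny day".split(), 1),
    ("what a great happy morning".split(), 1),
    ("i hate this awful traffic".split(), -1),
    ("such a sad terrible day".split(), -1),
]
model = train_nb(toy_train)
prediction = classify_nb("happy great day".split(), model)
```

The whole classifier is two short functions over word counts, which is what makes it a convenient starting baseline.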
3.2 – Framework Selection
  ■ Open-source libraries for Python
    o NLTK and scikit-learn
      ▪ Naïve Bayes implementation
      ▪ Stopwords dictionary
      ▪ Stemmer algorithm (Porter)
    o PyMongo
      ▪ Python driver for MongoDB
    o NumPy
      ▪ Numerical processing
    o cPickle
      ▪ Easy object storage on disk
      ▪ No need to deal with files/parsing/I/O complications
  ■ IPython’s interactive terminal
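The "easy object storage" point can be sketched as follows. cPickle is the Python 2 module; in Python 3 the same functionality is provided by the standard `pickle` module, which this sketch assumes. The `model` dictionary and the file name `model.pkl` are hypothetical stand-ins for a trained classifier and its storage location.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained classifier object.
model = {"labels": [-1, 1], "word_counts": {"happy": 3, "sad": 5}}

# Write the object to disk in one call; no custom file format or parser needed.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as fh:
    pickle.dump(model, fh, protocol=pickle.HIGHEST_PROTOCOL)

# Load it back later, for example in a separate classification script.
with open(path, "rb") as fh:
    restored = pickle.load(fh)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.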
4. Training
4.1 Choosing training data
  ■ From Sentiment140’s database:
    o The full bag-of-words model for all 1.6M tweets uses over 500 MB of RAM
    o Training on it is extremely slow
    o There is no need to use all 1.6M tweets; the model becomes too complex
    o → Only one tenth of the data was used
      ▪ Of the remaining 160k tweets, the model was first trained on 70% and tested on the other 30%
      ▪ Accuracy: 75%
      ▪ “A study from the University of Pittsburgh shows that humans can only agree on whether or not a sentence has the correct sentiment, 80% of the time” [taken from https://semantria.com/sentiment-analysis]
      ▪ 75% is good enough for a simple, straightforward model, but it can be improved
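The 70%/30% evaluation described above can be sketched like this. The helper names (`split_train_test`, `accuracy`), the fixed random seed, and the placeholder data are assumptions for illustration; the original work may have split the data differently.

```python
import random


def split_train_test(examples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the examples and split it into train and test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Hypothetical placeholder data: (tweet_id, label) pairs.
examples = [(i, 1 if i % 2 else -1) for i in range(1000)]
train, test = split_train_test(examples)
```

Shuffling before the cut matters if the source file is sorted by label; otherwise the test portion could contain only one class.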
5. Testing
  ■ Testing phase over TweetTracker’s actual data
  ■ One tenth of Sentiment140’s data (160k tweets) was used as training data
  ■ ~66k tweets collected from TweetTracker were labeled
    o 62% of the tweets were labeled with negative polarity (-1)
    o The remaining 38% were labeled with positive polarity (+1)
      ▪ This seems reasonable, since all of the ~66k tweets were in one of the following categories:
        • u'Ukrainian Protests', u'Myanmar', u'Humanity Road', u'Suicide Bombings', u'Yosemite Wildfire', u'Cambodia', u'Libya 2014', u'Philippine Manila Flooding(Archived)', u'Test', u'Syria', u'Public Safety Event', u'Silver Fire'
        • (don’t expect many positive tweets…)
6. Problems
  ■ Many tweets are “report”-like but, because of the strong polarity of their words, are labeled as negative when they should be labeled as “neutral”
    o E.g.: (Myanmar Sentences Muslim Man to 7 Years in Prison via @GlobeNewsFeed) -1
      ▪ “Sentences”, “prison”, and “Myanmar” have a negative polarity
      ▪ The tweet should be labeled neutral
      ▪ Proposed solution → add a third category, “neutral”, instead of treating this as a binary problem
        • Pick positive or negative polarity only if the probability of the label association is above a certain threshold; otherwise label as neutral
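The proposed threshold rule could look like the sketch below. The function name `label_with_neutral` and the default threshold of 0.8 are assumptions; the source only says "a certain threshold". The input is assumed to be the classifier's estimated probability that a tweet is positive.

```python
def label_with_neutral(prob_positive, threshold=0.8):
    """Map P(positive) to +1, -1, or 0 (neutral) using a confidence threshold."""
    if not 0.0 <= prob_positive <= 1.0:
        raise ValueError("prob_positive must be a probability in [0, 1]")
    if prob_positive >= threshold:
        return 1   # confidently positive
    if 1.0 - prob_positive >= threshold:
        return -1  # confidently negative
    return 0       # not confident either way: neutral


# Hypothetical classifier outputs for three tweets.
labels = [label_with_neutral(p) for p in (0.95, 0.05, 0.55)]
```

The threshold should be above 0.5 so that the positive and negative regions do not overlap; its value would need tuning on labeled data.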
  ■ Different languages
    o The algorithm doesn’t perform as well for languages other than English
      ▪ No stopword filtering, no stemming…
      ▪ Labeling is effectively random
      ▪ NLTK does not support Unicode, so most characters in languages that do not use the Roman alphabet are simply ignored
      ▪ Proposed solution → translate to English plain text in real time before performing the classification task
  ■ Naïve Bayes accuracy
    o Too simple, too straightforward
    o Simply a baseline to work from
    o Try different algorithms
      ▪ KNN can’t be used: it is slow, and deploying a system that uses KNN with a large training dataset would be infeasible
      ▪ Support Vector Machines, Naïve Bayes based on Multinomial/Gaussian distributions, Maximum Entropy
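One way to compare the candidate algorithms is sketched below with scikit-learn, assuming it is installed. The toy tweets are hypothetical, logistic regression stands in for Maximum Entropy, and Gaussian Naïve Bayes is left out because it needs dense feature arrays. The scores on such a tiny sample mean nothing; the point is only the shape of the comparison loop.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy tweets; +1 = positive, -1 = negative.
train_texts = [
    "i love this sunny day",
    "what a great happy morning",
    "i hate this awful traffic",
    "such a sad terrible day",
]
train_labels = [1, 1, -1, -1]
test_texts = ["happy great day", "awful terrible traffic"]
test_labels = [1, -1]

candidates = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(),
    "Maximum Entropy (logistic regression)": LogisticRegression(max_iter=1000),
}

scores = {}
for name, clf in candidates.items():
    # Bag-of-words features followed by the classifier under test.
    pipeline = make_pipeline(CountVectorizer(), clf)
    pipeline.fit(train_texts, train_labels)
    scores[name] = pipeline.score(test_texts, test_labels)  # accuracy
```

Keeping the vectorizer inside each pipeline means every algorithm sees the same features, so differences in accuracy come from the classifier alone.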