
DMML – Arizona State University
Guilherme Peixoto
Summary – Week 1
The goal was to perform a simple sentiment analysis classification task.
 
1. Data Collection
1.1 Data collected from multiple sources:
- http://help.sentiment140.com/for-students/ provides a large labeled
  Twitter dataset for training, containing roughly 1.6M tweets.
  - Data format:
    "4","2193574852","Tue Jun 16 08:38:37 PDT 2009","NO_QUERY","marshymiffy","Loved the USA hockey team "
    - 4: positive polarity
    - 0: negative polarity
    - Roughly 1:1 positive-to-negative ratio
- Data queried from TweetTracker's DB
  - Roughly 100k instances
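The six-field CSV line shown above can be parsed with Python's standard csv module. A minimal sketch; the field names are assumptions based on the layout shown, not documented names:

```python
import csv
from io import StringIO

# One raw line in the Sentiment140 CSV format shown above.
raw = '"4","2193574852","Tue Jun 16 08:38:37 PDT 2009","NO_QUERY","marshymiffy","Loved the USA hockey team "'

polarity, tweet_id, date, query, user, text = next(csv.reader(StringIO(raw)))

# Map the 0/4 polarity codes to labels.
label = {"0": "negative", "4": "positive"}[polarity]
print(label, text)
```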
 
2. Data Processing
2.1 – Duplicate removal
- Especially in TweetTracker's dataset, many tweets are duplicates that
  differ only slightly (e.g., a retweet marker or a different hyperlink)
- Reduced the 100k collected tweets to ~66k distinct instances
2.2 – Text cleansing
- Removal of stop words
- Removal of punctuation
- Removal of mentions ("@user") and hyperlinks
- Stemming of remaining words
 
2.3 – Preparation for the model's input
- Data needs to be prepared for the algorithm's use
- Model of choice: bag of words with bigrams
  - Text goes from:
    emilys parents are taking her to Cambodia as a treat x emilys parents baffle me
  - To:
    {u'Cambodia treat': 1, u'x emili': 1, u'parent': 1, u'parent baffl': 1,
     u'Cambodia': 1, u'emili': 1, u'take Cambodia': 1, u'baffl': 1,
     u'treat': 1, u'x': 1, u'emili parent': 1, u'treat x': 1,
     u'parent take': 1, u'take': 1}
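A feature dict like the one above can be built from the cleansed tokens by counting unigrams and adjacent bigrams together. A sketch of the idea, not the report's exact implementation:

```python
from collections import Counter

def bag_of_words_bigrams(tokens):
    """Unigram + bigram counts, mirroring the feature dict shown above."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return dict(Counter(tokens + bigrams))

# Cleansed/stemmed tokens of the example tweet above.
tokens = ["emili", "parent", "take", "cambodia", "treat", "x", "emili", "parent", "baffl"]
features = bag_of_words_bigrams(tokens)
print(features["emili"], features["emili parent"])  # → 2 2
```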

3. Algorithm/Framework Selection
3.1 – Algorithm (starter): Naïve Bayes
- Good baseline to measure accuracy against
- Easy to use and implement

3.2 – Framework selection
- Open-source libraries for Python
  - NLTK and Scikit-Learn
    - Naïve Bayes implementation
    - Stop-words dictionary
    - Stemming algorithm (Porter stemmer)
  - PyMongo
    - Python driver for MongoDB
  - NumPy
    - Numerical processing
  - cPickle
    - Easy object serialization to disk
    - No need to deal with file/parsing/IO complications
- IPython's interactive terminal
 
4. Training
4.1 Choosing training data
- From Sentiment140's database:
  - The full bag-of-words model for all 1.6M tweets uses over 500MB of RAM
  - Extremely slow
  - No need to use all 1.6M tweets – the model gets too complex
  - → Used only one tenth of the data
    - Of the remaining 160k tweets, the model was trained on 70% and
      tested on the other 30%
    - Accuracy: 75%
    - "A study from the University of Pittsburgh shows that humans can
      only agree on whether or not a sentence has the correct sentiment
      80% of the time" [taken from https://semantria.com/sentiment-analysis]
    - 75% is good enough for a simple, straightforward model, but it can
      be improved
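The report trains Naïve Bayes through NLTK/Scikit-Learn. As a self-contained illustration of the computation those libraries perform, here is a hand-rolled multinomial Naïve Bayes with Laplace smoothing on toy data (the tiny training set is invented for the sketch):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (token_list, label). Returns counts needed for prediction."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(model, tokens):
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label, c in class_counts.items():
        lp = math.log(c / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in tokens:
            # Laplace (add-one) smoothing over the vocabulary
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

train = [
    (["love", "great", "team"], "pos"),
    (["happy", "love", "win"], "pos"),
    (["hate", "awful", "loss"], "neg"),
    (["sad", "hate", "fire"], "neg"),
]
model = train_nb(train)
print(predict(model, ["love", "win"]))  # → pos
```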
 
5. Testing
- Testing phase over TweetTracker's actual data
- 1/10th of Sentiment140's data (160k tweets) was used as training data
- The ~66k tweets collected from TweetTracker were labeled
  - 62% of the tweets were labeled with negative polarity (-1)
  - Therefore 38% with positive polarity (+1)
    - This seems reasonable, since all of the ~66k tweets fell into one
      of the following categories:
      u'Ukrainian Protests', u'Myanmar', u'Humanity Road',
      u'Suicide Bombings', u'Yosemite Wildfire', u'Cambodia',
      u'Libya 2014', u'Philippine Manila Flooding(Archived)', u'Test',
      u'Syria', u'Public Safety Event', u'Silver Fire'
    - (don't expect many positive tweets…)

6. Problems
- A lot of tweets are "report"-like but, due to the high polarity of
  their words, are labeled as negative when they should be labeled
  neutral
  - E.g.: (Myanmar Sentences Muslim Man to 7 Years in Prison via
    @GlobeNewsFeed) → -1
    - "Sentences", "prison", and "Myanmar" have negative polarity
    - The tweet should be labeled neutral
    - Solution proposal → add a third category, "neutral", instead of
      treating this as a binary problem
      - Pick positive/negative polarity only if the probability of the
        label is above a certain threshold; otherwise label as neutral
n Different  languages
o Algorithm  doesn’t  perform  as  well  for  different  languages
§ No  stopwords  filtering,  no  stemming…
§ Labeling  is  random
§ NLTK  does  not  support  Unicode,  therefore  most  characters  in  
languages  that  do  not  use  roman  alphabet  are  simply  ignored
§ Solution  proposal  à  translation  to  English  plaintext  real  time  
before  performing  classification  task
n Naïve  Bayes  accuracy
o Too  simple,  too  straightforward
o Simply  a  baseline  to  guide  from
o Try  different  algorithms
§ KNN  can’t  be  used  due  to  its  slow  nature  and  it  would  be  
unfeasible  to  deploy  a  system  using  KNN  with  a  large  training  
dataset
§  Support  Vector  Machines,  Naïve  Bayes  based  on  
Multinomial/Gaussian  distribution,  Maximum  Entropy
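The proposed "neutral" category could be implemented by thresholding the classifier's posterior probability. A minimal sketch; the 0.7 threshold is an assumed value, not one from the report:

```python
def classify_with_neutral(prob_positive, threshold=0.7):
    """Map a posterior P(positive) to {+1, 0, -1}; threshold is tunable."""
    if prob_positive >= threshold:
        return +1  # confidently positive
    if prob_positive <= 1 - threshold:
        return -1  # confidently negative
    return 0       # too close to call: neutral

print(classify_with_neutral(0.92), classify_with_neutral(0.55), classify_with_neutral(0.10))
# → 1 0 -1
```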
