presented at the
Jeffrey Breen
President Cambridge Aviation Research jbreen@cambridge.aero June 2011
http://www.theacsi.org/
Actually, they rank below the Post Office and health insurers
@United Weather delays may not be your fault, but you are in the customer service business. It's atrocious how people are getting treated!
We were just told we are delayed 1.5 hrs & next announcement on @JetBlue We're selling headsets. Way to capitalize on our misfortune.
@SouthwestAir I hate you with every single bone in my body for delaying my flight by 3 hours, 30mins before I was supposed to board. #hate
@SouthwestAir I know you don't make the weather. But at least pretend I am not a bother when I ask if the delay will make miss my connection
Hey @delta - you suck! Your prices are over the moon & to move a flight a cpl of days is $150.00. Insane. I hate you! U ruined my vacation!
Game Plan
Search Twitter for airline mentions & collect tweet text
Load sentiment word lists
Score sentiment for each tweet
Summarize for each airline
Scrape ACSI web site for airline customer satisfaction scores
Compare Twitter sentiment with ACSI satisfaction score
A list in R is a collection of objects and its elements may be named or just numbered.
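As a small illustration of named versus numbered elements, here is a sketch in base R. The object flight and its contents are made up for this example and are not part of the twitteR results.

```r
# A hypothetical list mixing named and unnamed elements
flight <- list(airline = "Delta", code = "DL", 1500)

# Named elements can be reached with $ or [["name"]]
flight$airline
flight[["code"]]

# Any element, named or not, can be reached by position with [[ ]]
flight[[3]]

# names() shows which elements are named ("" marks an unnamed one)
names(flight)

# length() counts the elements
length(flight)
```

Single brackets, as in flight[1], return a shorter list rather than the element itself, which is why [[ ]] is used above.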
tweet is an object of type status from the twitteR package. It holds all the information about the tweet returned from Twitter.
The help page (?status) describes some accessor methods like getScreenName() and getText() which do what you would expect:
> tweet$getScreenName()
[1] "Alaqawari"
> tweet$getText()
[1] "I am ready to head home. Inshallah will try to get on the earlier flight to Fresno. @Delta @DeltaAssist"
Estimating Sentiment
There are many good papers and resources describing methods to estimate sentiment. Many of these methods rely on very complex algorithms.
For this tutorial, we use a very simple algorithm which assigns a score by counting the occurrences of positive and negative words in a tweet. The code for our score.sentiment() function can be found at the end of this deck.
Hu & Liu have published an opinion lexicon which categorizes approximately 6,800 words as positive or negative and which can be downloaded.
Positive: love, best, cool, great, good, amazing
Negative: hate, worst, sucks, awful, nightmare
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
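To make the word-counting idea concrete, here is a minimal sketch in base R. It is not the deck's score.sentiment() function: count.score(), toy.pos and toy.neg are hypothetical names, the two short word vectors only stand in for the full lexicon, and scoring as positive matches minus negative matches is an assumption.

```r
# Tiny stand-ins for the Hu & Liu lists (hypothetical names, sample words only)
toy.pos <- c("love", "best", "cool", "great", "good", "amazing")
toy.neg <- c("hate", "worst", "sucks", "awful", "nightmare")

# Hypothetical helper: score one sentence by counting word matches
count.score <- function(sentence, pos.words, neg.words) {
  # strip punctuation, lower-case, then split on whitespace
  clean <- tolower(gsub("[[:punct:]]", "", sentence))
  words <- unlist(strsplit(clean, "\\s+"))
  # assumed scoring rule: positive matches minus negative matches
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

count.score("I love this airline, best crew ever!", toy.pos, toy.neg)  # 2
count.score("Worst flight ever. I hate delays.", toy.pos, toy.neg)     # -2
```

A word that appears in neither list contributes nothing, so a neutral sentence scores 0.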
Loading data is one of R's strengths. These are simple text files, though their header lines use ; as a comment character at the beginning:
> hu.liu.pos = scan('../data/opinion-lexicon-English/positive-words.txt', what='character', comment.char=';')
> hu.liu.neg = scan('../data/opinion-lexicon-English/negative-words.txt', what='character', comment.char=';')
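To see what comment.char does without downloading anything, the sketch below writes a small stand-in file and reads it back. The file contents and the names tmp and demo.words are invented for this example.

```r
# Write a small stand-in lexicon file (contents invented for this example)
tmp <- tempfile(fileext = ".txt")
writeLines(c("; header lines start with a semicolon",
             "; and are skipped by scan()",
             "",
             "amazing",
             "cool",
             "great"), tmp)

# what = "character" reads each whitespace-separated token as a string;
# comment.char = ";" drops everything from ";" to the end of the line
demo.words <- scan(tmp, what = "character", comment.char = ";", quiet = TRUE)
demo.words

# Clean up the temporary file
unlink(tmp)
```

The result is a plain character vector, which is all a word list needs to be for matching.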
Accessing data.frames
Here's the data.frame just returned from score.sentiment():
> result
  score text
1     2 You're awesome and I love you
2    -5 I hate and hate and hate. So angry. Die!
3     4 Impressed and amazed: you are peerless in your achievement of unparalleled mediocrity.
data.frames hold tabular data, so they consist of columns & rows which can be accessed by name or number. Here, score is the name of a column.
So, not so good with sarcasm. Here are a couple of real tweets:
> score.sentiment(c("@Delta I'm going to need you to get it together. Delay on tarmac, delayed connection, crazy gate changes... #annoyed",
                    "Surprised and happy that @Delta helped me avoid the 3.5 hr layover I was scheduled for. Patient and helpful agents. #remarkable"),
                  pos.words, neg.words)$score
[1] -4 5
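Access by name or by number can be tried on a hand-built copy of that data.frame. This is only a sketch: result.demo is a hypothetical name, and the values are typed in from the three sample sentences rather than computed by score.sentiment().

```r
# Hand-built copy of the three-row result (not produced by score.sentiment())
result.demo <- data.frame(
  score = c(2, -5, 4),
  text  = c("You're awesome and I love you",
            "I hate and hate and hate. So angry. Die!",
            "Impressed and amazed: you are peerless in your achievement of unparalleled mediocrity."),
  stringsAsFactors = FALSE
)

result.demo$score                      # column by name
result.demo[["score"]]                 # same column, also by name
result.demo[, 1]                       # same column by number
result.demo[2, "text"]                 # row 2 of the column named "text"
result.demo[result.demo$score < 0, ]   # rows picked by a condition
```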
Let's add two new columns to identify the airline for when we combine all the scores later:
> delta.scores$airline = 'Delta'
> delta.scores$code = 'DL'
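The sketch below shows why the identifying columns matter once tables are stacked. The data.frames delta.demo and united.demo and their scores are invented, and rbind() is one plausible way to combine them, not necessarily the one used in the deck.

```r
# Hypothetical per-airline score tables (values invented for illustration)
delta.demo  <- data.frame(score = c(-4, 5))
united.demo <- data.frame(score = c(-1, 0, 2))

# Assigning a single value to a new column repeats it down every row
delta.demo$airline  <- "Delta"
delta.demo$code     <- "DL"
united.demo$airline <- "United"
united.demo$code    <- "UA"

# rbind() stacks data.frames that share the same columns
all.demo <- rbind(delta.demo, united.demo)

# The airline column now tells the stacked rows apart
table(all.demo$airline)
```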
ggplot2's faceting capability makes it easy to generate the same graph for different values of a variable, in this case airline.
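A faceted bar chart might look like the sketch below. The data.frame facet.demo and its scores are invented, and the choice of geom_bar() with facet_grid() is an assumption about the kind of plot meant here. The plot is only built when ggplot2 is installed.

```r
# Toy scores for two airlines (invented values, not data from the deck)
facet.demo <- data.frame(
  airline = rep(c("Delta", "United"), each = 4),
  score   = c(-2, 0, 1, 1, -1, -1, 0, 2)
)

# Build the plot only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(facet.demo, aes(x = score)) +
    geom_bar() +               # one bar per distinct score value
    facet_grid(airline ~ .)    # one panel (row) per airline
  print(p)
}
```

Because the panels share the same x axis, the score distributions of the airlines can be compared at a glance.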
For each airline (airline + code), let's use the ratio of very positive tweets to all very positive and very negative tweets, expressed as a percentage, as the overall sentiment score:
> twitter.df = ddply(all.scores, c('airline', 'code'), summarise,
                     pos.count = sum( very.pos ),
                     neg.count = sum( very.neg ) )
> twitter.df$all.count = twitter.df$pos.count + twitter.df$neg.count
> twitter.df$score = round( 100 * twitter.df$pos.count / twitter.df$all.count )
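The arithmetic can be checked by hand on a toy table. This sketch uses base R's aggregate() in place of ddply(). The scores are invented, and the cut-offs of 2 and -2 for "very" positive and negative are an assumption, since the definitions of very.pos and very.neg are not shown here.

```r
# Toy tweet scores (invented values)
scores.demo <- data.frame(
  airline = c("Delta", "Delta", "Delta", "Delta", "United", "United", "United"),
  code    = c("DL", "DL", "DL", "DL", "UA", "UA", "UA"),
  score   = c(3, 2, -2, 0, 2, -3, -2),
  stringsAsFactors = FALSE
)

# Assumed cut-offs: 2 or more is very positive, -2 or less is very negative
scores.demo$very.pos <- as.numeric(scores.demo$score >= 2)
scores.demo$very.neg <- as.numeric(scores.demo$score <= -2)

# Sum the two flags within each airline/code group
summary.demo <- aggregate(scores.demo[, c("very.pos", "very.neg")],
                          by = scores.demo[, c("airline", "code")],
                          FUN = sum)
names(summary.demo)[3:4] <- c("pos.count", "neg.count")

# Same two follow-up steps as in the deck
summary.demo$all.count <- summary.demo$pos.count + summary.demo$neg.count
summary.demo$score <- round(100 * summary.demo$pos.count / summary.demo$all.count)
summary.demo
```

For the toy Delta rows this gives 2 very positive and 1 very negative tweet, so the score is round(100 * 2 / 3) = 67. The neutral tweet is ignored.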
http://www.theacsi.org/index.php?option=com_content&view=article&id=147&catid=&Itemid=212&i=Airlines
Unless you specify all=T, merge() drops non-matching rows (like a SQL INNER JOIN), and that's what happened to top-scoring JetBlue.
With a very low score, and low traffic to boot, soon-to-disappear Continental looks like an outlier. Let's exclude it:
> compare.df = subset(compare.df, all.count > 100)
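The effect of all=T on a merge, and of the subset() filter, can be seen on two small tables. Everything in this sketch is invented for illustration: the data.frame names, the scores and the counts. Only the airline codes are real.

```r
# Invented stand-ins for the Twitter and ACSI tables
twitter.demo <- data.frame(code = c("DL", "UA", "B6"),
                           score = c(40, 35, 70),
                           all.count = c(500, 80, 300),
                           stringsAsFactors = FALSE)
acsi.demo <- data.frame(code = c("DL", "UA", "WN"),
                        score = c(56, 61, 81),
                        stringsAsFactors = FALSE)

# Default merge(): only codes present in both tables survive (inner join)
inner <- merge(twitter.demo, acsi.demo, by = "code",
               suffixes = c(".twitter", ".acsi"))

# all = TRUE keeps unmatched rows too and fills the gaps with NA
outer <- merge(twitter.demo, acsi.demo, by = "code", all = TRUE,
               suffixes = c(".twitter", ".acsi"))

inner$code    # DL and UA only
outer$code    # B6 and WN are kept as well

# subset() keeps only the rows where the condition is TRUE
subset(inner, all.count > 100)
```

The suffixes argument is what turns the two clashing score columns into score.twitter and score.acsi.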
an actual result!
ggplot will even run lm() linear (and other) regressions for you with its geom_smooth() layer:
> # opts() is the 2011-era ggplot2 call; later ggplot2 releases use theme() for this
> ggplot( compare.df ) +
    geom_point(aes(x=score.twitter, y=score.acsi, color=airline.twitter), size=5) +
    geom_smooth(aes(x=score.twitter, y=score.acsi, group=1), se=F, method="lm") +
    theme_bw() +
    opts(legend.position=c(0.2, 0.85))
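The line drawn by geom_smooth(method="lm") is an ordinary least-squares fit, which can also be computed directly with lm(). The data below are invented and exactly linear so that the coefficients are easy to verify. They are not the deck's results.

```r
# Invented, exactly linear toy data: score.acsi = 50 + 0.5 * score.twitter
fit.demo <- data.frame(score.twitter = c(20, 40, 60, 80))
fit.demo$score.acsi <- 50 + 0.5 * fit.demo$score.twitter

# The same kind of fit that method = "lm" asks for: y ~ x
fit <- lm(score.acsi ~ score.twitter, data = fit.demo)

# Intercept and slope of the fitted line
coef(fit)

# Predicted ACSI score for a Twitter score of 50
predict(fit, data.frame(score.twitter = 50))
```

With real data the points will not fall exactly on the line, and summary(fit) reports how well the line explains them.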
http://www.despair.com/cudi.html