
Tweet Classification

Mentor: Romil Bansal
Guided by : Dr. Vasudeva Varma
GROUP NO-37

Manish Jindal (201305578)
Trilok Sharma (201206527)

Problem Statement : To automatically
classify Tweets from Twitter into various genres
based on predefined Wikipedia Categories.

Motivation:
o Twitter is a major social networking service with over 200 million tweets made every day.
o Twitter provides a list of Trending Topics in real
time, but it is often hard to understand what these
trending topics are about.
o It is important and necessary to classify these
topics into general categories with high accuracy
for better information retrieval.

Dataset:
o Input data is the static / real-time data consisting of the user tweets.
o Training dataset: fetched from Twitter with the twitter4j API.

Final Deliverable:
o It will return a list of all categories to which the input tweet belongs.
o It will also give the accuracy of the algorithm used for classifying tweets.

Categories
We took the following categories into consideration for classifying Twitter data:
1) Business 2) Education 3) Entertainment 4) Health 5) Law 6) Lifestyle 7) Nature 8) Places 9) Politics 10) Sports 11) Technology

Concepts used for better performance
o Cleaning crawled data.
o Keyword stemming: reduce inflected words to their stem, base, or root form using Porter stemming.
o Outlier removal: remove low-frequency and high-frequency words using a bag-of-words approach.
o Stop-word removal: remove the most common words, such as "the", "which", "at", "is", "and", "on".
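The cleaning steps above can be sketched as a small pipeline. This is a minimal stand-in, not the project's code: the project used the Porter stemmer, while `crude_stem` below is a deliberately simplified suffix-stripper, and the stop-word list and frequency thresholds are assumed values for illustration.

```python
import re
from collections import Counter

# Assumed, abbreviated stop-word list (the slides name "the, which, at, is, and, on").
STOP_WORDS = {"the", "which", "at", "is", "and", "on", "a", "who", "for"}

def crude_stem(word):
    """Very rough stand-in for Porter stemming: strip common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def clean_tweets(tweets, min_freq=1, max_freq=1000):
    """Clean, remove stop words, stem, then drop frequency outliers."""
    docs = []
    for tweet in tweets:
        tokens = re.findall(r"[a-z]+", tweet.lower())        # cleaning
        tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
        docs.append([crude_stem(t) for t in tokens])         # stemming
    # Bag-of-words counts across the corpus; drop too-rare / too-frequent terms.
    freq = Counter(t for doc in docs for t in doc)
    return [[t for t in doc if min_freq <= freq[t] <= max_freq] for doc in docs]
```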

Other concepts used
o Synonym expansion: if a feature (word) of the test query is not found as one of the dimensions in the feature space, replace that word with its synonym. Done using WordNet.
o Named entity recognition: for ranking the result categories and finding the most appropriate one.
o Spelling correction: correct spellings using the edit-distance method.

Tweet Classification Algorithms
We used three algorithms for classification:
1) Naïve Bayes based (supervised)
2) SVM based (supervised)
3) Rule based

System Pipeline
Training: crawl Twitter data → tweet cleaning, stop-word removal → remove outliers → create feature vector for each tweet → create index file of feature vectors → extract features (unique word list) → create model files.
Testing: test query / tweet → tweet cleaning, stop-word removal → edit distance, WordNet (synonyms) → create feature vector for test tweet → create index file of the feature vector → apply model files → apply named entity recognition → rank result categories → output category.

Main Idea of Supervised Learning
Assumption: the training set consists of instances of different classes cj, described as conjunctions of attribute values.
Task: classify a new instance d, given as a tuple of attribute values, into one of the classes cj ∈ C.
Key idea: assign the most probable class using a supervised learning algorithm.

Method 1: Naïve Bayes Classifier
Bayes' rule states:

P(c | d) = P(d | c) · P(c) / P(d)

where P(d | c) is the likelihood, P(c) is the prior, and P(d) is the normalization constant. We used the "WEKA" machine-learning library for the Bayes classifier in our project.
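The project used WEKA (a Java library) for this classifier; purely to illustrate the rule, here is a minimal multinomial Naïve Bayes sketch with Laplace smoothing, scoring each class by log P(c) + Σ log P(w | c):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Toy multinomial Naïve Bayes; illustrative only, not the WEKA implementation."""

    def fit(self, docs, labels):
        self.n_docs = len(labels)
        self.prior = Counter(labels)            # class document counts
        self.word_counts = defaultdict(Counter) # per-class word counts
        self.vocab = set()
        for doc, c in zip(docs, labels):
            self.word_counts[c].update(doc)
            self.vocab.update(doc)
        return self

    def predict(self, doc):
        best, best_score = None, float("-inf")
        for c in self.prior:
            total = sum(self.word_counts[c].values())
            score = math.log(self.prior[c] / self.n_docs)  # log prior
            for w in doc:                                  # log likelihood, Laplace-smoothed
                score += math.log((self.word_counts[c][w] + 1) /
                                  (total + len(self.vocab)))
            if score > best_score:
                best, best_score = c, score
        return best
```

Note that the normalization constant P(d) can be dropped when comparing classes, since it is the same for every class.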

Method 2: SVM Classifier (Support Vector Machine)
Given a new point x, we can score its projection onto the hyperplane normal, i.e., compute the score:

wᵀx + b = Σᵢ αᵢ yᵢ xᵢᵀx + b

and decide the class based on whether the score is < or > 0. We can also set a confidence threshold t:
Score > t: yes
Score < -t: no
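The dual-form score above can be sketched directly; the support vectors, multipliers, and threshold below are made-up numbers, standing in for a trained SVM:

```python
def svm_score(x, support_vectors, alphas, ys, b):
    """Dual-form decision value: sum_i alpha_i * y_i * (x_i . x) + b."""
    dot = lambda u, v: sum(a * c for a, c in zip(u, v))
    return sum(a * y * dot(sv, x)
               for a, y, sv in zip(alphas, ys, support_vectors)) + b

def svm_decide(score, t=0.0):
    """Sign decision with an optional confidence band of width t."""
    if score > t:
        return "yes"
    if score < -t:
        return "no"
    return "uncertain"
```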

Multi-class SVM

k Y SVMs are trained. 1996) 13 . 1995) 1-against-1 Pair-wise.Multi-class SVM Approaches 1-against-all Each of the SVMs separates a single class from all remaining classes (Cortes and Vapnik. k(k-1)/2. Each SVM separates a pair of classes (Fridman.

Advantages of SVM
o High-dimensional input space.
o Few irrelevant features (dense concept).
o Sparse document vectors (sparse instances).
o Text categorization problems are linearly separable.
o For linearly inseparable data we can use kernels to map the data into a high-dimensional space, so that it becomes linearly separable with a hyperplane.

Method 3: Rule Based
We defined a set of rules to classify a tweet based on term frequency:
a. Extract the features of a tweet.
b. Count the term frequency of each feature.
c. The feature having the maximum term frequency across all the categories mentioned above gives our first classification. Since this cannot be right all the time, we also maintain a count of the categories in which the tweet's features fall; the category nearest to the tweet becomes our next classification.

sachin.ExampleTweet=sachin is a good player.was.eats. who eats apple and banana which is good for health.he.who Classification.apple.banana Stop word-is.a.which.for.and.health. Feature.good.player.Feature-category termfrequency sachin-sports 2000 player-sports 900 eating-health 500 apple-technology 1000 health-health 800 banana-health 700 .

.sachin So our category is .e.sports 2nd approximation Max feature is laying in health i. 3 times .Max term-frequency . i.e. So our second approximation would be health. If both of these are in same category then we have only one category. if here max feature would be laying in sports than we have only one result that is sports.

Cross-Validation (Accuracy)
Steps for k-fold cross-validation:
Step 1: split the data into k subsets of equal size.
Step 2: use each subset in turn for testing, and the remainder for training.
Often the subsets are stratified before the cross-validation is performed.
The error estimates are averaged to yield an overall error estimate.
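The two steps above can be sketched as a plain (non-stratified) k-fold splitter, shown here only to make the fold mechanics concrete:

```python
def k_fold_splits(data, k):
    """Yield (train, test) pairs; each of the k subsets serves as test once."""
    folds = [data[i::k] for i in range(k)]  # k roughly equal subsets
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Stratification, which the slide mentions, would additionally preserve the class proportions inside each fold before splitting.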

Accuracy Results (10 folds)
Accuracy of algorithm in %:

Categories \ Algo.   SVM     Naïve   Rule
Business             86.44   98.71   76.71
Education            85.1    87.17   73.05
Entertainment        86.8    79.07   81.27
Health               95.44   77.88   76.73
Law                  81.57   81.91   81.93
Lifestyle            93.31   90.49   89.30
Nature               87.64   84.35   80.24
Places               81.11   83.64   82.6
Politics             81.67   84.25   82.8
Sports               87.38   75.0    78.87
Technology           83.01   75.05   81.62

Unique Features
o Worked on freshly crawled Twitter data using the twitter4j API.
o Worked on eleven different categories.
o Applied three different methods of supervised learning to classify tweets into the different categories.
o Achieved high performance speed, with accuracy in the range of 85 to 95%.
o Performed tweet cleaning, stop-word removal, and stemming.
o Used edit distance for spelling correction.
o Used WordNet for query expansion and synonym finding.
o Used named entity recognition for ranking.
o Validated using 10-fold cross-validation.

Snapshot .

Result .

Accuracy .

Thank You!