ON
GUIDED BY
Ms. PARUL YADAV
SUBMITTED BY
DIKSHA MAHAJAN (25011503110)
CERTIFICATE
This is to certify that the project entitled SENTIMENT ANALYSIS &
INFORMATION EXTRACTION is the original work carried out by Diksha Mahajan
(25011503110), a student of B.Tech (IT), BVCOE, affiliated to GGSIPU, during the year 2014, in
partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in
Information Technology, and that the project has not previously formed the basis for the award of
any degree, diploma, associateship, fellowship or any other similar title.
1. Objective
Abstract:
The project aims to provide a sentiment analysis system, accessed through a web interface, that
lets web users, analysts and product managers gain insight into public sentiment on particular
products and services. The project makes extensive use of product and service review sites and
forums such as IMDB, as well as microblogging sites such as Twitter. The system aims to apply
efficient information retrieval algorithms and to carry out feature extraction, so that sentiment
can be analysed at a finer, per-feature level.
Introduction
What is Part-of-Speech Tagging and how did we implement it?
In corpus linguistics, Part-of-Speech tagging, also called grammatical
tagging or word-category disambiguation, is the task of labelling each word with
its grammatical category; in English, for example, words are classified as nouns,
verbs, prepositions and so on. In computational linguistics, Part-of-Speech
tagging is now performed using algorithms built on Hidden Markov Models,
decision tables, dynamic programming models, unsupervised taggers, etc. It is
a part of Natural Language Processing, and many successful contributions have
been made on this topic.
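As a rough illustration of the Hidden Markov Model and dynamic programming approach mentioned above, the sketch below runs the Viterbi algorithm over a two-tag toy model. The class name ToyHmmTagger, the tag set and every probability in it are made-up values chosen for illustration; this is not the tagger used in the project.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy HMM tagger decoded with Viterbi. All probabilities are illustrative assumptions. */
final class ToyHmmTagger {
    static final String[] TAGS = {"NOUN", "VERB"};
    // START[j]: probability that a sentence starts with TAGS[j].
    static final double[] START = {0.8, 0.2};
    // TRANS[i][j]: probability that TAGS[j] follows TAGS[i].
    static final double[][] TRANS = {{0.3, 0.7}, {0.8, 0.2}};
    // EMIT.get(word)[j]: probability that TAGS[j] emits the word.
    static final Map<String, double[]> EMIT = new HashMap<>();
    // Fallback emission scores for words missing from the toy lexicon.
    static final double[] UNKNOWN = {0.5, 0.5};

    static {
        EMIT.put("actors", new double[] {0.9, 0.1});
        EMIT.put("act", new double[] {0.3, 0.7});
        EMIT.put("films", new double[] {0.6, 0.4});
    }

    static double[] emit(String word) {
        return EMIT.getOrDefault(word.toLowerCase(Locale.ROOT), UNKNOWN);
    }

    /** Returns the most probable tag sequence for the given words. */
    static String[] tag(String[] words) {
        int n = words.length;
        int k = TAGS.length;
        if (n == 0) {
            return new String[0];
        }
        double[][] best = new double[n][k]; // best log-probability of a path ending in tag j at t
        int[][] back = new int[n][k];       // previous tag on that best path
        for (int j = 0; j < k; j++) {
            best[0][j] = Math.log(START[j]) + Math.log(emit(words[0])[j]);
        }
        for (int t = 1; t < n; t++) {
            double[] e = emit(words[t]);
            for (int j = 0; j < k; j++) {
                best[t][j] = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < k; i++) {
                    double score = best[t - 1][i] + Math.log(TRANS[i][j]) + Math.log(e[j]);
                    if (score > best[t][j]) {
                        best[t][j] = score;
                        back[t][j] = i;
                    }
                }
            }
        }
        // Pick the best final tag, then follow the back-pointers to the start.
        int last = 0;
        for (int j = 1; j < k; j++) {
            if (best[n - 1][j] > best[n - 1][last]) {
                last = j;
            }
        }
        String[] out = new String[n];
        for (int t = n - 1; t >= 0; t--) {
            out[t] = TAGS[last];
            last = back[t][last];
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", tag(new String[] {"actors", "act", "films"})));
    }
}
```

Log-probabilities are summed instead of multiplying raw probabilities so that long sentences do not underflow.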
A Part-Of-Speech Tagger (POS Tagger) is a piece of software that reads text in
some language and assigns a part of speech to each word (and other tokens), such
as noun, verb or adjective, although computational applications generally use
more fine-grained POS tags like 'noun-plural'. We used the Stanford POS Tagger,
a Java implementation of the log-linear part-of-speech taggers developed by
Stanford engineers and researchers.
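Once text has been tagged, the proper nouns can be filtered out with a few lines of standard Java. The sketch below assumes the tagger output is a string of word_TAG tokens using Penn Treebank tags (NNP and NNPS for proper nouns); the class name ProperNounFilter and the sample sentence are hypothetical and not taken from the project code.

```java
import java.util.ArrayList;
import java.util.List;

/** Picks proper nouns out of text tagged in the assumed word_TAG form. */
final class ProperNounFilter {
    static List<String> properNouns(String tagged) {
        List<String> result = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            // Split on the last underscore so words containing '_' stay intact.
            int sep = token.lastIndexOf('_');
            if (sep <= 0) {
                continue; // no tag, or an empty word
            }
            String word = token.substring(0, sep);
            String tag = token.substring(sep + 1);
            if (tag.equals("NNP") || tag.equals("NNPS")) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String tagged = "Leonardo_NNP was_VBD brilliant_JJ in_IN Inception_NNP ._.";
        System.out.println(properNouns(tagged));
    }
}
```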
Tools to use:
Wekaparallel
Algorithm followed:
1. Generate the IMDB movie review URL for the movie.
2. Download all the review web pages from IMDB.
3. Apply POS tagging on the downloaded movie reviews to get all the nouns, including proper
nouns, such as "leonardo", "acting", "direction", "oscars" etc.
4. Identify all the actors, actresses, directors and movie names present in the list generated
in step 3.
5. Extract all the sentences that contain the keywords generated in step 4.
6. Apply sentiment analysis on the sentences extracted in step 5.
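The sentence-extraction step above can be sketched with standard Java regular expressions. This is only a minimal illustration: the class name KeywordSentenceExtractor is hypothetical, and the sentence splitter is deliberately naive (it breaks after '.', '!' or '?' followed by whitespace), so abbreviations such as "Mr." would be split incorrectly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Keeps only the sentences of a review that mention at least one keyword. */
final class KeywordSentenceExtractor {
    static List<String> extract(String review, List<String> keywords) {
        // Whole-word, case-insensitive patterns, so "acting" does not match "reacting".
        List<Pattern> patterns = new ArrayList<>();
        for (String keyword : keywords) {
            patterns.add(Pattern.compile("\\b" + Pattern.quote(keyword) + "\\b",
                    Pattern.CASE_INSENSITIVE));
        }
        List<String> hits = new ArrayList<>();
        // Naive sentence split: break on whitespace that follows '.', '!' or '?'.
        for (String sentence : review.trim().split("(?<=[.!?])\\s+")) {
            for (Pattern pattern : patterns) {
                if (pattern.matcher(sentence).find()) {
                    hits.add(sentence);
                    break; // one match is enough to keep the sentence
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String review = "The acting was superb. I went on a Tuesday. Nolan's direction is tight!";
        System.out.println(extract(review, Arrays.asList("acting", "direction")));
    }
}
```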
IMDBCrawler:
We built an IMDB review extractor because IMDB does not provide an API for extracting reviews.
We use an API that returns the IMDB ID for a given movie, then download the corresponding
web page and store the results. We used the Jsoup Java library for downloading web content and
applying complex pattern matching on the retrieved text.
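The step from an IMDB ID to a review page address might look like the sketch below. Both the ID format ("tt" followed by at least seven digits) and the URL pattern are assumptions based on IMDB's public site and may change; the class name ImdbReviewUrl is hypothetical. Downloading and parsing the page with Jsoup is left out so that the example needs only the standard library.

```java
/** Builds the review page URL for a movie from its IMDB ID (assumed URL layout). */
final class ImdbReviewUrl {
    static String forTitle(String imdbId) {
        // Assumed ID format: "tt" followed by seven or more digits, e.g. tt1375666.
        if (imdbId == null || !imdbId.matches("tt\\d{7,}")) {
            throw new IllegalArgumentException("Not an IMDB title ID: " + imdbId);
        }
        return "https://www.imdb.com/title/" + imdbId + "/reviews";
    }

    public static void main(String[] args) {
        System.out.println(forTitle("tt1375666"));
    }
}
```

Validating the ID before building the URL keeps malformed lookup results from turning into requests for pages that do not exist.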
Handouts:
Progress:
S.NO   TASKS                 ATTEMPTED   STATUS
       Feature Extraction
1.1    Actors                Yes         Completed
1.2    Actresses             Yes         Completed
1.3    Directors             Yes         Completed
1.4    Movies                Yes         Completed
       Crawler
2.1    IMDB                  Yes         Completed
2.2    Rotten Tomatoes       No
2.3    GSM Arena             No
       Algorithm
3.1    POS Integration       Yes         Completed
3.2    Sentiment Analysis    No
3.3    Entity Recognition    No
       User Interface
4.1    Main Module           Yes         In Progress
4.2    Contribution Module   No
4.3    Project Wiki          No