Académique Documents
Professionnel Documents
Culture Documents
2/2012 81
1 Introduction
The development of internet and web 2.0
technologies, enabled by cost reduction of
special mechanisms, taken from areas such as
information retrieval - IR (Information
Retrieval) and information extraction - IE
technological infrastructure, has been an (Information Extraction), to obtain data and
exponential increase in the amount of to pre-process them to apply data mining
information in online systems. These very techniques. From the terms of proposed
large volumes of information are very objectives, Web mining is divided into three
difficult to process by individuals, leading to categories:
information overload and affecting decision- • Web structure mining - knowledge
making processes in organizations. discovery from hyperlinks to maximize
Therefore, providing new techniques for information about the relations between
creation of knowledge is important in web pages;
organizational strategy [1]. Knowledge • Web usage mining - extracting models
discovery using automated techniques is and patterns of users, from web logs
relevant for companies’ success, promoting (logs), that stores data access and
research in the development of activities of each visitor to a website,
methodologies, techniques and systems for detecting website users requirements;
extracting knowledge from data warehouses • Web content mining - extracting
and data mining [2]. knowledge from web page content [5].
A large volume of information in current Text mining techniques use methods of
online systems is stored in text form. This is knowledge discovery in unstructured text
the way information is transmitted on the data. These techniques performing the
internet, being the most natural process of extracting information from the
representation form and easier to read by the text unstructured format, are used in research
people [3]. In this context, applying of data areas such as natural language processing
mining techniques to the content of web (NLP), artificial intelligence and machine
pages, (text mining on web pages) or content learning. The techniques are applied in
mining become important [4]. document classification, content spam
Between web mining and data mining are detection and trend analysis.
important differences in terms of data
collection. In data mining is assumed that the
data is already collected and stored in
databases, while in web mining, are used
82 Informatica Economică vol. 16, no. 2/2012
the words or phrase that expresses Social networks like Twitter and Facebook
feelings. The process is conducted in have become a large scale public information
three stages [10]: source that cannot be ignored. People use
a. Entity determination – identification of them to express their sentiments on various
texts that contain information about the topics. Applying sentiment analysis on these
entity under review; reviews and their automated classification on
b. Determination of sentiments – for the positive, negative and neutral categories can
text of the previous stage is considered provide valuable information to companies as
the content of opinions and sentiments, market reports.
by searching a set of words carrying
sentiments, or by prior training a 4 Stages of sentiment analysis
classifier; To carry out sentiment analysis are necessary
c. Determination of entity - sentiment several steps, in which are applied various
relationship - at this stage is analyzed techniques and methodologies:
whether opinions extracted are a) Data collection and pre-processing
addressed to the entity under review. In this stage it is acquired the text that will be
Usually this is done through a analyzed for detection of opinions. It is
predefined list of patterns. important, according to the methodology
5. Opinion mining process is centered on a used, to eliminate all matters that not express
domain, so, a solution determined for a opinions. In this phase, pre-processing is
given area (i.e. movie reviews analysis), done to eliminate unnecessary words or
will not work on another (e.g. foto- irrelevant opinions. It is necessary to extract
camera). The way of expressing feelings keywords from the text which can provide an
varies from one domain to another, the accurate classification. These keywords are
developed model requiring adaptation. usually stored as an array of features A =
(A1, A2, ..., An). Each element of array is a
3 Applications of sentiment analysis word from the original text, called aspect
There are many areas where sentiment (feature). For every feature, can exist a
analysis can be used as the following: binary value, indicating the presence or
- sentiment analysis on financial markets. For absence of the feature, or a value that
investors is important information the expresses the frequency of appearance in the
analysts and other investors opinions about text. Selecting common aspects is necessary
the stocks of a company, to identify price so that they express those relevant opinions
trends. for sentiment analysis.
- sentiment analysis on products. A company b) Classification
is interested in customers' perceptions about In this phase, content polarity is identified.
its products. Information may be used to Usually three classes are used for
improve products and identifying new classification: positive, negative and neutral.
marketing strategies [11]. Classification algorithms used for sentiment
- sentiment analysis on a location. Tourists analysis, depending on the method used
want to know the best places to visit or (supervised or unsupervised), require a
famous restaurants. Applying sentiment training set with pre marked examples. It is
analysis can be obtained relevant important to train the model used for
information for planning a trip. classification with domain-specific data.
- sentiment analysis on elections. Using Marking is done by expressing subjectivity
sentiment analysis we can identify the voter and polarity of training sets.
opinions about a certain candidate. c) Aggregation and presentation of
- analysis on movies or software programs. results
We can detect users' sentiments from At this stage, opinions classification process
posted reviews on specialized sites. result obtained at the previous stage is
84 Informatica Economică vol. 16, no. 2/2012
subjected to a process of aggregation after The paper aims to analyze user comments on
some algorithms to determine the general movies, in order to determine the positive or
opinion of the analyzed text. Presentation can negative opinion. Sites specialized in this
be done directly expressing sentiment textual field, like RottenTomatos.com and
or using graphics. Imdb.com, contain from several tens to
several thousands of reviews for each movie
5 Proposed opinions mining method that we can extract with a crawler.
Results opinion
reviews
In this case we will use, for training, a (A) and P (B) and conditional probability of
collection of comments (sentence polarity event A conditioned by B and event B
dataset v1.0) already extracted from these conditioned by A, P (A | B) and P (B | A).
sites and available online at Thus Bayes’ formula is [13]:
http://www.cs.cornell.edu/people/pabo/
movie-review-data/ [12]. This collection 𝑃(𝐵|𝐴)𝑃(𝐴)
𝑃(𝐴|𝐵) =
contains 5331 sentences already classified as 𝑃(𝐵)
positive and 5331 negative opinions from
2000 comments processed and classified in This theorem enables us to determine a
two categories. Comments usually contain conditional probability having the probability
several sentences, but opinion will be of contrary event and independent
determined at sentence level, then later probabilities of events. Thus, we can estimate
determining overall comment opinion. the probability of an event based on the
Obtained collection consists of two files, one examples of its occurrence. Thus, we can
for each set of positive and negative estimate the probability of an event based on
opinions, containing one sentence per line, the examples of its occurrence. In this case,
making it easy to process. To extract we estimate probability that a document is
opinions we will use a Naive Bayesian positive or negative, in a certain context, or
classifier. This type of classifier has the the likelihood that an event to take place if it
advantage that it is easy to implement, was predetermined to be positive or negative.
quickly and generate good results. This is facilitated by the collection of
positive and negative examples chosen. The
6 Using Naive Bayes algorithm process is naive Bayesian because of how we
The Naive Bayes classifier is a probability calculate the probability of occurrence of an
classifier, based on Bayes' theorem. Bayes' event - is the product of probability of
theorem specifies mathematically the relation occurrence of each word in the document.
between probability of two events A and B, P This presumes that there is no connection
Informatica Economică vol. 16, no. 2/2012 85
between the words. This assumption of series of positive and negative examples and
independence is introduced to facilitate the calculating the frequency of each of the
construction of classifier, it is not entirely classes. This learning process is supervised,
true, and there are words that appear together requiring the existence of pre-classification
more frequently than individual. examples for training.
We estimate the probability of a word with Starting from:
positive or negative meaning by analyzing a
𝑃(𝑠𝑒𝑛𝑡𝑖𝑚𝑒𝑛𝑡)𝑃(𝑠𝑒𝑛𝑡𝑒𝑛𝑐𝑒|𝑠𝑒𝑛𝑡𝑖𝑚𝑒𝑛𝑡)
𝑃(𝑠𝑒𝑛𝑡𝑖𝑚𝑒𝑛𝑡|𝑠𝑒𝑛𝑡𝑒𝑛𝑐𝑒) =
𝑃(𝑠𝑒𝑛𝑡𝑒𝑛𝑐𝑒)
𝑃(𝑤𝑜𝑟𝑑|𝑠𝑒𝑛𝑡𝑖𝑚𝑒𝑛𝑡)
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑤𝑜𝑟𝑑 𝑜𝑐𝑐𝑢𝑟𝑒𝑛𝑐𝑒𝑠 𝑖𝑛 𝑐𝑙𝑎𝑠𝑠 + 1
=
𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑤𝑜𝑟𝑑𝑠 𝑏𝑒𝑙𝑜𝑔𝑖𝑛𝑔 𝑡𝑜 𝑎 𝑐𝑙𝑎𝑠𝑠 + 𝑡𝑜𝑡𝑎𝑙 𝑛𝑢𝑚𝑏𝑒𝑟 𝑜𝑓 𝑤𝑜𝑟𝑑𝑠
the training collection, starts by calculating each of the two classes, we compute a score
the prior probability, before word analysis, by successive multiplying probability of each
based on the number of positive and negative word belongs to one of two classes. Finally
examples. In this case, the collections being we return the class with the highest score, the
equal, the probability is of 0.5. Every one in which it will classify the sentence.
sentence is tokenized into words, and for
Method antreneazaNB(), train algorithm into words and stores the frequency of each
using input classified data collections. It word in both positive and negative classes for
reads preclasificate collections of files, loops classification:
through the test examples, tokenize sentences
}
while($line = fgets($fh)) {
if($limit > 0 && $i > $limit) {
break;
}
$i++;
$this->nrDoc++;
$this->nrDocClasa[$clasa]++;
$tokens = $this->tokenise($line);
$tokens = $this->ngram($tokens,2);
foreach($tokens as $token) {
$st = new Stemmer();
$token=$st->stem($token);
if (!$this->toExclude($token)){
if(!isset($this->index[$token][$clasa])) {
$this->index[$token][$clasa] = 0;
}
$this->index[$token][$clasa]++;
$this->nrCvClasa[$clasa]++;
$this->nrCuv++;
}
}
}
fclose($fh);
}
It will be constructed a procedure which will classes, and ultimately, it will determine the
take comment and will split into sentences. overall score of comment and its polarity by
Each sentence will be evaluated to establish returning the index of the array with the
the determinant sentiment by the vector highest value:
$scor with indexes corresponding to the two
We present the result returned by the program for the next comment:
array
'pos' => int 3
'neg' => int 1
Comment Classification:- pos
(precision) and the recall, both comparing the concepts will be used contingency table [14]
results with relevance. To express these as follows:
<?php
$op = new naiveBayes();
$op-> antreneazaNB ('opinion/rt-polaritydata/rt-polarity.neg', 'neg', 5000);
$op->antreneazaNB ('opinion/rt-polaritydata/rt-polarity.pos', 'pos', 5000);
$i = 0; $tn = 0; $fn = 0; $tp = 0; $fp = 0;
$fh = fopen('opinion/rt-polarity.neg', 'r');
while($line = fgets($fh)) {
if($i++ > 5001) {
if($op->clasifica($line) == 'neg') {
$tn++;
} else {
$fp++;
}
}
}
$fh = fopen('opinion/rt-polarity.pos', 'r');
while($line = fgets($fh)) {
if($i++ > 5001) {
if($op->classifica($line) == 'pos') {
$tp++;
} else {
$fn++;
}
}
}
echo "<br />Precizie: " . ($tp / ($tp+$fn));
echo "<br />Recall : " . ($tp / ($tp+$fn));
echo "<br />Accuracy: " . (($tp+$tn) / ($tp+$tn+$fp+$fn));
?>
Informatica Economică vol. 16, no. 2/2012 89
The following chart presents the influence of point the increase number of the initial
data training volume on the accuracy of training set, will produce very little influence
classifications in the method used. We detect on precision of the algorithm.
a critical number of training data, from which
90 Informatica Economică vol. 16, no. 2/2012