Deep learning for Sentiment analysis

PRESENTER: HUNG D. PHAN, INSTITUTE OF INFORMATION TECHNOLOGY

Outline
1. Introduction

2. Sentiment analysis approaches


3. Overview of deep learning for applications.

4. Deep learning for sentiment detection.

5. Future research direction

1. Introduction
Each sentence and paragraph carries its own sentiment. For example:
"This is a good movie" → positive comment.
"This movie contains bad words, bad characters and unrelated scenes" → negative comment.

1. Introduction
Purpose of sentiment detection:

Classify comments.

Extract relationships between sentences in a paragraph.

Identify judgments and evaluations.

Identify emotional states.

Identify intended emotional communication.

Outline
1. Introduction

2. Sentiment analysis approaches


3. Overview of deep learning for applications.

4. Deep learning for sentiment detection.

5. Future research direction

2. Sentiment analysis approaches


Issues: classifying the polarity of a given text at the document, sentence, or feature/aspect level. Beyond polarity, sentiment classification looks at emotional states such as angry, happy, or sad.

Early work on polarity detection:

Peter D. Turney [1]: the classification of a review is predicted by the average semantic orientation of the phrases in the review that contain adjectives or adverbs.
Bo Pang and Lillian Lee [2]: exploiting class relationships for sentiment categorization with respect to rating scales.
Benjamin Snyder and Regina Barzilay [3]: focus on restaurant reviews, analyzing specific aspects of a restaurant.

Peter D. Turney [1]


Purpose: classification of film reviews. Provides a simple unsupervised learning algorithm for classifying reviews as recommended (thumbs up) or not recommended (thumbs down). The classification of a review is predicted by the average semantic orientation of the phrases in the review that contain adjectives or adverbs. In this paper, the semantic orientation of a phrase is calculated as the mutual information between the given phrase and the word "excellent" minus the mutual information between the given phrase and the word "poor".

Peter D. Turney [1]


1) Identify phrases in the input text that contain adjectives or adverbs (using part-of-speech tagging).

2) Estimate the semantic orientation of each extracted phrase.

3) Assign the review to a class, recommended or not recommended, based on the average semantic orientation of its phrases.

Pointwise Mutual Information (PMI) and Information Retrieval (IR)

PMI-IR method
The Pointwise Mutual Information (PMI) between two words, word1 and word2, is defined as follows (Church & Hanks, 1989):

PMI(word1, word2) = log2( p(word1 & word2) / (p(word1) p(word2)) )

The Semantic Orientation (SO) of a phrase is calculated as:

SO(phrase) = PMI(phrase, "excellent") − PMI(phrase, "poor")

Estimated from the number of hits (documents matching a query) returned by a search engine, this becomes:

SO(phrase) = log2( hits(phrase NEAR "excellent") · hits("poor") / ( hits(phrase NEAR "poor") · hits("excellent") ) )
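A minimal Python sketch of this computation; hits is a hypothetical callable (e.g., backed by a search API) returning hit counts for a query, and the small constant guarding against zero counts follows the spirit of the paper rather than its exact code:

import math

def semantic_orientation(phrase, hits):
    # SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor"),
    # estimated from hit counts; 0.01 guards against zero counts.
    num = hits(f'"{phrase}" NEAR "excellent"') * hits('"poor"') + 0.01
    den = hits(f'"{phrase}" NEAR "poor"') * hits('"excellent"') + 0.01
    return math.log2(num / den)

def classify_review(phrases, hits):
    # Recommended if the average SO of the extracted phrases is positive.
    avg = sum(semantic_orientation(p, hits) for p in phrases) / len(phrases)
    return "recommended" if avg > 0 else "not recommended"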

10

Peter D. Turney [1]

11

Peter D. Turney [1]


Disadvantage: the average SO tends to err on the side of guessing that a review is not recommended when it is actually recommended.

12

Bo Pang and Lillian Lee [2]


Determine the author's evaluation with respect to a multi-point scale (one to five stars). Two main steps:

Evaluating human performance at the task.

Applying a meta-algorithm, based on a metric-labeling formulation of the problem, that alters a given n-ary classifier's output in an explicit attempt to ensure that similar items receive similar labels.

13

Bo Pang and Lillian Lee [2]


The idea of metric labeling comes from Jon Kleinberg and Éva Tardos [28]. The cost of a labeling f, which represents the labeling error, is the total:

Cost(f) = Σ_v c(v, f(v)) + Σ_(u,v)∈E w(u,v) · d(f(u), f(v))

where c(v, f(v)) is the assignment cost for item v and d is a distance metric on labels.

Metric labeling: minimize this cost.

14

Bo Pang and Lillian Lee [2]


Explicitly incorporating information about item similarities together with label similarity information (for instance, "one star" is closer to "two stars" than to "four stars") leads to viewing the task as metric labeling (Kleinberg and Tardos, 2002), where label relations are encoded via a distance metric. To detect the similarity between items and labels, three algorithms based on Support Vector Machines were investigated:
1. One-vs-all
2. Regression
3. Metric labeling
The authors also consider which item similarity measure to apply, proposing one based on the positive-sentence percentage.

15

Bo Pang and Lillian Lee [2]


One-vs-all

Each training point belongs to one of N different classes. The goal is to construct a function which, given a new data point, will correctly predict the class to which the new point belongs [5].
(i) Solve K different binary problems: classify "class k" versus "the rest" for k = 1, …, K.
(ii) Assign a test sample to the class giving the largest (most positive) value f_k(x), where f_k(x) is the solution of the kth problem.
Purpose: classify reviews into output labels (score ranks) and evaluate the accuracy.
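A minimal sketch of one-vs-all with scikit-learn (not the paper's original implementation); the toy reviews and star labels are made up for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

reviews = ["terrible plot and weak acting", "a decent, watchable film",
           "great acting and a great story", "awful, simply awful",
           "solid but unremarkable"]
stars = [1, 3, 5, 1, 3]                 # toy rating-scale labels

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)

# One binary SVM per class; prediction picks the largest decision value f_k(x).
ova = OneVsRestClassifier(LinearSVC()).fit(X, stars)
print(ova.predict(vectorizer.transform(["great story, great acting"])))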

16

Bo Pang and Lillian Lee [2]


Regression

In statistics, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables. Here, the labels are assumed to come from a discretization of a continuous function g mapping from the feature space to a metric space.

17

Bo Pang and Lillian Lee [2]


Regression

The idea is to find the hyperplane that best fits the training data, but where training points whose labels are within distance ε of the hyperplane incur no loss; other points incur a loss that grows with the distance between their label l and the value predicted for them by the fitted hyperplane function. Koppel and Schler (2005) found that applying linear regression to classify documents (in a different corpus) with respect to a three-point rating scale provided greater accuracy than OVA SVMs and other algorithms.
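A sketch of this approach using scikit-learn's ε-insensitive support vector regression and rounding back onto the discrete scale; the one-dimensional feature, data and epsilon value are illustrative assumptions:

import numpy as np
from sklearn.svm import SVR

# Toy one-dimensional feature (e.g., a PSP-like statistic) and star labels.
X_train = np.array([[0.10], [0.35], [0.50], [0.75], [0.95]])
y_train = np.array([1, 2, 3, 4, 5])

# Points within epsilon of the fitted hyperplane incur no loss.
svr = SVR(kernel="linear", epsilon=0.5).fit(X_train, y_train)

# Discretize the continuous prediction back onto the 1-5 scale.
raw = svr.predict(np.array([[0.60]]))
print(int(np.clip(np.rint(raw[0]), 1, 5)))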

18

Bo Pang and Lillian Lee [2]


Metric labeling

Let d be a distance metric on labels, and let nn_k(x) denote the k nearest neighbors of item x according to some item-similarity function sim.
Then it is quite natural to pose the problem as finding a mapping of instances x to labels l_x (respecting the original labels of the training instances) that minimizes

Σ_x∈test [ π(x, l_x) + α Σ_y∈nn_k(x) f(d(l_x, l_y)) · sim(x, y) ]

where π(x, l) is the cost, from the initial classifier, of assigning label l to x, and f is a monotonic function.

19

Bo Pang and Lillian Lee [2]


To detect the similarity between items, a traditional measure is term-overlap-based, such as the cosine between term-frequency-based document vectors. Ratings can instead be related through the positive-sentence percentage (PSP) of a text, i.e., the number of positive sentences divided by the number of subjective sentences.
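A sketch combining PSP with the metric-labeling objective above; the helper names (initial_cost standing in for π(x, l)), the linear label distance and the PSP-based similarity are illustrative choices, not the paper's exact setup:

import numpy as np

def psp(sentence_polarities):
    # Positive sentences divided by subjective sentences (None = objective).
    subjective = [p for p in sentence_polarities if p is not None]
    return sum(1 for p in subjective if p > 0) / len(subjective)

def metric_label(x, initial_cost, psps, known_labels, k=3, alpha=0.5):
    # Choose the label minimizing pi(x, l) plus a penalty for disagreeing
    # with the labels of the k most PSP-similar items.
    sim = 1.0 - np.abs(psps - psps[x])        # PSP-based item similarity
    sim[x] = -np.inf                          # exclude the item itself
    neighbors = np.argsort(sim)[-k:]
    costs = {}
    for label in range(1, 6):                 # five-star label set
        penalty = sum(abs(label - known_labels[y]) * sim[y] for y in neighbors)
        costs[label] = initial_cost[label] + alpha * penalty
    return min(costs, key=costs.get)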

20

Benjamin Snyder and Regina Barzilay [3]


Input: a restaurant review, in which opinions may cover food, ambience and service.
Algorithm: the Good Grief algorithm jointly learns ranking models for the individual aspects by modeling the dependencies between the assigned ranks, analyzing meta-relations between opinions such as agreement and contrast. It models the dependencies between different labels via the agreement relation.

21

Benjamin Snyder and Regina Barzilay [3]


The m-aspect ranking model contains m+1 components ((w[1], b[1]), …, (w[m], b[m]), a). The first m components are individual ranking models, one for each aspect; the final component is the agreement model. The model predicts a joint rank for the m aspects which satisfies the individual ranking models as well as the agreement model.

The decoder then predicts the m ranks which minimize the overall grief, i.e., the total divergence from the predictions of the individual rankers and of the agreement model.

22

Benjamin Snyder and Regina Barzilay [3]

23

2. Sentiment analysis approaches


Objects of analysis: text content (adjectives, adverbs); the accuracy of reviews; multiple features/aspects.
Methods: extensions of Support Vector Machines; unsupervised learning.

Disadvantage: the order of words is ignored and important information is lost.

24

Outline
1. Introduction

2. Sentiment analysis approaches


3. Overview of deep learning for applications.

4. Deep learning for sentiment detection.

5. Future research direction

25

3. Overview of deep learning for applications.


Deep learning is a set of algorithms in machine learning that attempt to learn multiple levels of representation, corresponding to different levels of abstraction. It typically uses artificial neural networks. [11]
Deep learning applications:
Handwriting recognition.
Speech processing.

26

Neural network
Artificial neural networks are models inspired by animal central nervous systems (in particular the brain) that are capable of machine learning and pattern recognition. They are usually presented as systems of interconnected "neurons" that can compute values from inputs by feeding information through the network. Main components:
Input, output.
Weights.
Activation function.

27

The simplest model: the perceptron

Learning: for each training example, the weights are adjusted toward the correct output (the classic rule: w ← w + η (t − o) x, where t is the target output, o the perceptron's output, and η the learning rate).
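A runnable Python sketch of this rule, learning the logical AND function as a toy example:

import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    # Classic perceptron learning rule: update only on misclassified examples.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            o = 1 if x @ w + b > 0 else 0      # step activation
            w += lr * (t - o) * x              # no change when o == t
            b += lr * (t - o)
    return w, b

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                     # logical AND
w, b = train_perceptron(X, y)
print([(1 if x @ w + b > 0 else 0) for x in X])  # [0, 0, 0, 1]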

28

Activation function
A step activation reproduces the behavior of the linear perceptron. In multi-layer networks, however, a nonlinear activation function is used (a common choice is the logistic sigmoid σ(x) = 1/(1 + e^(−x)) or tanh), which allows such networks to compute nontrivial problems using only a small number of nodes.

29

Types of artificial neural network:

The feed-forward neural network was the first and arguably the simplest type of artificial neural network devised. In this network the information moves in only one direction, forward: from the input nodes, data goes through the hidden nodes (if any) to the output nodes.

Recurrent neural networks (RNNs) are models with bi-directional data flow. While a feed-forward network propagates data linearly from input to output, RNNs also propagate data from later processing stages to earlier stages. RNNs can be used as general sequence processors.

30

The Boltzmann machine


A Boltzmann machine is a network of units with an "energy" defined for the network. It has binary units, but unlike Hopfield nets, Boltzmann machine units are stochastic. The global energy E is identical in form to that of a Hopfield network:

E = −( Σ_i<j w_ij s_i s_j + Σ_i θ_i s_i )

Problems:
The time the machine must be run in order to collect equilibrium statistics grows exponentially with the machine's size and with the magnitude of the connection strengths.
Connection strengths are more plastic when the units being connected have activation probabilities intermediate between zero and one, leading to a so-called variance trap: the net effect is that noise causes the connection strengths to random-walk until the activities saturate.

31

Restricted Boltzmann Machines (RBM)


Boltzmann machines (BMs) are a particular form of log-linear Markov Random Field (MRF), i.e., one whose energy function is linear in its free parameters. The restriction: no intra-layer connections are allowed, neither hidden-hidden nor visible-visible, which makes inference and learning tractable. The energy function E(v, h) of an RBM is defined as:

E(v, h) = −b^T v − c^T h − h^T W v

where W holds the weights between visible and hidden units and b, c are their biases.
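A small sketch evaluating this energy for a toy binary RBM; the parameters are random and purely for illustration:

import numpy as np

def rbm_energy(v, h, W, b, c):
    # E(v, h) = -b.v - c.h - h.W.v; W couples visible and hidden units only.
    # The missing visible-visible and hidden-hidden terms are exactly
    # the "restriction" of the RBM.
    return -(b @ v) - (c @ h) - (h @ W @ v)

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))        # 2 hidden units, 3 visible units
b, c = rng.normal(size=3), rng.normal(size=2)
v, h = np.array([1, 0, 1]), np.array([0, 1])
print(rbm_energy(v, h, W, b, c))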

32

Deep learning steps


Two main steps:
1. Pre-train one layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann machine (RBM).
2. Fine-tune using supervised backpropagation.
The resulting model is called a deep belief network, and may be built from building blocks other than RBMs.

33

Deep belief network training


1. Train the first layer as an RBM that models the raw input x = h(0) as its visible layer.
2. Use that first layer to obtain a representation of the input that will be used as data for the second layer. Two common solutions exist: this representation can be chosen as the mean activations p(h(1) = 1 | h(0)) or as samples of p(h(1) | h(0)).
3. Train the second layer as an RBM, taking the transformed data (samples or mean activations) as training examples (for the visible layer of that RBM).
4. Iterate (2 and 3) for the desired number of layers, each time propagating upward either samples or mean values.
5. Fine-tune all the parameters of this deep architecture with respect to a proxy for the DBN log-likelihood, or with respect to a supervised training criterion (after adding extra learning machinery to convert the learned representation into supervised predictions, e.g. a linear classifier).
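The greedy loop in steps 1-4 might look like this sketch; train_rbm and mean_activation are hypothetical helpers standing in for an RBM trainer and for p(h = 1 | v):

def pretrain_dbn(raw_input, hidden_sizes, train_rbm, mean_activation):
    # Greedy layer-wise pre-training (steps 1-4); fine-tuning (step 5)
    # would follow with backpropagation on top of the stacked layers.
    data = raw_input                              # h(0): the raw input
    rbms = []
    for n_hidden in hidden_sizes:
        rbm = train_rbm(data, n_hidden)           # steps 1 and 3
        rbms.append(rbm)
        data = mean_activation(rbm, data)         # steps 2 and 4
    return rbms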

34

3.2. Deep learning application


Handwriting recognition: the MNIST dataset consists of handwritten digit images, divided into 60,000 examples for the training set and 10,000 examples for testing. Dan Claudiu Cireșan and Ueli Meier [15]:
Multi-layer perceptrons (MLPs).
Train 5 MLPs with 2 to 9 hidden layers and varying numbers of hidden units; mostly, but not always, the number of hidden units per layer decreases towards the output layer.

35

3.2. Deep learning application


In [15]: (results table not reproduced in this text version)

36

3.2. Deep learning application


Speech recognition: in George Hinton et al. [17], deep neural networks are used to build the acoustic model for speech recognition. Most current speech recognition systems use hidden Markov models (HMMs) to deal with the temporal variability of speech, and Gaussian mixture models (GMMs) to determine how well each state of each HMM fits a frame, or a short window of frames, of coefficients representing the acoustic input. To evaluate the fit, a feed-forward neural network is used instead:
Input: frames of coefficients.
Output: posterior probabilities over HMM states.
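A sketch of such a network's forward pass, mapping a window of coefficients to HMM-state posteriors; the layer sizes and random parameters are placeholders, not [17]'s configuration:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hmm_state_posteriors(frame_window, weights, biases):
    # Hidden layers with tanh; softmax output over HMM states.
    a = frame_window
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(W @ a + b)
    return softmax(weights[-1] @ a + biases[-1])

# Placeholder sizes: 39 coefficients x 11 frames in, 3 hidden layers, 100 states.
rng = np.random.default_rng(0)
sizes = [39 * 11, 512, 512, 512, 100]
weights = [rng.normal(scale=0.01, size=(m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
print(hmm_state_posteriors(rng.normal(size=sizes[0]), weights, biases).sum())  # ~1.0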

37

Outline
1. Introduction

2. Sentiment analysis approaches


3. Overview of deep learning for applications.

4. Deep learning for sentiment detection.

5. Future research direction

38

4. Deep learning for sentiment analysis.


General approach: use semantic word spaces. Semantic word spaces have been very useful but cannot express the meaning of longer phrases in a principled way. Solution: the Sentiment Treebank, with 215,154 phrases in the parse trees of 11,855 sentences, and the Recursive Neural Tensor Network, which predicts the compositional semantic effects present in this new corpus.

39

4. Deep learning for sentiment analysis.


Example of the Recursive Neural Tensor Network accurately predicting 5 sentiment classes, very negative to very positive (−−, −, 0, +, ++), at every node of a parse tree, capturing negation and its scope in the sentence.

40

Recursive Neural Tensor Network (RNTN)


Represent a phrase through word vectors and a parse tree, then compute vectors for higher nodes in the tree using the same tensor-based composition function. Related research areas:
Semantic vector spaces.
Compositionality in vector spaces.
Logical form.
Deep learning.
Sentiment analysis.

41

Semantic Vector Spaces


The dominant approach in semantic vector spaces uses distributional similarities of single words. Variants of this idea use more complex frequencies, such as how often a word appears in a certain syntactic context (Pado and Lapata, 2007; Erk and Pado, 2008). To overcome the limitations of single-word statistics, the neural word-vector approach (Bengio et al., 2003) has been adopted.

42

Compositionality in Vector Spaces


Compositionality algorithms and related datasets capture two-word compositions: Mitchell and Lapata (2010) [24] use two-word phrases and analyze similarities computed by vector addition, multiplication and others. Some related models: holographic reduced representations (Plate, 1995 [21]); the compositional matrix space model (Rudolph and Giesbrecht, 2010 [23]).

43

Compositionality in Vector Spaces


Compositional matrix space model: assigns ordinal sentiment scores to phrases and accounts for critical interactions among the words in each sentiment-bearing phrase. Each word k of phrase i is represented by a matrix W_k, and the phrase itself is represented by the product of its word matrices, W_1 W_2 ⋯ W_d; the score of phrase i is computed from this product.

44

Compositionality in Vector Spaces


Compositional matrix space model (continued):

45

Compositionality in Vector Spaces


The Stanford system:

Recursive neural networks (RNN).
Matrix-vector RNNs (MV-RNN).
New algorithm: the Recursive Neural Tensor Network (RNTN).

46

Recursive Neural Model


Translate the input text to vectors, then compute parent vectors in a bottom-up fashion using a compositionality function g. For the phrase "not very good", with word vectors a = not, b = very, c = good:

p1 = g(b, c)
p2 = g(a, p1)

47

Recursive Neural Network


The parent of two children vectors b, c ∈ R^d is computed as:

p1 = f( W [b; c] ),   p2 = f( W [a; p1] )

where f = tanh is the standard element-wise nonlinearity and W ∈ R^(d×2d) is the learned composition matrix. The label distribution at a node a is computed by a softmax classifier:

y^a = softmax( W_s a )
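A sketch of this composition and classification for the "not very good" example; the dimensionality and random parameters are illustrative (a trained model would learn W, W_s and the word vectors):

import numpy as np

d = 4                                         # toy word-vector dimensionality
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(d, 2 * d))    # composition matrix
Ws = rng.normal(scale=0.1, size=(5, d))       # softmax weights, 5 sentiment classes

def compose(left, right):
    # p = f(W [b; c]) with f = tanh, applied element-wise
    return np.tanh(W @ np.concatenate([left, right]))

def sentiment_distribution(node_vec):
    z = Ws @ node_vec
    e = np.exp(z - z.max())
    return e / e.sum()

# "not very good": p1 = g(very, good), p2 = g(not, p1)
not_v, very_v, good_v = (rng.normal(size=d) for _ in range(3))
p1 = compose(very_v, good_v)
p2 = compose(not_v, p1)
print(sentiment_distribution(p2))             # 5-class distribution at the root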

48

Matrix-Vector Recursive Neural Network (MV-RNN)


Represent every word and longer phrase in a parse tree as both a vector and a matrix. For children (b, B) and (c, C):

p = f( W [Cb; Bc] ),   P = W_M [B; C]

The word matrices can capture compositional effects specific to each word, whereas W captures a general composition function. Nonlinear functions compute compositional meaning representations for multi-word phrases or full sentences. Disadvantage: the number of parameters becomes very large and depends on the size of the vocabulary [26].
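A sketch of the MV-RNN composition under the equations above; the dimensions and parameters are again illustrative:

import numpy as np

d = 4
rng = np.random.default_rng(2)
W  = rng.normal(scale=0.1, size=(d, 2 * d))    # vector composition matrix
WM = rng.normal(scale=0.1, size=(d, 2 * d))    # matrix composition matrix

def mv_compose(b, B, c, C):
    # Each child's matrix first transforms its sibling's vector.
    p = np.tanh(W @ np.concatenate([C @ b, B @ c]))
    # The node's own matrix is composed from the children's matrices.
    P = WM @ np.vstack([B, C])                 # (d x 2d) @ (2d x d) -> d x d
    return p, P

b, c = rng.normal(size=d), rng.normal(size=d)
B, C = np.eye(d), np.eye(d)                    # identity = "no modification"
p, P = mv_compose(b, B, c, C)
print(p.shape, P.shape)                        # (4,) (4, 4)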

49

Recursive Neural Tensor Network (RNTN)


Provide an interaction that allows the model to capture multiplicative interactions between the input vectors. RNTN: the main idea is to use the same tensor-based composition function for all nodes. With h = [b; c] and a tensor V^[1:d] ∈ R^(2d×2d×d), a single tensor layer computes:

p = f( h^T V^[1:d] h + W h )
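A sketch of the single tensor layer; the slice-by-slice loop makes the bilinear form h^T V[k] h explicit (dimensions and random parameters are illustrative):

import numpy as np

d = 4
rng = np.random.default_rng(3)
V = rng.normal(scale=0.1, size=(d, 2 * d, 2 * d))  # one 2d x 2d slice per output dim
W = rng.normal(scale=0.1, size=(d, 2 * d))

def rntn_compose(b, c):
    h = np.concatenate([b, c])
    # Slice k contributes the bilinear form h^T V[k] h to output dimension k.
    bilinear = np.array([h @ V[k] @ h for k in range(d)])
    return np.tanh(bilinear + W @ h)

p1 = rntn_compose(rng.normal(size=d), rng.normal(size=d))   # e.g., (very, good)
p2 = rntn_compose(rng.normal(size=d), p1)                   # e.g., (not, p1)
print(p2)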

50

Recursive Neural Tensor Network (RNTN)

51

Tensor Backprop through Structure


The error as a function of the RNTN parameters θ = (V, W, W_s, L) for a sentence is:

E(θ) = Σ_i Σ_j t_j^i log y_j^i + λ ‖θ‖²

The full derivative for slice V^[k] for this tri-gram tree is then the sum at each node:

∂E/∂V^[k] = δ^(p2,k) [a; p1][a; p1]^T + δ^(p1,k) [b; c][b; c]^T

52

Recursive Neural Tensor Network (RNTN)

53

Stanford Sentiment analysis source code


Libraries are available for Java, C# and Python. From the input text it extracts:
POS and NER tags (CRF tagging).
A parsed sentiment tree.

Online demo:
nlp.stanford.edu:8080/sentiment/rntnDemo.html http://nlp.stanford.edu/sentiment/treebank.html
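A usage sketch via the third-party pycorenlp wrapper, assuming a CoreNLP server is already running locally on port 9000; the JSON field names follow CoreNLP's sentiment output format:

from pycorenlp import StanfordCoreNLP

# Assumes a CoreNLP server started separately, for example with:
# java -mx5g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
nlp = StanfordCoreNLP("http://localhost:9000")
text = ("Stanford University is located in California. "
        "It is a great university, founded in 1891.")
result = nlp.annotate(text, properties={
    "annotators": "tokenize,ssplit,pos,ner,parse,sentiment",
    "outputFormat": "json",
})
for sentence in result["sentences"]:
    print(sentence["sentiment"], sentence["sentimentValue"])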

54

Stanford Sentiment analysis source code


Input text: Stanford University is located in California. It is a great university, founded in 1891.
<document>
 <sentences>
  <sentence id="1">
   <tokens>
    <token id="1">
     <word>Stanford</word>
     <lemma>Stanford</lemma>
     <CharacterOffsetBegin>0</CharacterOffsetBegin>
     <CharacterOffsetEnd>8</CharacterOffsetEnd>
     <POS>NNP</POS>
     <NER>ORGANIZATION</NER>
     <Speaker>PER0</Speaker>
    </token>
   <parse>(ROOT (S (NP (PRP It)) (VP (VBZ is) (NP (NP (DT a) (JJ great) (NN university)) (, ,) (VP (VBN founded) (PP (IN in) (NP (CD 1891)))))) (. .)))</parse>
   <dependencies type="basic-dependencies">
    <dep type="root">
     <governor idx="0">ROOT</governor>
     <dependent idx="5">university</dependent>
    </dep>
    <dep type="nsubj">
     <governor idx="5">university</governor>
     <dependent idx="1">It</dependent>
    </dep>

55

Stanford Sentiment analysis source code

56

Outline
1. Introduction

2. Sentiment analysis approaches


3. Overview of deep learning for applications. 4. Deep learning for sentiment detection.

5. Future research direction

57

5. Future research direction


Overview of deep learning in sentiment detection.

Other sentiment analysis research:
Sentiment Treebank.
Paragraph-level positive/negative detection.

Research for the Vietnamese language:
Vietnamese Treebank (VLSP).
Word and phrase processing.

58

THANK YOU FOR YOUR ATTENTION

59
