I. INTRODUCTION
Text classification is a valuable application of machine learning. The goal of text
classification is to place previously unseen pieces of text into categories predefined by the
classification problem. Within text classification, sentiment analysis is a widely pursued
goal: depending on its attributes, a piece of text is ultimately placed into a positive or a
negative class. Accurate sentiment analysis of text can yield deep insights into the real
world. For example, a company like Amazon can estimate the popularity of its products by
running sentiment analysis on product reviews left by customers, and use this knowledge to
increase profits. Consequently, much cutting-edge research and development is happening in
the realm of sentiment analysis.
There are several supervised machine learning techniques that can be used for sentiment
analysis. In a paper titled "Baselines and Bigrams: Simple, Good Sentiment and Topic
Classification", Sida Wang and Christopher Manning evaluate, among other methods, the
potency of Support Vector Machines (SVMs) in accurately classifying text as positive or
negative. The paper, however, does not offer an in-depth analysis of how the accuracy of
SVMs in sentiment analysis depends on the tuning of the parameters they use. The purpose of
this project is to analyse how the accuracy of an SVM performing sentiment analysis varies
with changes in hyperparameter values.
II. DATASETS
To study the performance of SVMs, two different datasets were used. They are:
(i) the 2000 full-length movie reviews dataset [3], and
(ii) the IMDB movie reviews dataset [4].
III. METHODOLOGY
The project studies the performance of SVMs with (i) a linear kernel and (ii) a radial
kernel, across the unigram and bigram cases of the two datasets. The data in both datasets
was tf-idf normalized.
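As a concrete illustration of the normalization step, the sketch below computes tf-idf vectors by hand for a toy corpus, assuming the textbook formulation (raw term count times log inverse document frequency, followed by L2 normalization). The project itself relied on a library implementation, which may use a smoothed idf variant, so treat this as a sketch of the idea rather than the exact pipeline.

```python
import math

def tfidf(docs):
    """tf-idf vectors: raw term count times log(N / df), then L2-normalized."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc.split()})
    # document frequency: in how many documents does each term appear?
    df = {t: sum(t in doc.split() for doc in docs) for t in vocab}
    vectors = []
    for doc in docs:
        toks = doc.split()
        weights = [toks.count(t) * math.log(float(n_docs) / df[t]) for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        vectors.append([w / norm for w in weights])
    return vocab, vectors

vocab, vecs = tfidf(["great great movie", "terrible movie", "great acting"])
```

Each resulting vector has unit length, so documents of different lengths become directly comparable, which is the point of the normalization.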
(i) Linear SVMs:
For linear SVMs, the parameters considered were the regularization parameter (C) and the
penalty ('l1' or 'l2'). We used a wide range of C values, from 0.0001 to about 13421,
doubling each C value to obtain the next one. The list of C values used across all linear
SVM classifications in this project is therefore:
0.0001, 0.0002, 0.0004, 0.0008, 0.0016, 0.0032, 0.0064, 0.0128, 0.0256, 0.0512, 0.1024,
0.2048, 0.4096, 0.8192, 1.6384, 3.2768, 6.5536, 13.1072, 26.2144, 52.4288, 104.8576,
209.7152, 419.4304, 838.8608, 1677.7216, 3355.4432, 6710.8864, 13421.7728
For each C value, the accuracy of the SVM is the mean of the individual accuracies
measured over 10-fold cross-validation.
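The doubling C grid is easy to reproduce programmatically, and the 10-fold split can be sketched in plain Python. Here `accuracy_for` is a hypothetical stand-in for training and scoring the SVM on one fold; the actual classifier came from an off-the-shelf library, so this is only a sketch of the evaluation structure.

```python
# Reproduce the C grid: start at 0.0001 and double 27 times (28 values total).
c_values = [0.0001 * 2 ** i for i in range(28)]

def kfold_indices(n, k=10):
    """Yield (train, test) index lists for k-fold cross-validation."""
    fold = n // k
    idx = list(range(n))
    for i in range(k):
        test = idx[i * fold:(i + 1) * fold]
        train = idx[:i * fold] + idx[(i + 1) * fold:]
        yield train, test

def cv_accuracy(n, accuracy_for, k=10):
    """Mean accuracy over k folds; accuracy_for(train, test) scores one fold."""
    scores = [accuracy_for(train, test) for train, test in kfold_indices(n, k)]
    return sum(scores) / len(scores)
```

Note that reporting the mean over the 10 folds, rather than a single held-out score, reduces the variance of each accuracy estimate.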
(ii) Radial SVMs:
For radial SVMs, the parameters considered were the regularization parameter (C) and the
kernel coefficient (gamma). We used values ranging from 0.0001 to 10000 for both C and
gamma, increasing by a factor of 10 from one value to the next.
The list of C values used for all radial SVMs in this project is:
0.0001, 0.001, 0.01, 0.1, 1.0, 10, 100, 1000, 10000
Similarly, the list of gamma values used for all radial SVMs in this project is:
0.0001, 0.001, 0.01, 0.1, 1.0, 10, 100, 1000, 10000
For each combination of C and gamma, the accuracy of the SVM is the mean of the individual
accuracies measured over 10-fold cross-validation. In each case, the best-performing
combination was identified based on the accuracies obtained for the different C and gamma
values.
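The exhaustive sweep over the two 9-value grids can be sketched as a plain grid search. Here `evaluate` is a hypothetical placeholder for the 10-fold cross-validated accuracy of a radial SVM at a given (C, gamma), which the project obtained from its SVM library.

```python
import itertools

EXPONENTS = range(-4, 5)  # yields 0.0001, 0.001, ..., 10000 for both C and gamma

def grid_search(evaluate):
    """Score every (C, gamma) pair on the 9x9 grid; return the best triple."""
    best = None
    for c_exp, g_exp in itertools.product(EXPONENTS, EXPONENTS):
        c, gamma = 10.0 ** c_exp, 10.0 ** g_exp
        acc = evaluate(c, gamma)
        if best is None or acc > best[2]:
            best = (c, gamma, acc)
    return best
```

With the cross-validated accuracies plugged in as `evaluate`, this loop visits all 81 cells of each accuracy table in section V and returns the best one.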
The SVMs used in this project were not created from scratch. Instead, an off-the-shelf
Python-based SVM library was used. Graphs were produced with the graphing library
Matplotlib, and all development was done using Python 2.7.6.
IV. GRAPHS
Fig 1: Lin. SVM Accuracy vs Log(C)Base(2) w/ l2 and Unigrams, 2K movie revs dataset
Fig 2: Lin. SVM Accuracy vs Log(C)Base(2) w/ l2 and Bigrams, 2K movie revs dataset
Fig 3: Lin. SVM Accuracy vs Log(C)Base(2) w/ l1 and Unigrams, 2K movie revs dataset
Fig 4: Lin. SVM Accuracy vs Log(C)Base(2) w/ l1 and Bigrams, 2K movie revs dataset
Fig 5: Lin. SVM Accuracy vs Log(C)Base(2) w/ l2 and Unigrams, IMDB dataset
Fig 6: Lin. SVM Accuracy vs Log(C)Base(2) w/ l2 and Bigrams, IMDB dataset
Fig 7: Lin. SVM Accuracy vs Log(C)Base(2) w/ l1 and Unigrams, IMDB dataset
Fig 8: Lin. SVM Accuracy vs Log(C)Base(2) w/ l1 and Bigrams, IMDB dataset
V. DATA TABLES
* 0.0001 0.001 0.01 0.1 1.0 10 100 1000 10000
0.0001 49 49 49 49 49 49 49 77 84
0.001 49 49 49 49 49 49 77 84 82
0.01 49 49 49 49 49 77 84 82 82
0.1 49 49 49 49 77 84 82 82 82
1.0 49 49 49 49 81 83 83 83 83
10 49 49 49 49 51 52 52 52 52
100 49 49 49 49 49 49 49 49 49
1000 49 49 49 49 49 49 49 49 49
10000 49 49 49 49 49 49 49 49 49
Table 1: Rad. SVM Accuracy table with gamma across rows and C across columns using
Unigrams, 2K reviews dataset
* 0.0001 0.001 0.01 0.1 1.0 10 100 1000 10000
0.0001 49 49 49 49 49 49 49 49 90
0.001 49 49 49 49 49 49 77 90 90
0.01 49 49 49 49 49 77 90 90 90
0.1 49 49 49 49 86 88 88 88 88
1.0 49 49 49 49 49 49 49 49 49
10 49 49 49 49 49 49 49 49 49
100 49 49 49 49 49 49 49 49 49
1000 49 49 49 49 49 49 49 49 49
10000 49 49 49 49 49 49 49 49 49
Table 2: Rad. SVM Accuracy table with gamma across rows and C across columns using
Bigrams, 2K reviews dataset
* 0.0001 0.001 0.01 0.1 1.0 10 100 1000 10000
0.0001 48 48 48 48 48 48 48 54 90
0.001 48 48 48 48 48 48 54 90 90
0.01 48 48 48 48 48 54 90 90 90
0.1 48 48 48 48 49 89 89 89 89
1.0 48 48 48 48 86 88 88 88 88
10 48 48 48 48 48 48 48 48 48
100 48 48 48 48 48 48 48 48 48
1000 48 48 48 48 48 48 48 48 48
10000 48 48 48 48 48 48 48 48 48
Table 3: Rad. SVM Accuracy table with gamma across rows and C across columns using
Bigrams, IMDB dataset
VI. DISCUSSION
As the graphs in section IV make clear, 'l2' penalty tends to yield more accurate SVM
classification than 'l1' penalty, and this holds across both datasets. The average graph
height for all four 'l2' graphs is clearly higher than that of their corresponding 'l1'
versions, which translates into higher accuracy. An interesting find is the fundamental
difference in behaviour between the two penalties: with 'l2', the variance in accuracy is
clearly lower than with 'l1'. With 'l1', the graph of accuracy as a function of the
regularization parameter, C, has a step-function-like quality, with consistent performance
of about 50% through the first half of the C values followed by a steep increase to about
80% over a short range of C values. Interestingly, although 'l2', on average, performs
better with bigram features, the highest unigram accuracy beats the highest bigram
accuracy, which is indicative of a higher variance in accuracy with unigrams than with
bigrams. This trend is seen across both datasets, and the reason likely lies within the
structure of the data itself, which can favour unigram performance in some cases. The same
is reflected in the 'l1' cases: although the combination of 'l1' and bigrams produces
better average performance, 'l1' with unigrams delivers a higher single accuracy value than
'l1' with bigrams.
Another trend common to both datasets appears with 'l1' penalty and bigram features: in
both cases, the graphs finish with a positive slope. Further analysis over neighbouring
values, however, shows that accuracy soon starts to dip below optimal levels as we keep
increasing C. In that light, it seems safe to conclude that the final positive slope is an
artefact of the datasets and does not correspond to a trend of ever-increasing accuracy
with increasing C. C values in the set {0.2048, 0.4096, 0.8192} are the best-performing
regularization values for 'l2' penalty, averaging approximately 85% accuracy across both
datasets. For 'l1' penalty, on the other hand, the best-performing values lie in a much
higher range, {419.4304, 838.8608, 1677.7216, 3355.4432}, averaging about 81.5-82%. Keep
in mind that although these C values seem very far apart, each step is double the previous
one, so these are effectively four consecutive values in the scope of the project.
The radial kernels showed considerably better performance than the linear SVMs. The highest
accuracies achieved were 84% for the unigram case of the 2K reviews dataset, 90% for the
bigram case of the 2K reviews dataset, and 90% for both the unigram and bigram cases of the
IMDB dataset. It is interesting to note that, across the table of values for each of the
four cases, most of the highest accuracies are achieved at the edges of the table. As with
'l1' and bigrams in the linear SVM case, this makes one wonder whether accuracy would
continue to increase if we extended the range of values in the direction of increasing
accuracies. Further experimentation, however, shows that it would not: we would either see
a dip in accuracy or a flatlining at the current maximum. An unexpected find is that,
across datasets, maximum accuracy is achieved towards the top end of the anti-diagonal of
the table. This tells us that neither C nor gamma alone has the stronger effect on
accuracy; it is really the combination of the two that matters. The remaining values in
each of the data tables corroborate this conclusion, as only for high C and low gamma
values do we see high accuracy and effective classification, while the remaining areas of
the table reflect relatively lower accuracies. Specifically, the combinations that produce
the highest accuracy in each case are:
(gamma=0.0001, C=10000), (gamma=0.001, C=1000), and (gamma=0.01, C=100).
Notice that in each of these cases, the product of gamma and C is 1.
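The product claim is a quick arithmetic check:

```python
# Best (gamma, C) pairs from the tables above; each product should come out to 1.
best_pairs = [(0.0001, 10000), (0.001, 1000), (0.01, 100)]
products = [gamma * c for gamma, c in best_pairs]
```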
The analysis also gives insight into the structure of the two datasets: they appear to be
fairly uniform in structure and other broad parameters, as the classifiers perform very
similarly across both.
VII. FUTURE WORK
Given the flexibility of Support Vector Machines and the challenges of sentiment analysis,
much future work can be based on the experiments done in this project. Apart from using
unigram and bigram frequencies, researchers can use a variety of other feature extraction
methods to capture information from punctuation, the structure of paragraphs,
capitalization, and even grammatical errors. Using such features will allow for more
realistic modelling of text and will aid the process of sentiment analysis. The idea of
unigrams and bigrams can also be extended to general n-gram frequencies while modelling
text. It is important to realize, however, that the feature space beyond 4-grams/5-grams
becomes phenomenally large, and mathematical techniques to reduce the feature space may be
necessary for computational reasons. Additionally, kernels play an important role in SVM
classification: using polynomial kernels with a variety of degree options, and allowing
for kernels specifically designed for sentiment analysis tasks, could increase SVM
accuracy. A comprehensive list of stopwords specific to the domain of the text may also
help support a more robust form of text modelling. Given that off-the-shelf SVMs with
commonly used parameters showed a best-case accuracy of approximately 90%, a larger range
of gamma and C values, along with custom-defined loss functions, penalties, and kernels,
could certainly lead to considerably higher accuracy. Lastly, Natural Language Processing
(NLP) is an entire field of computer science that deals with how computer systems
understand human language; several advanced NLP techniques could be used to strengthen
text modelling and, therefore, increase SVM accuracy.
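To see why the n-gram feature space explodes, the sketch below extracts n-grams from a token sequence and counts the number of *possible* n-gram features, which grows as the vocabulary size raised to the power n. The vocabulary size of 5000 is an assumed, illustrative figure, not a measurement from the project's datasets.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# The number of possible n-gram features grows as vocab_size ** n, which is
# why 4-gram/5-gram models tend to need feature-space reduction techniques.
vocab_size = 5000  # assumed: a modest review-corpus vocabulary
possible_features = {n: vocab_size ** n for n in (1, 2, 3, 4)}
```

Even though most of these combinations never occur in real text, the observed feature count still grows quickly with n, which motivates the dimensionality-reduction point above.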
VIII. ADDITIONAL EXPERIMENTATION
As linear SVMs seem to perform better with bigrams than with unigrams, a natural next step
for the project is to consider the trigram case with both 'l1' and 'l2' penalties. Graphs
of linear SVM accuracy as a function of Log(C)Base(2) were produced for this case. Both
graphs show the tendencies exhibited by the linear SVM graphs for the unigram and bigram
cases: as the value of n in n-gram increases, the variance in prediction accuracy
decreases. The trigram analysis was done using the 2K reviews dataset. Once again, as
expected, the highest accuracy for the trigram case came out below that of the bigram case
for both 'l1' and 'l2' penalties.
IX. CONCLUSION
Although there is much room to improve this project's experiments in the realm of feature
extraction and text modelling, the experiments reveal clear patterns in SVM performance and
show that these patterns are governed by the parameters of the SVM. We saw that higher
n-gram features lead to predictions with relatively higher certainty (lower variance) than
lower n-gram features, that lower n-gram features can nonetheless reach higher peak
accuracies, and that the accuracy of a radial-kernel SVM depends on the combination of
gamma and C, rather than on either one of them.