Vous êtes sur la page 1sur 51

Text Categorization

A re-examination of text categorization methods by Yiming Yang and Xin Liu

What is Text Categorization? "Text Categorization (TC) is a supervised learning task, defined as assigning category labels to new documents based on the likelihood suggested by a training set of labelled documents".
(Yiming, and Liu)

news stories are typically organized by subject categories or topics. They are one the most commonly investigated application domains in the TC literature.

spam filtering where email messages are classified into the two categories of spam and non-spam.

An increasing number of learning approaches have been applied including: regression models nearest neighbor classification Bayesian probabilistic approaches decision trees inductive rule learning neural networks on-line learning support vector machines

While the richness of a particular subject provides valuable information about individual methods, clear conclusions about cross-methods comparison have been difficult because often the published results are not directly comparable. For example, one cannot tell whether the performance of neural networks (NNet) by Weiner is statistically better or worse than the performance of the Support Vector Machines (SVM) by Joachims because different data collections were used in the evaluations of those methods.

Text Categorization Methods

In this paper, the authors address these issues by conducting a controlled study on five well-known text categorization methods: Neural Network (NNet) Support Vector Machines (SVM) Naive Bayesian (NB) k-Nearest Neighbor (kNN) Linear Least-squares Fit (LLSF)

Text Categorization Methods (continued)

Specifically, this paper contains the following new contributions: Provides directly comparable results of the five methods on the new benchmark corpus Reuters-21578. Proposes a variety of statistical significance tests for different standard performance measures, and suggests a way to jointly use these tests for cross-method comparison. Observes the performance of each classifier as a function of the training-set category frequency.

Task, Corpus and Performance Measures

To make their evaluation results comparable to most of the published results in TC evaluations, the authors chose newswire stories as the task and Reuters-21578 as the corpus. For this paper, they use the AptoMod version of Reuters21578, which was obtained by eliminating unlabeled documents and selecting categories in both the training and test sets. Note: Reuters-21578 is the most widely used test collection for text categorization research.

Task, Corpus and Performance Measures (continued)

This process resulted in 90 categories in both the training and test sets. After eliminating documents that do not belong in those 90 categories they obtained a training set of 7769 documents, a test set of 3019 documents, and a vocabulary of 24240 unique words. The # of categories per document is 1.3 on average. The category distribution is skewed: -The most common category has a training set of 2877, but 82% have less 100 instances and 32% have less than 10 instances

Task, Corpus and Performance Measures (continued)

For evaluating the effectiveness of category assignments by classifiers to documents, they use the standard recall, precision and F1 measure. Recall is the ratio of correct assignments by the system divided the total number of correct assignments. Precision is the ratio of correct assignments by the system divided by the total number of system assignments. F1 measure combines recall (r) and precision (p) with an equal weight in the following form:

F1 measure:

Task, Corpus and Performance Measures (continued)

These scores can be computed for the binary decisions on each individual category first and then be averaged over categories. This is called macro-averaging. Or, they can be computed globally over all n x m binary decisions where n = # of total test documents and m = # of categories in consideration. Also, they use error as an additional measure which is the ratio of wrong assignments by the system divided by the total # of systems assignments (n x m)


Introduced by Vapnik in 1995 for solving two-class pattern recognition problems. The method is defined over a vector space where the problem is to find a decision surface that "best" separates the data points into two classes. In order to define the "best" separation the authors introduce the "margin" between two classes as shown in the next figures:

SVM (continued)
The SVM problem is to find the decision surface that maximizes the margin between the date points in a training set. The decision surface by SVM for linearly separable space is a hyperplane which can be written as:

vector x = an arbitrary data point; vector w and constant b are learned from a training set of linearly separable data.

SVM (continued)
Letting D = {(yi, vector xi)} denote the training set, and yi E {+-1} be the classification for vector x The SVM problem is to find vector w and b that satisfies the following constraints

and that the vector 2-norm of vector w is minimized. It can be solved using quadratic programming techniques.

kNN stands for k-nearest neighbor classification Simple algorithm: given a test document, the system finds the k-nearest neighbor among the training documents and uses the categories of the k neighbors to weight the category candidates. The rule can be written as:

where y(vector x, cj) E {0,1} = classification for document vector di with respect to category cj; sim(vector x, vector di) = similarity of test document vector x and training document vector di; bj = the category specific threshold for binary decisions

LLSF stands for Linear Least Squares Fit A multivariate regression model is automatically learned from a training set of documents and their categories. input vector is a document in the conventional vector space model output vector consists of categories of the correspoing document.

LLSF (continued)
by solving a LLSF on the training pair of vectors one can obtain a matrix of word-category regression coefficients:

where matrices A and B present the training data and matrix FLS is the solution matrix, defining a mapping from an arbitrary document to a vector of weighted categories.

NNet techniques have been intensively studied in artificial intelligence. In their experiment, a practical consideration is the training cost. Training NNet is usually much more time consuming than the other classifiers. It would be too costly to train one NNet per category therefore, the authors trained one NNet for all the 90 categories of Reuters.

Naive Bayes (NB) probabilistic classifiers are commonly studied in machine learning. the basic idea in NB approaches is to use the joint probabilities of words and categories to estimate the probabilities of categories given a document.

Significance Tests

Micro Sign Test (S-Test)

This is a sign test design to compare two systems, A and B, based on their binary decisions on all the document/category pairs. They use the following notation:

Micro Sign Test (S-Test) (continued)

Null hypothesis: k = 0.5n or k has a binomial distribution of Bin(n,p) where p=0.5 Alternative hypothesis: k has binomial distribution of Bin(n,p) where n > 5, meaning that system A is better than B.

Micro Sign Test (S-Test) (continued)

If n <= 12 and k >= 0.5n the P-value is computed using the binomial distribution under the null hypothesis:

Micro Sign Test (S-Test) (continued)

If n <= 12 and k < 0.5n the P-value for the other extreme is computed using the the formula:

The P-value indicates the significance level of the observed evidence against the null hypothesis. e.i: system A is better or worse than B

Micro Sign Test (S-Test) (continued)

If n > 12, the P-value can be computed using the standard normal distribution:

Macro Sign Test (S-Test)

This sign test is design to compare two systems A and B using the paired F1 values for individual categories. They use the following notation:

The test hypothesis and the P-value computation are the same as those an in Micro s-test

Macro t-Test (T-Test)

Same notation as defined for S-Test, and add the following items:

Macro T-Test after rank transformation

To compare two systems, A and B, based on the F1 values after rank transformation, in which the F1 values of the two systems on individual categories are pooled together and sorted, then these values are replaced by the corresponding ranks. The following notation is used:

Comparing proportions (p-test)

For performance measures which are proportions, such as recall, precision, error, the scores of systems A and B are compared as below:

Comparing proportions (p-test)

Comparing proportions (p-test) (continued)

Comparing proportions (p-test) (continued)

Again: s-test and p-test are design to evaluate the performance of systems at a micro level, which is based on the pooled decisions on individual document/category pairs p-test and t'-test are design to evaluate at a macro level, using the performance scores on each category as the unit measure.


Experiments set up
1000 features were selected for NNet 2000 for NB 2415 for kNN and LLSF 10000 for SVM

Experiments set up
Other setting were:


Micro averaged F1 score for SVM (.85099) is slightly lower than Joachims results(.860-.864) Micro averaged F1 score (.8567) for kNN is higher than Joachims (.823) Micro averaged F1 score (.7956) for NB is higher than Joachims (.720)

Cross-classifier Comparison

Cross-classifier Comparison
The micro level analysis on pooled binary decisions suggests that SVM > kNN >> {LLSF,NNet} >> NB where the classifiers with insignificant performance are grouped into one set. Macro level analysis on the F1 scores suggests: {SVM,kNN,LLSF} >> {NB,NNet}

Cross-classifier Comparison

Performance curve on rare categories

Cross-classifier Comparison

Performance curve on all categories