
Comparison of Some Arabic Text Classification

Techniques Using a Multinomial Mixture Model



Prepared by:
Siham Abdalhady Hasan
Supervised by:
Prof. Ghassan Kanaan
This thesis was submitted in partial fulfilment of the
requirements for the Master's Degree of Science in Computer
Science, Faculty of Computer Sciences and Informatics

Amman Arab University
2013

Authorization
I, Siham Abdalhady Hasan, authorize Amman Arab University to
provide copies of my thesis to libraries, institutions or anyone
requesting a copy.

Name: Siham Abdalhady Hasan

Signature:

Date: 11/1/2013










Name: Siham Abdalhady Hasan.

Degree: Master of Computer Science.

Title of thesis in English:
Comparison of Some Arabic Text Classification
Techniques Using a Multinomial Mixture Model

Title of thesis in Arabic:



Examining Committee Signature
Dr. Riyad F. Al-Shalabi
Dr. Ghassan Kanaan
Dr. Omar Dabas





Abbreviations
Abbreviation Description
IR Information Retrieval
TC Text Classification
ATC Arabic Text Classification
WWW World Wide Web
MMM Multinomial Mixture Model
KNN K-Nearest Neighbour
NB Naïve Bayes
SVM Support Vector Machine
D Document
C Class
HTML Hyper Text Markup Language
SGML Standard Generalized Markup Language
XML Extensible Markup Language
Re Recall
Pr Precision
FSS Feature Subset Selection
VMF Von Mises-Fisher
BPSO Binary Particle Swarm Optimisation
LDA Latent Dirichlet Allocation
MaP Macro Precision
MaR Macro Recall
TF Term Frequency
IDF Inverse Document Frequency

Table of Contents
1. Chapter One: Introduction .......................................................................................................... 1
1.1. Introduction ........................................................................................................................ 2
1.2. Arabic Language .................................................................................................................. 4
1.3. The Statement of the Problem............................................................................................ 7
1.4. Thesis Objective .................................................................................................................. 8
1.5. Summary ............................................................................................................................. 8
2. Chapter Two: Literature Review ................................................................................................. 9
2.1. Literature Review .............................................................................................................. 10
2.1.1. Text Classification ...................................................................................................... 10
2.1.2. Arabic Text Classification .......................................................................................... 17
2.2. Summary ........................................................................................................................... 25
3. Chapter Three: Methodology .................................................................................................... 26
3.1. Introduction ...................................................................................................................... 27
3.2. System Architecture .......................................................................................................... 27
3.2.1. Arabic Corpus ............................................................................................ 28
3.2.2. Pre-processing ........................................................................................................... 29
3.2.3. Classifiers ................................................................................................................... 33
3.2.4. Evaluation .................................................................................................................. 40
3.3. Summary ........................................................................................................................... 41
4. Chapter Four: Experiments and Evaluation .............................................................................. 42
4.1. Introduction ...................................................................................................................... 43
4.2. Data Set Preparation ......................................................................................................... 43
4.3. Performance measures: .................................................................................................... 44
4.4. Evaluation Results ............................................................................................................. 48
4.4.1. Naïve Bayes Algorithm Using (MMM) ...................................................... 48
4.4.2. Comparison of MMM With Other Techniques and Discussion of Results ................. 51
4.5. Results of Naïve Bayes algorithm (MMM) with 5070 documents .................... 54
4.6. Summary ........................................................................................................................... 58
5. Chapter Five: Conclusion and Future Work .............................................................................. 59

5.1. Conclusion: ........................................................................................................................ 60
5.2. Future Work: ..................................................................................................................... 60
Reference ...................................................................................................................................... 61























Table of Figures
Figure 3-1 Text Classification Architecture ........................................................................ 28
Figure 3-2 Pre-processing Steps ......................................................................................... 29
Figure 3-3 An example for KNN ......................................................................................... 36
Figure 4-1 Result of the Naive Bayes MMM Classification Algorithm ............................ 51
Figure 4-2 MaF1, MiF1 Comparison for Classifiers ............................................................. 52
Figure 4-3 MaP Comparison for Classifiers ........................................................................ 53
Figure 4-4 MaR Comparison for Classifiers ....................................................................... 53
Figure 4-5 Precision, Recall and F-Measure for the Three Classifiers ................................ 54
Figure 4-6 Result of the Naive Bayes Classification Algorithm .......................................... 57











List of Tables
Table 3-1 Strings removed by light stemming ................................................................... 31
Table 4-1 Number of Documents for each Category ......................................................... 44
Table 4-2 Confusion Matrix for Performance Measures .................................................. 45
Table 4-3 The Global Contingency Table ........................................................................... 46
Table 4-4 Macro-Average ................................................................................................... 46
Table 4-5 Micro-Average .................................................................................................... 47
Table 4-6 Confusion Matrix Results for NB Using MMM Algorithm .................................. 49
Table 4-7 Confusion Matrix Results for NB Algorithm ...................................................... 50
Table 4-8 NB Using MMM Classifier Weighted Average for the Nine Categories ............. 51
Table 4-9 Classifier Comparison ......................................................................................... 52
Table 4-10 Categories and Their Distributions in the Corpus (5070 Documents) ............. 55
Table 4-11 Confusion Matrix Results for NB Algorithm in the Corpus (5070 Documents) ............... 55
Table 4-12 Confusion Matrix Results for NB Using MMM in the Corpus (5070 Documents) ................ 56
Table 4-13 NB Using MMM Classifier Weighted Average for the Six Categories in the Arabic Corpus
(5070 Documents) ............................................................................................................. 57












Acknowledgements

I would like to express my sincerest gratitude to my supervisor, Prof. Ghassan
Kanaan, who has been exceptionally patient and understanding with me during
my studies. Without his kind words of encouragement and advice this work
would not have been possible.
I am extremely grateful to all the staff who have assisted me in the Department
of Computer Sciences and Informatics, especially Prof. Alaa Al-Hamami.
Thanks also to all of my other colleagues in Computer Sciences and
Informatics for making my time here an enjoyable experience.

I would like to thank the Libyan Embassy in Amman for taking care of me and
supporting my studies.

The support of my family and friends has been much appreciated, and most
importantly, I would like to thank my husband, Ali, and my children, to whom I
am indebted for all of the moral and loving support they have given me during
this time.

Abstract
Text Classification (TC) assigns documents to one or more predefined
categories based on their contents. This project focuses on comparing three
automatic TC techniques on the Arabic language: Rocchio, K-Nearest
Neighbour (KNN) and the Naïve Bayes (NB) classifier using a multinomial
mixture model (MMM). To evaluate the mentioned techniques with the MMM,
an Arabic TC corpus was used that consists of 1445 Arabic documents
classified into nine categories: Computer, Economics, Education, Sport,
Politics, Engineering, Medicine, Law, and Religion. The main goal of this
project is to compare some automatic text classification techniques using a
multinomial mixture model on the Arabic language. The classification
effectiveness has been compared with the SVM model, which was applied in
another project that used the same traditional classifiers and the same
collection. The experimental results are presented in terms of macro-averaged
precision, macro-averaged recall and macro-averaged F1 measures.
Furthermore, the results reveal that Naïve Bayes using the MMM works best
for Arabic TC tasks and outperforms the k-NN and Rocchio classifiers.




















1. Chapter One: Introduction













1.1. Introduction
With the rapid development of the Internet, a large amount of Arabic
information has become available online; this motivates researchers to find
tools that may help people classify this huge volume of Arabic information.
An information retrieval system is designed to analyse, process and store
sources of information and to retrieve those that match a particular user's
requirements. In other words, in an IR system, similarity scores between a
query and a set of documents are calculated, and the relevant documents are
ranked based on their similarity scores. There are two main issues in IR
systems. The first is that the characterization of the user's information need is
not always clear and needs to be transformed in order to be understood by the
IR system; this is known as a query (a short document containing a few words)
[1]. The second problem is the structure of the information: there are no
standards or rules that control this structure, especially on the World Wide Web
(WWW), and each language has its own characteristics and semantics. In
addition, users need to find excellent information that suits their requirements,
and the time needed to retrieve information quickly must also be taken into
account. These issues point to a very important topic: Text Classification (TC).
A text classification system classifies all documents into a fixed number of
predefined categories based on their content. Text classification may be either
single-label, where exactly one category must be assigned to each document, or
multi-label, where one or more categories can be assigned to each document.
The main objective of using TC is therefore to make IR results better than they
would be without TC [2]. These advantages have led to the development of
automatic text and document classification systems, which are capable of
automatically organizing and classifying documents [3].

The classification process can be done manually or automatically. It is worth
noting that manual categorization is considered a difficult and complex task,
especially with huge amounts of information, because documents are classified
one by one by human experts, and the time to complete this mission is
considerable. On the other hand, with the speedy growth of online text
documents, automatic text categorization (TC) has become an essential tool for
handling text documents efficiently and effectively [4].
Text classification is the task of classifying a document under a predefined
category. More formally, if d_i is a document from the entire set of documents
D and (c_1, c_2, ..., c_n) is the set of all categories, then text classification
assigns one category c_j to a document d_i. Given the increasing amount of
Arabic information on the Internet, classifying documents manually is not
practical; automatic text classification has therefore become an essential task
that saves the human effort of performing manual text classification.
The three essential stages in a text classification system are:
Document indexing: one of the most substantial issues in TC, which includes
document representation and a term weighting scheme. The bag of words is the
most common way to represent the content of a text. This approach is simple
because it records only the frequency of each word in a document. Moreover,
for each predefined category, the synonyms and prefix words of the category
are found, which helps assign a document to that category based on a synonym
or prefix of a term. Some term weighting schemes are described in detail in
Chapter 3.
Classifier learning: several machine learning algorithms have been applied to
automatic text classification by supervised learning [5]. A supervised learning
algorithm finds a representation or judgment rule from an example set of
labelled documents for each class [6]. Examples include Naïve Bayes (NB)
[7-9], Support Vector Machines (SVM) [10-12], k-Nearest Neighbour (k-NN)
[13, 14], Decision Trees (DT), Rocchio [5], and voting methods.
Classifier evaluation: the effectiveness of each classifier is judged according to
the results it achieves. Standard evaluation measures such as recall, precision
and the F1-measure are used to evaluate the different classifiers.

1.2. Arabic Language
Applying text classification systems to the Arabic language is a challenging
task because Arabic has a very complex morphology [15]. The Arabic alphabet
consists of 28 letters:


The characters ( ) are the vowels and the rest of the letters are consonants.
Arabic letters can be written in different forms depending on their position in
the word (beginning, middle, or end). For example, the letter ( ) has several
shapes: ( ) if it appears at the beginning ( which means "Road" in English);
( ) if the letter appears in the middle ( which means "Surface" in English);
and ( ) if the letter appears at the end ( which means "Rubber" in English).
Furthermore, the Arabic language contains diacritics ( ) which are placed
above or below the letters. The diacritics (fathah, kasra, damma, sukun, double
fathah, double kasra, double damma and shada) are used to clarify the meaning
of words [16]. On top of that, when diacritics are not explicitly written, an
Arabic text can have several meanings; this ambiguity negatively affects text
classification. To avoid these problems, pre-processing can be applied to
Arabic text.
The Arabic language has a more complex morphology than English. Arabic is
written from right to left. Arabic words have two genders, feminine and
masculine; three numbers, singular, dual, and plural; and three grammatical
cases, nominative, accusative, and genitive. A noun takes the nominative case
when it is the subject, the accusative when it is the object of a verb, and the
genitive when it is the object of a preposition. In addition, Arabic sentences are
divided into three parts of speech: noun, verb, and particle. Noun and verb
stems are derived from a few thousand roots by infixing, for example, creating
words like (computer), (he calculates), and (we calculate) from the
root [17].
A noun is a name or a word that describes a person, thing, or idea.
Arabic verbs are similar to English verbs and are classified into perfect and
imperfect. The perfect tense denotes completed actions, while the imperfect
denotes incomplete actions. The imperfect tense has four moods: indicative,
subjunctive, jussive, and imperative [18].
Arabic particles include prepositions, adverbs, conjunctions, interrogative
particles, exceptions, and interjections.
Most Arabic words are derived from the pattern ( ); all words following the
same pattern have common properties and states. For example, the pattern ( )
indicates the subject of the verb, and the pattern ( ) represents the object of
the verb.
An Arabic adjective can also have many variants. When an adjective modifies
a noun in a phrase, the adjective agrees with the noun in gender, number, case,
and definiteness. An adjective has a masculine singular form such as (new),
a feminine singular form such as (new), a masculine plural form such as
(new), and a feminine plural form such as (new) [19].
In addition to the different forms of the Arabic word that result from the
derivational process, most connectors, conjunctions, prepositions, pronouns,
and possessive forms are attached to the Arabic surface form as prefixes and
suffixes. For instance, definite nouns are formed by attaching the article ( )
("the") to the immediate front of the noun. The conjunction word ( ) ("and")
is often attached to the following word. The letters ( , , , ) can be added to
the front of a word as prepositions. The suffix ( ) is attached to mark the
feminine gender of a word. Some suffixes are also added to represent the
possessive pronouns: ( ) for "her", ( ) for "my", and ( ) for "their" [19, 20].
In addition, Arabic has two kinds of plurals: sound plurals and broken plurals.
Sound plurals are formed by adding plural suffixes to singular nouns. The
plural suffix is ( ) for feminine nouns in all three grammatical cases, ( ) for
masculine nouns in the nominative case, and ( ) for masculine nouns in the
genitive and accusative cases. The formation of broken plurals is more complex
and often irregular, and is therefore difficult to predict; furthermore, broken
plurals are very common in Arabic. For example, the plural form of the noun
(child) is (children), which is formed by attaching the prefix and inserting
the infix. The plural form of the noun (book) is (books), which is formed by
deleting the infix. The plural form of (woman) is (women); here the plural
form is completely different from the singular form [19].





1.3. The Statement of the Problem
IR systems have been widely used to assist users with the discovery of useful
information on the Internet. Current IR systems are based on term similarity
and frequency between the query (the user's requirement) and the information
available on the Internet. However, IR ignores important semantic relationships
between them, and that ignorance makes the search operation slow and wastes
a lot of time. To overcome this problem, text classification (categorization) is
a solution.
Text classification techniques have been applied only a little to the Arabic
language compared to other languages. Unfortunately, there is no perfect
technique for classifying text, so researchers have been encouraged to develop
TC techniques using many different models and methods.
In this project, the Multinomial Mixture Model (MMM) has been suggested
and applied to classify Arabic documents. In addition, this experiment is
compared with other classifiers in order to clarify which model performs better
than the others.











1.4. Thesis Objective
Arabic text can be considered completely different from English text, with a
complex morphology. In this thesis, the Multinomial Mixture Model (MMM)
has been recommended and applied to classify Arabic documents. Moreover,
three different techniques are examined on Arabic text: the Rocchio algorithm,
traditional k-NN and Naïve Bayes.
The text classification system with these techniques has been evaluated using
the standard measures: recall, precision and F-measure. Moreover, the
effectiveness of each classifier is decided according to the results achieved.
Finally, the results of the MMM are compared with the other two algorithms
to determine the best information retrieval system for the Arabic language.

1.5. Summary
This chapter gave a short introduction to information retrieval (IR) systems.
It also focused on text categorization (TC) and described the most important
tasks of a text categorization system. After the introduction, the Arabic
language was described briefly. Moreover, the thesis's problem statement was
presented. Finally, the multinomial mixture model was adopted as the thesis
objective.





















2. Chapter Two: Literature Review












2.1. Literature Review
Text classification is defined as assigning new documents to a set of
pre-defined categories based on classification patterns [21, 22]. In recent years,
there has been an increasing amount of literature on the TC topic, and
researchers have shown increasing interest in continuing and developing this
research based on previous work.
2.1.1. Text Classification
Text classification techniques have been investigated and used in many
application areas, and many researchers have studied text classification using
different techniques.
2.1.1.1. Classification Based on Supervised Learning
The goal of classification methods is to assign class labels to unlabelled text
documents from a fixed number of predefined categories. Each document can
belong to multiple categories, exactly one category, or no category at all.
Supervised machine learning methods prescribe the input and output format.
The input to these methods is a set of objects (training data), and the output is
the classes to which these objects belong.
The key advantage of supervised learning methods over unsupervised methods
is that, by having clear knowledge of the classes the different objects belong
to, these algorithms can perform effective feature selection if that leads to
better prediction accuracy [23].
Sentiment classification can obviously be formulated as a supervised learning
problem with two class labels (positive and negative). Training and testing data
used in existing research are mostly product reviews, which is not surprising
given the above assumption. Since each review at a typical review site already
has a reviewer-assigned rating (e.g., 1-5 stars), training and testing data are
readily available. Typically, a review with 4-5 stars is considered a positive
review (thumbs-up), and a review with 1-2 stars is considered a negative
review (thumbs-down).
Sentiment classification is similar to, but also different from, classic
topic-based text classification, which classifies documents into predefined
topic classes (politics, sciences, sports, etc.). In topic-based classification,
topic-related words are important. In sentiment classification, topic-related
words are unimportant; instead, sentiment or opinion words that indicate
positive or negative opinions are important (e.g., great, excellent, amazing,
horrible, bad, worst).
Existing supervised learning methods, such as Naïve Bayes and Support Vector
Machines (SVM), can be readily applied to sentiment classification [24]. This
approach can be used to classify movie reviews into two classes (positive and
negative). It was shown that using unigrams (a bag of individual words) as
features in classification performed well with both Naïve Bayes and SVM.
Neutral reviews were not used in this work, making the problem easier. The
features used here are data attributes used in machine learning, not the object
features referred to in the previous section.
Subsequent research used many more kinds of features and techniques in
learning. As in most machine learning applications, the main task of sentiment
classification is to find a suitable set of features. Some example features used
in research, and possibly in practice, are the following [25].
Terms and their frequency: these features are individual words or word
n-grams and their frequency counts. Sometimes word positions may also be
considered. The TF-IDF weighting scheme from information retrieval may be
applied too. These features are also commonly used in traditional topic-based
text classification, and they have been shown to be quite effective in sentiment
classification as well.
Part-of-speech tags: much early research found that adjectives are important
indicators of subjectivity and opinions; therefore, adjectives have been treated
as special features.
Opinion words and phrases: opinion words are words that are commonly used
to express positive or negative sentiments. For example, beautiful, wonderful,
good, and amazing are positive opinion words, and bad, poor, and terrible are
negative opinion words. Although many opinion words are adjectives and
adverbs, nouns (rubbish, junk, crap) and verbs (hate, like) can also indicate
opinions. In addition to opinion words, there are also opinion phrases and
idioms ("cost someone an arm and a leg"). Opinion words and phrases are
helpful to sentiment analysis.
Syntactic dependency: word-dependency features generated from parsing or
dependency trees have also been tried by several researchers.
Negation: clearly, negation words are important, since their appearance often
changes the opinion orientation. For example, the sentence "I don't like this
camera" is negative. Negation words must be handled with care because not all
occurrences of such words mean negation; for example, "not" in "not only ...
but also" does not change the orientation direction.
Research has also predicted rating scores [24]. In this case, the problem is
formulated as a regression problem, since the rating scores are ordinal. Another
investigated research direction is transfer learning or domain adaptation. As
has been shown, sentiment classification is highly sensitive to the domain from
which the training data are extracted. A classifier trained using opinionated
texts from one domain often performs poorly when it is applied or tested on
opinionated texts from another domain. The reason is that words and even
language constructs used in different domains for expressing opinions can be
substantially different. Sometimes, the same word means positive in one
domain but negative in another [26]. For example, the adjective
"unpredictable" may have a negative orientation in a car review
("unpredictable steering"), but it could have a positive orientation in a movie
review ("unpredictable plot"). Therefore, domain adaptation is needed.
Existing research has used labelled data from one domain, unlabelled data from
the target domain, and general opinion words as features for adaptation [27].

2.1.1.2. Classification Based on Unsupervised Learning
Opinion words and phrases are the dominating indicators for sentiment
classification, so using unsupervised learning based on such words and phrases
is quite natural. The method used in [26] performs classification based on some
fixed syntactic phrases that are likely to be used to express opinions. The
algorithm consists of three steps:
Step 1: It extracts phrases containing adjectives or adverbs. The reason for
doing this is that research has shown that adjectives and adverbs are good
indicators of subjectivity and opinions. Although an isolated adjective may
indicate subjectivity, there may be insufficient context to determine its Opinion
Orientation (OO). Thus, the algorithm extracts two consecutive words, where
one member of the pair is an adjective/adverb and the other is a context word.
For example, in the sentence "This camera produces beautiful pictures",
"beautiful pictures" will be extracted as it satisfies the first pattern.

Step 2: It estimates the orientation of the extracted phrases using the Pointwise
Mutual Information (PMI) measure given in equations (1.1), (1.2) and (1.3)
[26]:

$$\mathrm{PMI}(term_1, term_2) = \log_2 \frac{P(term_1 \,\&\, term_2)}{P(term_1)\, P(term_2)} \qquad (1.1)$$

P(term1 & term2) is the co-occurrence probability of term1 and term2, and
P(term1)P(term2) is the probability that the two terms co-occur if they are
statistically independent. The ratio between P(term1 & term2) and
P(term1)P(term2) is a measure of the degree of statistical dependence between
them. The log of this ratio is the amount of information that we acquire about
the presence of one of the words when the other is observed.
The Opinion Orientation of a phrase, written SO (semantic orientation) in
equations (1.2) and (1.3), is computed based on its association with the positive
reference word "excellent" and its association with the negative reference word
"poor":

$$\mathrm{SO}(phrase) = \mathrm{PMI}(phrase, \text{"excellent"}) - \mathrm{PMI}(phrase, \text{"poor"}) \qquad (1.2)$$

The probabilities are calculated by issuing queries to a search engine and
collecting the number of hits.
For each search query, a search engine usually gives the number of documents
relevant to the query, which is the number of hits. Thus, by searching for the
two terms together and separately, we can estimate the probabilities in
equation (1.1). [26] used the AltaVista search engine because it has a NEAR
operator, which constrains the search to documents that contain the words
within ten words of one another, in either order. Let hits(query) be the number
of hits returned. Equation (1.2) can then be rewritten as:

$$\mathrm{SO}(phrase) = \log_2 \frac{\mathrm{hits}(phrase\ \mathrm{NEAR}\ \text{"excellent"}) \cdot \mathrm{hits}(\text{"poor"})}{\mathrm{hits}(phrase\ \mathrm{NEAR}\ \text{"poor"}) \cdot \mathrm{hits}(\text{"excellent"})} \qquad (1.3)$$

Step 3: Given a review, the algorithm computes the average SO of all phrases
in the review, and classifies the review as recommended if the average SO is
positive, and not recommended otherwise.
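
The following is a minimal sketch of Steps 2 and 3 in Python. The hits function is a placeholder for a search-engine hit-count API (the AltaVista NEAR queries used in [26] are no longer available), and the 0.01 smoothing constant is an assumption carried over from the original PMI-IR work to avoid division by zero.

```python
import math

def so(phrase, hits):
    # Equation (1.3): semantic orientation from search-engine hit counts.
    # hits(query) -> number of documents matching the query (assumed API).
    num = (hits(f'"{phrase}" NEAR "excellent"') + 0.01) * (hits('"poor"') + 0.01)
    den = (hits(f'"{phrase}" NEAR "poor"') + 0.01) * (hits('"excellent"') + 0.01)
    return math.log2(num / den)

def classify_review(phrases, hits):
    # Step 3: average the SO of all extracted phrases in the review.
    average = sum(so(p, hits) for p in phrases) / len(phrases)
    return "recommended" if average > 0 else "not recommended"
```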

Building a TC system for the Arabic language cannot be considered an easy
task compared with English, because the Arabic language has a very complex
morphology [15]. Moreover, testing is the most important stage in any IR
system: it is used to determine the efficiency of the system and helps establish
which system is better than another.

One study reviewed the key text classification techniques used in building a
text classification system, including text models, feature selection methods and
text classification algorithms. In addition, a text classification system based on
Mutual Information, the K-Nearest Neighbour algorithm and Support Vector
Machines was implemented. The data set was created from the famous
Reuters-21578 text classification collection. The experimental results showed
a classification accuracy rate of 91.1%, reported to be better performance than
using no feature selection, improving the classification rate. Moreover, the
SVM classifier achieved higher performance than the KNN classifier [28].

In 2011, a new feature selection method (Auxiliary Feature) [9] was applied in
practice. The enhanced performance of Naive Bayes for text classification was
demonstrated: an auxiliary feature method was proposed that first determines
features by an existing feature selection method and then selects an auxiliary
feature that can reclassify the text space with respect to the chosen features. To
evaluate this experiment, the data set chosen consisted of 30000 junk mails and
10000 normal mails from CCERT. The results of this study show that the
proposed method indeed improves the performance of the naive Bayes
classifier.

Feature subset selection (FSS) is an important step for effective text
classification (TC) systems, since it may have a great effect on the accuracy of
the classifier [29, 30]. There are many valuable studies that investigated FSS
metrics for English text classification tasks, and there are some works that
handle the FSS problem for Arabic text classification tasks.

In recent years, an empirical comparison of seventeen FSS metrics for the
Arabic TC task was carried out using an SVM classifier; the evaluation used an
Arabic corpus that consists of 7842 documents independently classified into
ten categories. The experimental results proved that the Chi-square and Fallout
FSS metrics work best for Arabic TC tasks [30].

An improved KNN algorithm for text classification was proposed, which
builds the classification model by combining a constrained one-pass clustering
algorithm with KNN text categorization. Although KNN is a simple and
effective method for text classification, it has three drawbacks: first, the
complexity of its sample similarity computation is huge; second, its
performance is easily affected by single training samples; third, KNN is
considered lazy learning because it does not build a classification model. To
overcome these drawbacks, the improved KNN algorithm was implemented;
this algorithm used the Vector Space Model (VSM) to represent the documents.
The results show that the INNTC classifier is much more effective and efficient
than KNN [14].

Gamon and Aue [31] implemented a novel projected-prototype based classifier
for text classification. The basic idea behind the algorithm is that each
document category is modelled by a set of prototypes and their individual term
subspaces of the document category. The classifier was tested using two
English data sets and its performance was compared with five other classifiers:
SVM, three prototype-based classifiers, KNN, KNN-model and a centroid
classifier. The experimental results showed that the projected-prototype based
classifier achieved higher classification accuracy at a lower computation cost
than the traditional prototype-based classifier, especially for data that includes
interfering document classes.

2.1.2. Arabic Text Classification
The studies carried out on Arabic text classification are very few compared to
other languages (like English), because the Arabic language has an extremely
rich morphology and complex orthography. However, some related work has
been proposed to classify Arabic documents:


Three classifiers, KNN, NB and a distance-based classifier, were implemented
for Arabic text classification. Every category was represented as a vector of
keywords in the distance-based and KNN classifiers; with NB, on the other
hand, the vectors were bags of words. The Dice measure was used to calculate
the similarity between vectors. The accuracy of the classifiers was tested using
an Arabic text corpus collected from online magazines and newspapers.
According to the results, the NB classifier does better than the other two
classifiers [32].

Another researcher implemented the SVM algorithm for Arabic text
classification. The paper pointed out that the SVM classifier achieved better
results than other classifiers (Naïve Bayes and KNN). In addition, light
stemming for Arabic TC tasks was evaluated with the SVM classifier; as a
result, light stemming did not enhance the performance of the Arabic SVM
text classifier. On the other hand, feature subset selection (FSS) was
implemented and improved the performance of the Arabic SVM text classifier,
with the best results achieved by two FSS methods (Chi-square and NGL).
Finally, a new Ant-Colony-based FSS algorithm (ACO) was applied and
achieved the greatest TC effectiveness of the six FSS methods [12].
The main objective of [33] was to compare automatic text classification using
kNN, Rocchio and NB classifiers on the Arabic language. The system was
tested using a corpus of 1445 Arabic text documents. Two models were used:
the first was the vector space model, used to implement the KNN and Rocchio
classifiers, where each document was represented as a vector of terms; the
second was a probabilistic model, used to implement the NB classifier, in
which the probability of a document belonging to each class was calculated
and the document was assigned to the class with the maximum probability. The
experiments showed that Naïve Bayes is the best performer, followed by kNN
and Rocchio.

A comparison between two probabilistic classifiers was reported. The
researchers found that the multinomial model gave better results than the
multivariate Bernoulli model at large vocabulary sizes; in contrast, when the
vocabulary size is smaller, the multivariate Bernoulli model outperforms the
multinomial model. The results were tested on five real-world corpora. Their
experimental evaluation proved that the multinomial model reduced error by
an average of 27%, and sometimes by more than 50% [34].

Probabilistic generative models called parametric mixture models (PMMs)
were implemented. The main goal of PMMs was to address multiclass,
multi-labelled text categorization problems. PMMs achieved good results
compared with binary classification, because PMMs can simultaneously detect
multiple categories of a text instead of depending on binary judgments.
Furthermore, the PMM approach was applied to World Wide Web pages and
showed its efficiency [27].

A comparative study of generative models for document clustering using the
multinomial model was presented. The comparison of this model with two
other probabilistic models, the multivariate Bernoulli and the von Mises-Fisher
(VMF) model, was performed by applying clustering. The Bernoulli model
was, unfortunately, the worst for text clustering; on the other hand, the VMF
model produced better clustering results than both the Bernoulli and
multinomial models.

In the literature, a novel mixture model method for text clustering named the
multinomial mixture model with feature selection (M3FS) was proposed. The
M3FS method used the MMM instead of Gaussian mixtures to improve text
clustering tasks. Prior studies noted that, with no labels available in
unsupervised text clustering, feature selection is a hard problem; to overcome
this problem, M3FS was proposed for text clustering. The results demonstrate
that the M3FS method has good clustering performance and feature selection
capability [34].

Another work discussed two problems: one is the many irrelevant features that
may affect the speed and also compromise the accuracy of the learning
algorithm; the second is the presence of outliers, which affects the resulting
model's parameters. For this reason, the researchers suggested an algorithm
that partitions a given data set without a priori information about the number
of clusters, together with a novel statistical mixture model based on the Gamma
distribution, which makes explicit what data or features have to be ignored and
what information has to be retained. The performance of this finite mixture
model was evaluated on different applications, with analysis of data, real data
and object-shape clustering. The experiments proved that this approach has
excellent modelling capabilities and that feature selection mixed with outlier
detection significantly influences clustering performance [24].

The history of naive Bayes in information retrieval has also been discussed,
together with a theoretical comparison of the multinomial and the multivariate
Bernoulli models (the latter also called the binary independence model) [31].

Compared to Indo-European languages (like English), the Arabic language has
an extremely rich morphology and a complex orthography. This is one of the
main reasons [17, 34, 35] behind the lack of research in the field of Arabic text
classification. However, many machine learning approaches have been
proposed to classify Arabic documents: the Support Vector Machine (SVM)
classifier with the Chi-square feature extraction method [35], the Naïve
Bayesian method, k-Nearest Neighbour distance-based classifiers, and the
Rocchio algorithm [31].

Sawaf, Zaplo and Ney [15] used the maximum entropy method for Arabic
document clustering. Initially, documents were randomly assigned to clusters;
in subsequent iterations, documents were shifted from one cluster to another if
an improvement was gained, and the algorithm terminated when no further
improvement could be achieved. Their text classification method is based on
unsupervised learning.

Duwairi [17] proposed a distance-based classifier for Arabic text classification
tasks, where the Dice measure was used as a similarity measure. In Duwairi's
work, each category was represented as a vector of words. In the training
phase, the text classifier scanned training documents to extract features that
best capture inherent category-specific properties. Documents were classified
on the basis of their closeness to the feature vectors of the categories.

El-Halees [34] implemented a maximum entropy based classifier to classify
Arabic documents. Compared with other text classification systems (such as
those of El-Kourdi et al. and Sawaf et al.), the overall performance of the
system was good (in the comparisons, El-Halees used the results as recorded
in the published papers mentioned above).

Hmeidi, Hawashin and El-Qawasmeh [13] reported a comparative study of
SVM and K-Nearest Neighbour (KNN) classifiers on Arabic text classification
tasks. They concluded that the SVM classifier shows a better micro-averaged
F1-measure.

Al-Saleem [28] proposed automated Arabic text classification using SVM and
NB classification methods. These methods were investigated on different
Arabic datasets, and several text evaluation measures were used. The
experimental results on different Arabic text classification datasets showed
that the SVM algorithm outperforms NB with regard to all measures (recall,
precision and F-measure); the F-measure of SVM was 77.8%, versus 74% for
NB.

Al-Diabat et al. [12] investigated the problem of Arabic Text Classification
(ATC) using rule-based classification approaches. The performance of
different classification approaches that produce simple "IF-Then" knowledge
was evaluated to find the most appropriate one for the ATC problem. Four
rule-based classification algorithms were investigated: One Rule, rule
induction (RIPPER), decision tree (C4.5), and a hybrid (PART). An Arabic
data collection with 1526 text documents belonging to 6 categories was used.
The results showed that the hybrid PART approach outperforms the rest of the
algorithms; the average precision was 61.9% and the average recall 62.3%.

Wahbeh et al. [19] compared three text classification techniques: SVM, NB,
and C4.5. A set of Arabic documents collected from different websites and
divided into four categories was used, and the WEKA toolkit was used for
running these classifiers; a word representation was implemented to represent
the documents. Another project proposed an approach for ATC using
association rule mining, which facilitated the discovery of association rules for
building a classification model for ATC. Three classification methods that use
association rules were applied: ordered decision list, weighted rules, and
majority voting. The experimental results showed that the majority voting
method gave better results than the other methods.
A novel batch-mode active learning method using SVM for Arabic text
classification has also been presented; not many studies have been done in this
area for the Arabic language. The purpose of applying active learning is to
reduce the amount of data needed for the training phase, so the cost of manually
annotating the data is lower; the learning process can also be sped up, since the
active method is allowed to choose the data from which it learns [17].

Since feature selection is a key factor in the accuracy and effectiveness of the
resulting classification, one author used Binary Particle Swarm Optimisation
(BPSO) as the feature selection method for Arabic text classification. The aim
of applying BPSO/KNN is to find a good subset of features to facilitate the task
of Arabic text classification. SVM, Naïve Bayes and C4.5 decision trees were
applied as classification algorithms. The suggested method was effective and
achieved satisfactory classification accuracy [25].

Multiword features have also been implemented to improve Arabic text
classification. Multiword features are displayed as a mixture of words
appearing within windows of varying size. Multiword features were applied
with two similarity functions, the Dice similarity function and the cosine
similarity function, to improve the outcome of Arabic text classification.
According to the results achieved, the Dice function performs better than the
cosine function. With the Dice similarity function, the frequencies of the
features in the document are ignored and only their existence is recognized [11].
Another investigation concentrated on single-label assignment. The goal of
that paper was to present and compare results obtained on a Saudi newspaper
Arabic text collection using the SVM and NB algorithms. The experiments
show that the SVM classifier achieved better results than the NB classifier [28].

Latent Dirichlet Allocation (LDA) has been proposed as a text feature. LDA
was used to index and represent Arabic texts. The main idea behind LDA is
that documents are represented as random mixtures over latent topics, where
each topic is described by a distribution over words. SVM was used to perform
the classification task. The LDA-SVM algorithm achieved high effectiveness
for Arabic text classification, exceeding SVM without LDA, Naïve Bayes and
KNN classifiers [20].


2.2. Summary
In Chapter 2, different text classification algorithms were described. Some
works address text classification in general and others specifically the Arabic
language. In addition, some papers on the multinomial mixture model were
presented.

































3. Chapter Three: Methodology














3.1. Introduction
There are many approaches that can be used in text classification. KNN,
Rocchio and Naïve Bayes using the MMM have been implemented, and these
algorithms have been applied to the same datasets.
The main aim of applying TC to the Arabic language is to improve the
performance of information retrieval compared to retrieval without TC.
Several steps are needed to implement the TC task; all phases are explained in
section 3.2.
An IR process starts with the submission of a query, which describes a user's
topic, and finishes with a set of ranked results estimated by the IR system's
ranking scheme to be the most relevant to the query [33].
Recall and precision are well-known measures that can be used to evaluate any
IR system; the efficiency of the system can be determined by using those
measures.
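
As a minimal sketch, the per-category versions of these measures can be computed from confusion-matrix counts (true positives tp, false positives fp, false negatives fn); the macro-averaged variants reported later are simply the means of these values over all categories.

```python
def precision(tp, fp):
    # Pr = tp / (tp + fp): fraction of documents assigned to a category
    # that truly belong to it.
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    # Re = tp / (tp + fn): fraction of a category's documents that the
    # classifier actually found.
    return tp / (tp + fn) if tp + fn else 0.0

def f1(tp, fp, fn):
    # F1: harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0
```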
This chapter is divided into three main sections. Section 3.1 gives an overview
of the project, section 3.2 presents the main text classification system
architecture, and section 3.3 gives a short summary of chapter three.
3.2. System Architecture
The text classification technique has been implemented in several phases,
executed sequentially to facilitate the TC task. Uncategorized documents were
pre-processed by removing punctuation marks and stopwords. Every document
is then represented either as a vector of words only, or as a vector of words,
their frequencies and the number of documents in which these words appear
(inverse document frequency). Stemming was used to decrease the
dimensionality of the documents' feature vectors. The accuracy of the
classifier is computed using recall, precision and F-measure.

Figure 3-1 below shows the main steps in the classification system.

Figure 3-1 Text Classification Architecture

3.2.1. Arabic Corpus
The accuracy of the classifiers was tested using a corpus containing 1445
documents. The documents are divided into nine categories: Computer,
Economics, Education, Sport, Politics, Engineering, Medicine, Law, and
Religion. Moreover, some of the documents are used for training the classifiers
and some for testing them. The testing set consists of the input documents that
need to be classified; the training set is a set of documents or topics tagged with
the correct classes. The corpus and categories are shown and explained in more
detail in Chapter 4. In addition, Figure 3-2 shows the pre-processing steps:



Figure 3-2 Pre-processing Steps

3.2.2. Pre-processing
Pre-processing can be defined as the process of filtering out words that may
not add any meaning to a text and might not be useful in information retrieval
systems; such words are called stopwords [30]. The purpose of applying
pre-processing is to transform documents into a representation suitable for the
classification task. It also reduces the size of the information, which may make
the search operation faster. Pre-processing is done as follows:
- Documents in formats such as HTML, SGML and XML are converted to
plain text format.
- Tokenization: tokenization divides the document into a set of tokens (words).
- Digits and punctuation marks are removed from each document.
- Normalization: a very essential phase that reduces the many forms of words
that have the same meaning but are written differently. Arabic has a very
common problem where a single word can be written in many forms, like
- - (which mean "start" in English).


- Stopword removal: there are two kinds of terms in any document. The first
kind, called stopwords, occur commonly in all documents and may not give
any meaning to a document; the second kind can be described as keywords or
features. Stopwords (such as punctuation marks, formatting tags, prepositions,
pronouns, conjunctions and auxiliary verbs) are removed to reduce the text size
and save processing time. Moreover, removing these high-frequency words is
essential because they may cause documents to be misclassified.

- Stemming: stemming is another common pre-processing step. The stemming
phase in Arabic is more complex than in English (for instance, an Arabic
gendered word has forms for singular, dual and plural). The purpose of
decreasing the size of the initial feature set, by removing misspellings and
merging words with the same stem, is to enhance the performance of the
information retrieval system. There are several different approaches to
stemming: the root-based stemmer, the light stemmer, and the statistical
stemmer. The light stemmer has been used in this thesis:

Step 1: Normalization
- Remove diacritics (primarily weak vowels) except shada.
- Replace the alef variants with bare alef.
- Replace final alef maqsura with yeh.
- Replace final teh marbuta with heh.
Step 2: Waw Removal
Remove the initial waw ("and") if the remainder of the word is 3 or more
characters long.
Step 3: Prefix Removal
Remove any of the definite-article prefixes if this leaves 3 or more characters.
Step 4: Suffix Removal
The suffixes are listed in Table 3-1; remove any suffix found at the end of the
word if this leaves 3 or more characters.

Table 3-1 Strings removed by light stemming
Prefixes , , , ,
Suffixes , , , , , , ,
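
A minimal sketch of Steps 1-4 in Python follows. Since the Arabic strings in Table 3-1 did not survive extraction, the affix lists below are an assumption taken from the standard Larkey light10 stemmer, which these steps otherwise match.

```python
import re

# Assumed light10 affix lists (the original strings in Table 3-1 were lost).
PREFIXES = ['ال', 'وال', 'بال', 'كال', 'فال']
SUFFIXES = ['ها', 'ان', 'ات', 'ون', 'ين', 'يه', 'ية', 'ه', 'ة', 'ي']

def light_stem(word):
    # Step 1: normalization.
    word = re.sub(r'[\u064B-\u0650\u0652]', '', word)  # diacritics except shada (U+0651)
    word = re.sub(r'[آأإ]', 'ا', word)                  # alef variants -> bare alef
    word = re.sub(r'ى$', 'ي', word)                     # final alef maqsura -> yeh
    word = re.sub(r'ة$', 'ه', word)                     # final teh marbuta -> heh
    # Step 2: strip the initial waw ("and") if 3+ characters remain.
    if word.startswith('و') and len(word) >= 4:
        word = word[1:]
    # Step 3: strip one definite-article prefix if 3+ characters remain.
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    # Step 4: strip suffixes, keeping at least 3 characters after each removal.
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
    return word
```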

Indexing:
Document indexing is one of the most substantial issues in TC; it includes
document representation and a term weighting scheme. The bag of words is
the most common way to represent the content of a text. This approach is
simple because it records only the frequency of each word in a document.
Moreover, for each predefined category, the synonyms and prefix words of the
category are found, which helps assign a document to that category based on
a synonym or prefix of a term.
Several measures have been applied to calculate term weights:
Term Frequency (TF): the simplest measure to weight each term in a text. The
drawback of TF is that it is concerned only with term occurrence within a text;
according to the results achieved, this improves recall but does not improve
precision.
Inverse Document Frequency (IDF): the other popular weighting measure. The
main idea of IDF is to focus on the terms that rarely occur in a collection of
texts; this improves precision without enhancing recall.
TF.IDF: since term weighting affects the evaluation of text classification,
TF.IDF combines the two weighting measures TF and IDF to enhance both
recall and precision, and thereby improve the text classification result. On the
other hand, with TF.IDF, when a new document arrives, the weighting factors
of all documents must be recalculated, since they depend on the number of
documents.
TF-IDF has been used in this work as one of the most popular weighting
schemes. It considers not only term frequencies in a document, but also the
frequencies of a term in the entire collection of documents. Classic TF-IDF
assigns to term t a weight in document d as shown in equations 3.1 and 3.2 [33]:

$$\mathrm{TFIDF}(i, j) = \mathrm{TF}(i, j) \cdot \mathrm{IDF}(i) \qquad (3.1)$$

Thus, TF*IDF weighting assigns a high degree of importance to terms
occurring frequently in only a few documents of a collection. The inverse
document frequency IDF for term T_i is calculated as follows:

$$\mathrm{IDF}(i) = \log \frac{N}{\mathrm{DF}(i)} \qquad (3.2)$$

where N is the number of documents in the collection and DF(i) (the document
frequency of term T_i) is the number of documents in which T_i occurs.
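
A minimal sketch of equations 3.1 and 3.2 over a toy collection of pre-processed token lists (not the thesis corpus) might look as follows; terms occurring frequently in only a few documents receive the highest weights.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists (already tokenized, normalized and stemmed).
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # DF(i)
    vectors = []
    for doc in docs:
        tf = Counter(doc)                                    # TF(i, j)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

# Toy example: 'market' occurs in 2 of 3 documents, so its IDF is log(3/2).
weights = tfidf_vectors([['economy', 'market', 'market'],
                         ['sport', 'match'],
                         ['market', 'sport']])
```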

Feature Selection:
Feature subset selection (FSS) is one of the important pre-processing steps in
machine learning and essentially a task for text classification. Feature selection
methods study how to choose a subset of attributes that are used to construct
models describing the data [15]. Many FSS methods have been applied to
Arabic text [12, 26].
According to previous related work, the FSS approach has been proven to
provide several advantages for a text classification system: it is very effective
in reducing dimensionality, removing irrelevant and redundant terms from
documents and decreasing computational complexity. In addition, FSS
increases learning accuracy and improves classification efficiency and
scalability by making the classifier simpler and faster to build.
Many FSS algorithms have been tested and compared in text classification
systems. For example, Chi-square and Fallout achieved satisfactory results in
Arabic TC tasks, and Ant Colony Optimization (ACO), an optimization
algorithm derived from the study of real ant colonies, is one of the promising
approaches to better feature selection.
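
As an illustration only (the thesis does not give the formula here), a common form of the Chi-square score for a term t and category c uses a 2x2 contingency table: a = documents in c containing t, b = documents outside c containing t, c_ = documents in c without t, and d = documents outside c without t.

```python
def chi_square(a, b, c_, d):
    # chi2(t, c) = N * (a*d - c_*b)^2 / ((a + c_)(b + d)(a + b)(c_ + d))
    n = a + b + c_ + d
    denom = (a + c_) * (b + d) * (a + b) * (c_ + d)
    return n * (a * d - c_ * b) ** 2 / denom if denom else 0.0

# Terms are ranked by their score for each category, and only the
# top-scoring subset is kept as features.
```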
To classify a new document, it is pre-processed by removing punctuation
marks and stopwords, followed by extracting the roots of the remaining
keywords. The feature vector of the new document is compared with the
feature vectors of all categories; ultimately, the document is assigned to the
category with the maximum similarity.

3.2.3. Classifiers
Many types of classifiers have been applied and executed in the text
classification area. The results differ considerably from one to another, since
every classifier has its own specific algorithm. Several kinds of classifiers are
explained below, with their advantages and drawbacks according to the results
achieved.

3.2.3.1. Support Vector Machine (SVM)
The support vector machine has been widely applied in the text classification
area [12, 20, 24]. The SVM classifier is a supervised machine learning
technique. A document is represented as a vector of terms (words), where each
dimension corresponds to a separate term. When a term occurs in the document,
the corresponding vector component is non-zero and can be calculated using a
weighting method such as tf*idf. In linear classification, SVM creates a
hyperplane that divides the data into two sets with the maximum margin; the
maximum-margin hyperplane is the one whose distance to the closest points on
the two sides is equal and as large as possible. The SVM decision function is
shown in equation 3.3 [24]:

f(x) = sign(w · x + b)        (3.3)

where w is a weight vector in R^n. SVM finds the hyperplane w · x + b = 0 that
separates the space R^n into two half-spaces with the maximum margin [24].
The SVM classifier is a simple and effective algorithm for the classification
task, and it has the capacity to handle a huge number of features. On the
other hand, with the SVM classifier, documents containing similar contexts but
a different term vocabulary are not classified into the same category.
Moreover, in the vector representation the order in which the terms appear in
the document is lost.
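
A minimal sketch of a linear SVM text classifier over tf*idf vectors, using
scikit-learn, is shown below; the toy documents and labels are assumptions for
the example:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    docs = ["market oil trade", "goal match team", "oil price economy", "league goal sport"]
    labels = ["economy", "sport", "economy", "sport"]

    # tf*idf document vectors feeding a linear maximum-margin classifier (equation 3.3)
    clf = make_pipeline(TfidfVectorizer(), LinearSVC())
    clf.fit(docs, labels)
    print(clf.predict(["oil market report"]))   # -> ['economy']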
The goal of this project is to compare three different classification
techniques on the Arabic language, namely kNN, Rocchio, and Naive Bayes using
a multinomial model.

3.2.3.2. K-Nearest Neighbour (KNN)
The K-nearest neighbour (KNN) classifier is one of the best-known text
classification techniques. The principle of KNN is that documents which are
close to each other in the vector space belong to the same class; the
essential idea is to identify the class of a document based on a similarity
measure.
KNN has several advantages: it is simple, non-parametric, and shows very good
performance on text categorization tasks for the Arabic language. On the other
hand, it has drawbacks: it is difficult to find the optimal value of k, and
classification time is long because the distance from each query instance to
all training samples must be computed. In addition, this classifier is called
a lazy learning system, because it does not involve a true training phase [13].
The major steps to apply the K-nearest neighbour classifier are:
1. Pre-process the documents in the training set.
2. Choose the parameter k, which is the number of nearest neighbours of d
considered in the training data.
3. Determine the distance between the testing document d and each of the
training documents (previously classified documents).
4. Rank the distances and select as neighbours the k training documents with
the minimum distances.
To classify an unknown document, the KNN classifier ranks the document's
neighbours among the training documents and uses the class labels of the k
most similar neighbours. The similarity score of each nearest-neighbour
document to the test document is used as the weight of that neighbour's class.
If a specific category is shared by more than one of the k nearest neighbours,
then the sum of the similarity scores of those neighbours is taken as the
weight of that shared category [17]. A sketch of this weighted voting follows.
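
The sketch below implements the weighted voting just described, using cosine
similarity over document vectors; the toy data and the value of k are
assumptions:

    import numpy as np

    def knn_classify(x, train_X, train_y, k=5):
        # Cosine similarity of the query x to every training document
        sims = train_X @ x / (np.linalg.norm(train_X, axis=1) * np.linalg.norm(x))
        votes = {}
        for i in np.argsort(sims)[-k:]:             # the k most similar documents
            votes[train_y[i]] = votes.get(train_y[i], 0.0) + sims[i]
        return max(votes, key=votes.get)            # class with the largest summed similarity

    train_X = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]], dtype=float)
    train_y = ["white", "black", "white"]
    print(knn_classify(np.array([1.0, 0.0, 0.0]), train_X, train_y, k=2))  # -> white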

An example of KNN classification is shown in figure 3-3.a. The document X is
assumed to be the test sample, which should be classified either into the
first category (white circles) or into the second category (black circles).
If k = 1, document X is classified into the white category, because there is
one white circle and no black circle inside the inner circle. If k = 5, it is
classified into the black category, because the number of black circles is
greater than the number of white circles; majority voting is used here to
determine the category of the unclassified document. On the other hand, if
k = 10 the document would be tied between the two categories (black and
white). To avoid this problem, the similarity is determined according to the
total weight of the two categories, as shown in figure 3-3.b.

Figure 3-3 An example of KNN classification

If k = 5, the document is classified into the white category, because the
summed weight of the white category (9) is greater than the black category's
weight (8).
3.2.3.3. Rocchio:
The Rocchio relevance feedback algorithm is one of the most popular and widely
applied learning methods from information retrieval. In addition, Rocchio is
considered easy to implement and very fast compared to KNN [33]. The basic
idea of the Rocchio approach is to use a vector to represent each document and
each class. The vector representing a class c_j is called its prototype or
centroid [5, 33].
The prototype of each class is calculated by subtracting the average of all
documents that do not appear in class C_j from the average of all documents
that appear in class C_j:

c_j = (α / |C_j|) · Σ_{d ∈ C_j} d  −  (β / |D − C_j|) · Σ_{d ∈ D − C_j} d        (3.4)

where α and β are parameters that adjust the relative impact of positive and
negative training examples, and D is the set of all training documents.
In practice, for text classification Rocchio calculates the similarity between
the test document and each of the prototype vectors; the test document is then
assigned to the category with the maximum similarity score.
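
A minimal sketch of Rocchio classification following equation 3.4 is given
below; the values of alpha and beta are assumptions, as the thesis does not
report the ones it used:

    import numpy as np

    def rocchio_prototypes(X, y, alpha=1.0, beta=0.1):
        # One prototype per class: weighted positive centroid minus weighted negative centroid
        prototypes = {}
        for c in set(y):
            pos = X[[i for i, lab in enumerate(y) if lab == c]]
            neg = X[[i for i, lab in enumerate(y) if lab != c]]
            prototypes[c] = alpha * pos.mean(axis=0) - beta * neg.mean(axis=0)
        return prototypes

    def rocchio_classify(x, prototypes):
        # Assign x to the class whose prototype is most cosine-similar
        def cos(a, b):
            return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        return max(prototypes, key=lambda c: cos(x, prototypes[c]))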
3.2.3.4. Naive Bayes:
The Naive Bayes classifier uses a probabilistic model of text and achieves
good performance on the TC task for Arabic text [18].
NB is a simple probabilistic classifier based on applying Bayes' theorem; the
conditional probability P(c_j | d_i) of each class can be computed from
equations 3.5 and 3.6 [8, 9]:


P(c_j | d_i) = P(c_j) · P(d_i | c_j) / P(d_i)        (3.5)

where P(c_j) is the prior probability of a document occurring in class c_j.
Frequently, each document d_i in text classification is represented as a
vector of words (v_1, v_2, ..., v_t); the above equation then becomes:


P(c_j | d_i) = P(c_j) · ∏_{k=1}^{t} P(v_k | c_j) / P(d_i)        (3.6)

Naive Bayes is frequently used in text classification owing to its speed and
simplicity. There are two event models of Naive Bayes: the multinomial model
and the Bernoulli model [34]. In the Bernoulli model, a test document is
represented by binary occurrence information and the number of occurrences is
ignored, whereas the multinomial model keeps track of multiple occurrences [35].

3.2.3.5. Multinomial Mixture Model
It is necessary to clarify exactly what is meant by MMM. It models the
distribution of words in a document as a multinomial: a document is treated as
a sequence of words, and each word position is assumed to be generated
independently of every other [30]. In text classification, the use of
class-conditional multinomial mixtures can be seen as a generalization of the
Naive Bayes text classifier that relaxes its (class-conditional feature)
independence assumption [29]. When a test document is classified, an MMM keeps
track of multiple occurrences, in contrast to a model such as the Bernoulli
model [31], which uses binary occurrence information and ignores the number of
occurrences. Because an MMM retains the occurrence information of all words
(frequency and position), classification proceeds according to equations 3.7,
3.8, 3.9, 3.10 and 3.11 [35].


P(c_j | d_i) = P(c_j) · P(d_i | c_j) / P(d_i)        (3.7)

P(c_j) = N_j / N        (3.8)

where N_j is the number of documents in class c_j and N is the number of
documents in the collection.

P(c_j | d_i) = P(c_j) · ∏_{k=1}^{t} P(w_k | c_j) / P(d_i)        (3.9)

P(w | c_j) = (count(w, c_j) + 1) / (count(c_j) + |V|)        (3.10)

where count(w, c_j) is the frequency of word w in class c_j, count(c_j) is the
total number of words in class c_j, and |V| is the total number of distinct
words in the collection (the vocabulary size).
The Bayes classifier computes the posterior probability of document d_i
falling into each class separately and assigns the document to the class with
the highest probability, that is:

c_optimal = argmax_{1 ≤ j ≤ |C|} P(c_j | d_i)        (3.11)

where |C| is the total number of classes.
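
Equations 3.8-3.11 can be sketched compactly in Python as below, with log
probabilities to avoid floating-point underflow; the toy training data is an
assumption:

    import math
    from collections import Counter, defaultdict

    def train_mnb(docs, labels):
        # Priors (eq. 3.8) and per-class word counts for the smoothed likelihoods (eq. 3.10)
        N = len(docs)
        prior, words, total, vocab = Counter(), defaultdict(Counter), Counter(), set()
        for doc, c in zip(docs, labels):
            prior[c] += 1
            for w in doc:
                words[c][w] += 1
                total[c] += 1
                vocab.add(w)
        return {c: n / N for c, n in prior.items()}, words, total, len(vocab)

    def classify_mnb(doc, model):
        # Highest-posterior class (eq. 3.11); every occurrence of a word contributes
        prior, words, total, V = model
        def score(c):
            return math.log(prior[c]) + sum(
                math.log((words[c][w] + 1) / (total[c] + V)) for w in doc)
        return max(prior, key=score)

    model = train_mnb([["goal", "goal", "match"], ["oil", "market"]], ["sport", "economy"])
    print(classify_mnb(["goal", "team"], model))   # -> sport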

3.2.4. Evaluation
There are many retrieval systems on the market, but which one is the best?
That depends on the results each of them produces. An important issue for
information retrieval systems is the notion of relevance: the purpose of an
information retrieval system is to retrieve all the relevant documents
(recall) and no non-relevant documents (precision). Recall and precision are
defined as follows:
Precision: the ability to retrieve top-ranked documents that are mostly
relevant.

Precision = (number of relevant documents retrieved) / (total number of documents retrieved)        (3.12)

The maximum (and optimal) precision value is 100%; the worst possible
precision, 0%, occurs when not a single retrieved document is relevant.

Recall: the ability of the search to find all of the relevant items in the
corpus.

Recall = (number of relevant documents retrieved) / (total number of relevant documents)        (3.13)

A perfect information retrieval system is achieved when both recall and
precision equal one.
F1-measure: a measure of effectiveness that combines the contributions of
precision and recall. The well-known F1 measure is used to test the
performance of information retrieval systems [33] and is defined as:

F1 = (2 · Pr · Re) / (Pr + Re)        (3.14)

Fallout: another measure that can be used to evaluate information retrieval
systems. Although recall and precision are considered good evaluation
measures, they do not take the number of irrelevant documents in the
collection into account; recall is undefined when there is no relevant
document in the collection, and precision is undefined when no document is
retrieved. Fallout, by contrast, does take the number of irrelevant documents
in the collection into account: it is the fraction of irrelevant documents
that are retrieved, so a good system should have high recall and low fallout.
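
The four measures can be computed from a contingency table as in the sketch
below; the counts are invented for illustration:

    def evaluate(tp, fp, fn, tn):
        # Precision, recall and F1 follow equations 3.12-3.14; fallout is fp / (fp + tn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        fallout = fp / (fp + tn)
        return precision, recall, f1, fallout

    print(evaluate(tp=40, fp=10, fn=20, tn=130))   # (0.8, 0.667, 0.727, 0.071)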

3.3. Summary
This chapter gives an introduction to information retrieval and describes the
common tasks of a TC system. Using a multinomial mixture model as the machine
learning algorithm is nowadays a popular approach. In the rest of the chapter,
the main kinds of TC algorithms used in this work have been described briefly.

























4. Chapter Four: Experiments and Evaluation













4.1. Introduction
Automatic Text Classification is defined as classifying unlabelled documents
into predefined categories based on their contents. It has become an important
topic owing to the increased number of documents on the internet that people
have to deal with daily, which has led to an urgent need to organize them. In
this chapter, the experiments are carried out and the performance of the
Rocchio algorithm, the traditional k-NN, and Naive Bayes using the MMM
classifier is documented.
These classifiers are evaluated with several measures in order to determine
whether Naive Bayes using MMM outperforms the other classifiers. The rest of
this chapter is organized as follows: section 4.2 discusses the preparation of
the data set; section 4.3 lists the performance measures; section 4.4
discusses the evaluation results; section 4.5 discusses the results of MMM
with 5070 documents; and section 4.6 gives the summary.
4.2. Data Set Preparation
The corpus has been downloaded from [34]. The documents are classified into
nine categories; the categories and the number of documents in each of them
appear in table 4-1. The total number of documents is 1445, and the documents
vary in length. The nine categories are: Computer, Economics, Education,
Sport, Politics, Engineer, Medicine, Law, and Religion.
After pre-processing all the documents, a copy of the pre-processed documents
was converted into the Attribute-Relation File Format (ARFF) in order to be
suitable for the Weka tool; a minimal sketch of such a conversion is shown
below.
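
The attribute layout in the sketch (a single string attribute plus a nominal
class attribute) is an assumption, as the thesis does not specify the exact
ARFF schema it used:

    def write_arff(path, docs, labels, categories):
        # docs: pre-processed document strings; labels: their categories
        with open(path, "w", encoding="utf-8") as f:
            f.write("@RELATION arabic_corpus\n\n")
            f.write("@ATTRIBUTE text STRING\n")
            f.write("@ATTRIBUTE class {%s}\n\n" % ",".join(categories))
            f.write("@DATA\n")
            for doc, label in zip(docs, labels):
                f.write("'%s',%s\n" % (doc.replace("'", "\\'"), label))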


Table 4-1 Number of Documents for each Category

No.  Category    Number
1    Medicine    232
2    Economics   220
3    Religion    227
4    Sport       232
5    Politics    184
6    Engineer    115
7    Law         97
8    Computer    70
9    Education   68


4.3. Performance Measures:
The performance of a text classification algorithm means its computational
efficiency and its classification effectiveness. When a large number of
documents is categorized into many categories, the efficiency of text
classification must be taken into account. The effectiveness of text
classification is measured by precision and recall.
Precision and recall are defined as follows:

Recall = tp / (tp + fn),   tp + fn > 0        (4.1)

Precision = tp / (tp + fp),   tp + fp > 0        (4.2)

where tp counts the documents that the classifier correctly assigned to a
category, fp counts the documents that the classifier incorrectly assigned to
a category, fn counts the documents that belong to a category but were not
assigned to it, and tn counts the documents that were correctly not assigned,
as shown in table 4-2.

Table 4-2 Confusion Matrix for Performance Measures

Classifier Decision     Correct Decision by Expert
                        YES is correct    NO is incorrect
Assigned YES            tp                fp
Not Assigned NO         fn                tn

Recall is the fraction of relevant instances that are retrieved, as shown in
equation 4.1, while precision is the fraction of retrieved instances that are
relevant, as shown in equation 4.2. Both precision and recall are therefore
based on an understanding and a measure of relevance. Precision and recall
values often depend on parameter tuning, and there is a trade-off between
precision and recall. This is why another measure that combines the two is
used, the F-measure, defined as follows:

F-measure = 2 · (Precision · Recall) / (Precision + Recall)        (4.3)
For obtaining estimates of precision and recall relative to the whole category
set, two different methods may be adopted:



Table 4-3 The Global Contingency Table

Category set                        Expert Judgments
C = {c_1, ..., c_|C|}               YES                          NO
Classifier Judgments  YES           TP = Σ_{i=1}^{|C|} TP_i      FP = Σ_{i=1}^{|C|} FP_i
                      NO            FN = Σ_{i=1}^{|C|} FN_i      TN = Σ_{i=1}^{|C|} TN_i

Macroaveraging: precision and recall are first evaluated locally for each
category, and then globally by averaging over the results of the different
categories.

Table 4-4 Macro-Average

Precision:  Pr_M = (1 / |C|) · Σ_{i=1}^{|C|} TP_i / (TP_i + FP_i)
Recall:     Re_M = (1 / |C|) · Σ_{i=1}^{|C|} TP_i / (TP_i + FN_i)

Microaveraging: precision and recall are obtained by globally summing over all
individual decisions. For this, the global contingency table of table 4-3,
obtained by summing over all category-specific contingency tables, is needed.






Table 4-5 Micro-Average

Precision:  Pr_μ = Σ_{i=1}^{|C|} TP_i / Σ_{i=1}^{|C|} (TP_i + FP_i)
Recall:     Re_μ = Σ_{i=1}^{|C|} TP_i / Σ_{i=1}^{|C|} (TP_i + FN_i)

The macro- and micro-averaging formulas for precision and recall are shown in
tables 4-4 and 4-5.
The differences between micro-averaged and macro-averaged results can be
large. Micro-averaged results give equal weight to the documents and thus
emphasize larger topics, while macro-averaged results give equal weight to the
topics and thus emphasize smaller topics more than micro-averaged results do.
As a result, the ability of a classifier to behave well on categories with low
generality (categories with few positive training instances) is emphasized by
macro-averaging and much less so by micro-averaging. Micro-averaged results
are therefore really a measure of performance on the large classes in a test
collection [32]; to get a sense of performance on small classes,
macro-averaged results should be computed. Whether one or the other should be
used obviously depends on the application requirements.
In single-label classification, micro-averaged precision equals micro-averaged
recall and is equal to F1, so only micro F1 is reported for the micro-averaged
results.
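
The sketch below contrasts the two averages for precision; the per-category
counts are invented to show how a large class dominates the micro-average but
not the macro-average:

    def macro_micro_precision(per_class):
        # per_class: list of (TP_i, FP_i) pairs, one per category (tables 4-4 and 4-5)
        macro = sum(tp / (tp + fp) for tp, fp in per_class) / len(per_class)
        tp_sum = sum(tp for tp, _ in per_class)
        fp_sum = sum(fp for _, fp in per_class)
        return macro, tp_sum / (tp_sum + fp_sum)

    print(macro_micro_precision([(90, 10), (5, 15)]))   # macro 0.575, micro ~0.792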






4.4. Evaluation Results
The results obtained for each of the k-nearest neighbour, Rocchio, and Naive
Bayes using MMM classifiers are as follows:
4.4.1. Naive Bayes Algorithm Using MMM
Table 4-6 shows the confusion matrix for the Naive Bayes using MMM algorithm.
The number reported in an entry of the confusion matrix is the number of
documents known to actually belong to the category given by the row header of
the matrix but assigned by NB using MMM to the category given by the column
header.
As shown in table 4-6, 67 documents of the Computer category are classified
correctly into the Computer category, while 3 documents of Computer are
classified incorrectly: 2 of these 3 documents are classified as Education and
1 as Law. The best-classified category is Sport, where 231 documents are
classified correctly. The lowest number of correctly classified documents
belongs to the Education category, where 56 documents are classified correctly
and 12 incorrectly.









Table 4-6 Confusion Matrix Results for NB Using MMM Algorithm

Figure 4-1 shows the recall, precision and f-measure for every category when
the Naive Bayes classifier was used. Precision reaches its highest value (1)
for the Sport and Computer categories, while the lowest precision (0.812)
belongs to the Education category. Recall reaches its highest value (0.996)
for the Sport category and its lowest value (0.804) for the Law category.
F-measure reaches its highest value (0.998) for the Sport category and its
lowest value (0.818) for the Education category. The rest of the figure is
self-explanatory.










Table 4-7 Precision, Recall and F-measure per Category for NB Using MMM

Figure 4-1 and table 4-7 show the precision, recall, and f-measure for all the
categories classified using Naive Bayes with MMM.

Category    Precision  Recall  F-measure
Computer    1          0.957   0.978
Economy     0.864      0.841   0.852
Education   0.812      0.824   0.818
Engineer    0.948      0.948   0.948
Law         0.839      0.804   0.821
Medicine    0.996      0.991   0.993
Politics    0.833      0.918   0.873
Religion    0.905      0.885   0.895
Sport       1          0.996   0.998


Figure 4-1 Result of the Naive Bayes MMM Classification Algorithm

Table 4-8 shows the weighted average of the above values over all categories
for the NB using MMM algorithm; the overall F-measure is 0.908, which is
considered high.


Table 4-8 NB Using MMM Classifier: Weighted Average over the Nine Categories

Classifier                                Precision  Recall  F-measure
Naive Bayes using MMM (weighted average)  0.911      0.907   0.908

4.4.2. Comparison of MMM With Other Techniques and Discussion of Results
First, a comparison was made between the k-NN, Rocchio and Naive Bayes
classifiers. All the results for KNN and Rocchio were taken from [33]. A
summary of the recall, precision and F1 measures is shown in table 4-9.
Naive Bayes gave the best F-measures with MiF1 = 0.9185 and MaF1 = 0.908,
followed by kNN widf with MiF1 = 0.7970 and MaF1 = 0.7871, closely
followed by Rocchio tfidf with MiF1 = 0.7314 and MaF1 = 0.7882. A comparison
of the MiF1 and MaF1 values is shown in figure 4-2.
Table 4-9 Classifier Comparison

Method          MaP     MaR     MaF1    MiF1
kNN tf          0.7100  0.5359  0.6100  0.5711
kNN tfidf       0.8363  0.6902  0.7562  0.7272
kNN widf        0.8094  0.7662  0.7871  0.7970
Rocchio tf      0.5727  0.4501  0.5022  0.4427
Rocchio tfidf   0.8515  0.7337  0.7882  0.7314
Rocchio widf    0.7796  0.7199  0.7484  0.6968
Naive Bayes     0.911   0.907   0.908   0.9185

Figure 4-2 shows the MaF1 and MiF1 values for all the classifiers (KNN,
Rocchio, and Naive Bayes); it can be seen that Naive Bayes using MMM achieved
the highest value for both measures.

Figure 4-2 MaF1, MiF1 Comparison for Classifiers


Figure 4-3 shows the macro precision for all the classifiers. The highest
value belongs to Naive Bayes using MMM, followed by Rocchio tfidf, with KNN
tfidf not far behind Rocchio.

Figure 4-3 MaP Comparison for Classifiers

Figure 4-4 shows the macro recall for all the classifiers. The highest value
belongs to Naive Bayes using MMM, followed by KNN widf, with Rocchio tfidf not
far behind KNN.


Figure 4-4 MaR Comparison for Classifiers


It is clear that the Naive Bayes classifier has the highest values for the
three measures, with the KNN classifier in second place; the worst values in
the three measures belong to Rocchio. There is also a disproportion among the
precision, recall and f-measure values for k-NN, which reaches a high
precision (0.83) but a very low recall (0.53). The precision, recall and
f-measure values of the other two classifiers, Rocchio and Naive Bayes, are
more stable, as shown in figure 4-5.


Figure 4-5 Precision, Recall and F-Measure for the Three Classifiers

4.5. Results of Naive Bayes Algorithm (MMM) with 5070 Documents
Another experiment was conducted. The collected corpus, shown in table 4-10,
contains 5070 documents that vary in length; these documents fall into six
categories: Business, Entertainment, Middle East news, Sport, World news, and
Science and Technology.



Table 4-10 Categories and Their Distributions in the Corpus (5070 Documents)
NO Category Number
1 Business 836
2 Entertainment 474
3 Middle East news 1462
4 Sport 762
5 World news 1010
6 Science and Technology 526

Table 4-11 shows the confusion matrix for the Naive Bayes using MMM algorithm.
The lowest number of correctly classified documents belongs to the
Entertainment category, where 400 documents are classified correctly and 74
incorrectly.

Table 4-11 Confusion Matrix Results for NB Algorithm in the Corpus (5070 Documents)


Figure 4-6 shows the recall, precision and f-measure for every category when
the Naive Bayes classifier was used. Precision reaches its highest value
(0.991) for the Sport category, while the lowest precision (0.746) belongs to
the Entertainment category. Recall reaches its highest value (0.979) for the
Sport category and its lowest value (0.832) for the Middle East news category.
F-measure reaches its highest value (0.985) for the Sport category and its
lowest value (0.792) for the Entertainment category. The rest of the figure is
self-explanatory.


Table 4-12 Precision, Recall and F-measure per Category for NB Using MMM in the Corpus (5070 Documents)

Figure 4-6 shows the precision and recall for all the categories classified
using Naive Bayes with MMM.



Figure 4-6 Result of the Naive Bayes Classification Algorithm

Table 4-13 shows the weighted average of the above values over all categories
for the NB algorithm; the overall F-measure is 0.884.

Table 4-13 NB Using MMM Classifier: Weighted Average over the Six Categories in the Arabic Corpus (5070 Documents)

Classifier                                Precision  Recall  F-measure
Naive Bayes using MMM (weighted average)  0.882      0.890   0.884

Comparing the overall results from tables 4-8 and 4-13 shows a slight
degradation in precision, recall, and F-measure. This is because the test set
was still evaluated with 4-fold cross-validation; with a percentage train/test
split the results would differ, since the classifier would learn from more
data.


4.6. Summary
The Naive Bayes classifier using MMM outperformed the k-NN and Rocchio
classifiers: Naive Bayes (MMM) achieved the best results, and the other
techniques came after it.































5. Chapter Five: Conclusion and Future Work










5.1. Conclusion:
Text classification for the Arabic language has been investigated in this
project. Three classifiers were compared: KNN, Rocchio, and Naive Bayes using
the Multinomial Mixture Model (MMM).
Unclassified documents were pre-processed by removing stopwords and
punctuation marks. The remaining words were stemmed and stored in feature
vectors; every test document has its own feature vector. Finally, each
document is classified into the best class according to the classifier
technique.
The accuracy of the classifiers was measured using recall, precision and
F-measure. For the project experiments, the classifiers were tested using 1445
documents. The results show that NB using the multinomial model outperformed
the other two classifiers.

5.2. Future Work:
As future work, we plan to continue working with Arabic text categorization,
as this area is not widely explored in the literature, and to try the
classifiers on a larger collection. Further directions are:
Applying an auxiliary feature method with the multinomial model in order to
improve classification accuracy.
Comparing the Naive Bayes MMM model with different models such as the
multivariate Bernoulli [9].
Evaluating BPSO feature selection with the multinomial classifier using the
same Arabic database mentioned in [35], and then comparing the two sets of
results.

References

[1] Hasan, M.M.: 'Can Information Retrieval techniques automatic assessment challenges?', 2009, pp. 111-338

[2] Ghwanmeh, S., Kanaan, G., Al-Shalabi, R., and Ababneh, A.: 'Enhanced Arabic Information Retrieval System based on Arabic Text Classification', 2007, pp. 461-465

[3] Duwairi, R.: 'Arabic Text Categorization', International Arab Journal of Information Technology, 2007, 4, (2)

[4] Ko, Y., Park, J., and Seo, J.: 'Improving text categorization using the importance of sentences', Information Processing & Management, 2004, 40, (3), pp. 65-79

[5] Ko, Y., and Seo, J.: 'Text classification from unlabeled documents with bootstrapping and feature projection techniques', Information Processing & Management, 2009, 45, (3), pp. 70-83

[6] Chen, J., Huang, H., Tian, S., and Qu, Y.: 'Feature selection for text classification with Naive Bayes', Expert Systems with Applications, 2009, 36, (3, Part 1), pp. 5432-5435

[7] Mesleh, A.M., and Kanaan, G.: 'Support vector machine text classification system: Using Ant Colony Optimization based feature subset selection', 2008, pp. 341-148

[8] Duwairi, R.M.: 'Arabic Text Categorization', International Arab Journal of Information Technology, 2007, 4, (2), pp. 325-132

[9] Duwairi, R.M.: 'Machine learning for Arabic text categorization', Journal of the American Society for Information Science and Technology, 2006, 57, (8), pp. 1005-1010

[10] Abboud, P.F., and McCarus, E.N.: 'Elementary Modern Standard Arabic: Volume 3, Pronunciation and Writing; Lessons 1-10' (Cambridge University Press, 1983)

[11] Chen, A., and Gey, F.C.: 'Building an Arabic Stemmer for Information Retrieval', 2002

[12] Zrigui, M., Ayadi, R., Mars, M., and Maraoui, M.: 'Arabic Text Classification Framework Based on Latent Dirichlet Allocation', Journal of Computing and Information Technology, 2012, 20, (2), pp. 125-140

[13] Deisy, C., Gowri, M., Baskar, S., Kalaiarasi, S., and Ramraj, N.: 'A novel term weighting scheme MIDF for Text Categorization', Journal of Engineering Science and Technology, 2010, 5, (1), pp. 94-107

[14] https://sites.google.com/site/motazsite/Home/osac, 2010

[15] Zrigui, M., Ayadi, R., Mars, M., and Maraoui, M.: 'Arabic Text Classification Framework Based on Latent Dirichlet Allocation', Journal of Computing and Information Technology, 2012, 20, (2), pp. 125-140

[16] Settles, B.: 'Active learning literature survey', University of Wisconsin, Madison, 2010

[17] Noaman, H.M., Elmougy, S., Ghoneim, A., and Hamza, T.: 'Naive Bayes Classifier Based Arabic Document Categorization', The 7th International Conference on Informatics and Systems (INFOS), IEEE, 2010, pp. 1-5

[18] Pang, B., Lee, L., and Vaithyanathan, S.: 'Thumbs up?: sentiment classification using machine learning techniques', Association for Computational Linguistics, 2002, pp. 79-86

[19] Pang, B., and Lee, L.: 'Opinion mining and sentiment analysis', Foundations and Trends in Information Retrieval, 2008, 2, (1-2), pp. 1-135

[20] Turney, P.D.: 'Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews', Association for Computational Linguistics, 2002, pp. 417-424

[21] Gamon, M., and Aue, A.: 'Automatic identification of sentiment vocabulary: exploiting low association with known sentiment terms', Association for Computational Linguistics, 2005, pp. 57-64

[22] Mesleh, A.M.d.: 'Feature sub-set selection metrics for Arabic text classification', Pattern Recognition Letters, 2011, 32, (14), pp. 1922-1929

[23] Zhang, J., Chen, L., and Guo, G.: 'Projected-prototype based classifier for text categorization', Knowledge-Based Systems, 2013, 49, pp. 179-189

[24] Duwairi, R.: 'Arabic text categorization', International Arab Journal of Information Technology, 2007, 7

[25] Kanaan, G., Al-Shalabi, R., Ghwanmeh, S., and Al-Ma'adeed, H.: 'A comparison of text-classification techniques applied to Arabic text', Journal of the American Society for Information Science and Technology, 2009, 60, (9), pp. 1836-1844

[26] El-Halees, A.: 'Arabic text classification using maximum entropy', The Islamic University Journal (Series of Natural Studies and Engineering), 2007, 15, pp. 157-167

[27] Mesleh, A.M.: 'Chi square feature extraction based SVMs Arabic language text categorization system', Journal of Computer Science, 2007, 1, (6), pp. 410

[28] Al-Shalabi, R., Kanaan, G., and Gharaibeh, M.: 'Arabic text categorization using kNN algorithm', Proc. 4th International Multiconference on Computer Science and Information Technology (CSIT 2006), 2006

[29] Syiam, M.M., Fayed, Z.T., and Habib, M.: 'An intelligent system for Arabic text categorization', International Journal of Intelligent Computing and Information Sciences, 2006, 6, (1), pp. 1-19

[30] Sawaf, H., Zaplo, J., and Ney, H.: 'Statistical classification methods for Arabic news articles', Natural Language Processing in ACL2001, Toulouse, France, 2001

[31] Hmeidi, I., Hawashin, B., and El-Qawasmeh, E.: 'Performance of KNN and SVM classifiers on full word Arabic articles', Advanced Engineering Informatics, 2008, 22, (3), pp. 106-111

[32] Alsaleem, S.: 'Automated Arabic Text Categorization Using SVM and NB', International Arab Journal of e-Technology, 2011, 2, (2), pp. 124-128

[33] Al-diabat, M.: 'Arabic Text Categorization Using Classification Rule Mining', Applied Mathematical Sciences, 2012, 6, (81), pp. 4033-4046

[34] Al-Kabi, M., Wahsheh, H., Alsmadi, I., Al-Shawakfa, E., Wahbeh, A., and Al-Hmoud, A.: 'Content-based analysis to detect Arabic web spam', Journal of Information Science, 2012, 38, (3), pp. 284-296

[35] Mitra, V., Wang, C.-J., and Banerjee, S.: 'Text classification: A least square support vector machine approach', Applied Soft Computing, 2007, 7, (1), pp. 908-914
