
International Journal of Computer Information Systems,

Vol.4, No.2, 2012



Document Clustering Based on New Ontological
Representation and Fuzzy Inference System

Maryam Amiri, Hassan Khotanluo
Computer Engineering Dep.
Bu Ali Sina University
Hamedan, Iran
{m.amiri,hkh}@basu.ac.ir

Kazim Fouladi
Computer Engineering Dep.
Tehran University
Tehran, Iran
kazim@fouladi.ir


Abstract- Data mining is the process of discovering previously
unknown knowledge in large amounts of data. Text clustering, the
unsupervised grouping of similar documents, is one of the
important techniques of text mining. The most important steps in
document clustering are how documents are represented and how
the similarity between them is measured. By introducing a new
ontological representation and a matching similarity measure,
this research aims to improve the performance of text
clustering. The text clustering algorithm is investigated in
three aspects: the ontological representation of documents, the
document similarity measure, and a fuzzy inference system for
measuring the final similarities. The clustering itself is
carried out by hierarchical clustering. For evaluation, the
proposed method is compared with the Naïve Bayes method and an
ontology-based algorithm. The results indicate that the proposed
method improves precision, recall, F-measure and accuracy and
produces more meaningful results.
Keywords- semantic graph; ontological representation; ontology;
similarity measure; fuzzy inference system
I. INTRODUCTION
Solving the information overload problem requires more
effective methods of document mining. An appropriate conceptual
representation model of documents is fundamental to document
mining. Existing document mining systems have shown
limitations in delivering meaningful output. Gathering,
management and organization in these systems are based on
low-level information representation rather than on higher
levels of knowledge. They lack an understanding of document
contents and instead mine texts by the presence or absence of
keywords. Breaking through this limitation requires document
mining systems to understand the texts they process and to
pursue the mining tasks according to that understanding.
Semantic-based document mining systems focus on the analysis of
contents. In these systems, the meaning of a document resides
in its structure, its constituents, and reasoning about the
semantics of words and phrases. Similarity measures are defined
on these representations to yield meaningful distance
assessments. Mining processes such as document clustering,
document classification and information retrieval that make
use of some or all of these components form what are called
semantic-based document mining processes. Such systems produce
more meaningful results than could be accomplished otherwise
[1].
The goal of this research is to introduce a new framework for
document clustering systems based on semantic understanding of
document contents. The framework introduces a new ontological
representation of documents and a new similarity measure
between documents. In addition, a fuzzy inference system is
defined to estimate the similarity between two documents, which
improves the precision of concept- and ontology-based document
mining.
The remainder of the paper is organized as follows. After a
brief overview of document representation and similarity
measure models in Section II, the proposed ontological
representation and the similarity measure related to it are
described in Sections III and IV, respectively. Section V
discusses the fuzzy inference system and the clustering
algorithm. The evaluation of the proposed method is presented
in Section VI. We conclude in Section VII with a summary and
some directions for future work.
II. REVIEW OF DOCUMENT REPRESENTATION AND
SIMILARITY MEASURE
Conventional document representation models are based on
whether a document contains specific words, or on how
frequently those words appear. For example, in the vector space
model [2-5] documents are represented by vectors containing
the frequency of occurrence of all possible words in a
document set. Since most words of the document set occur rarely
or not at all in any particular document, many of these
features (words) will have low or zero frequencies. Therefore,
the features used for document representation are selected
according to their importance, as dictated by criteria such as
term frequency-inverse document frequency, information gain,
mutual information and term length [2,6]. Moreover, before
feature selection is applied, pre-processing such as noise word
removal and stemming [7-8] is performed.

February Issue Page 6 of 62 ISSN 2229 5208
Despite the widespread use of word-based approaches to
represent documents, it is believed that these approaches
contribute to the unreliable performance of document mining
systems. They treat documents as bags of words and ignore
meaning. This deficiency can cause similarity measures either
to miss the contextual similarity of text passages because the
passages use different words, or to perceive contextually
dissimilar passages as similar because their words resemble
each other [1].
Some attempts have been made to improve text representation.
Some keyword-based methods, such as N-grams [9] and bigrams
[10], slide a window over the text to extract compound terms.
Another method is latent semantic indexing (LSI), an algebraic
algorithm used to represent documents [11-13]. LSI assumes that
there is some underlying or latent structure in the pattern of
word usage across documents, and that statistical techniques
can be used to estimate this structure. Learning algorithms for
mapping sentences to meaning are presented in [14-16].
Newer methods use background (domain) knowledge. An ontology
is a collection of concepts and their interrelationships which
can collectively provide an abstract view of an application
domain. Figure 1 shows a simple example of an ontology.
Ontologies can be used to increase the ability to reason about
natural language texts [17].
Recently, more studies have focused on ontology-based natural
language processing. These studies try to define
representations based on an ontology [18] in order to use
ontological resources in language processing approaches [19].
In conceptual representations based on an ontology, each word
is mapped to a higher-level concept in the ontology, and the
resulting sequence of concepts is considered instead [1]. The
key point in ontological representation is mapping words to the
appropriate concepts of the domain ontology. In [20], an
ontology-based conceptual representation is presented which
maps the meaning of each text into an acyclic graph. Sentences
are labeled with part-of-speech (POS) tags and chunks, and the
words are mapped to the corresponding concepts of the domain
ontology. In this way, a conceptual representation for the test
collection is created from the training collection.
After documents are converted into a specified format, a
similarity measure is used to determine the distance between
documents. There are many ways to measure similarity, and they
depend on the document representation model. In the vector
space model, the feature space is a geometric space and
documents are points in this space. Thus, the distance between
documents can be easily calculated [1]. In [21], a similarity
criterion is defined on sequences of words to measure
similarities between documents. This criterion considers the
number of word sequences common to two documents, together
with their length, frequency and level of significance. In the
graph and ontological document representation, a document is
decomposed into main and detailed concepts. Then, for each of
the two types (main and detailed concepts), the similarity
between two documents is calculated according to

$$X = \frac{ST}{TT}$$

where ST is the number of main (or detailed) concepts common to
both documents and TT is the total number of main (or detailed)
concepts in both documents [22].
III. PROPOSED ONTOLOGICAL REPRESENTATION
The method presented in this paper produces a weighted
ontological graph. Based on the contents and concepts of a
document, key concepts are identified and their weights are
determined according to their importance in the document. The
conceptual structure of the document is then identified: the
concepts extracted in the previous step are connected to each
other, according to the structure of the document, via directed
and weighted edges. The proposed method is discussed in more
detail below.
A. The Initial Processing
A paragraph is a collection of several sentences that discuss
a particular concept. In pre-processing, the processing units
are paragraphs. First, the text is divided into its paragraphs.
Then each paragraph is pre-processed by tokenization,
elimination of noise words and stemming. Finally, two sets of
tokens are kept for each paragraph: the set of original tokens
and the set of their stems.
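
The pre-processing step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list is a tiny placeholder, `crude_stem` is a simplified stand-in for a real stemmer such as Porter's [8], and the function names are hypothetical. Lists are used instead of sets so that token order is preserved.

```python
import re

# Tiny illustrative stop-word list (assumption); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "with"}


def crude_stem(token):
    """Strip a few common English suffixes (a stand-in for Porter stemming)."""
    for suffix in ("ing", "es", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Return one (original_tokens, stems) pair per paragraph of `text`."""
    result = []
    # Paragraphs are assumed to be separated by blank lines.
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        tokens = [t for t in re.findall(r"[a-z]+", paragraph.lower())
                  if t not in STOP_WORDS]
        stems = [crude_stem(t) for t in tokens]
        result.append((tokens, stems))
    return result


paragraphs = preprocess("The hotel offers rooms.\n\nGuests are arriving.")
```

Each element of `paragraphs` holds the two token collections that the later mapping step works on.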
B. Mapping Words to Ontology Concepts
Figure 1- Sample ontology, including concepts and relations [17]

In this research, the ontology is converted to a relational
database (RDBMS) so that data can be extracted with query
languages. The ontology is read from an OWL file and then
stored in a relational database. The created table is
converted into several smaller tables of classes,
instances, a class-class matrix and a class-instance matrix.
The class-class matrix contains the number of relationships
between each pair of classes. The class-instance matrix records
which instance belongs to which class. Using the OneLook
reverse dictionary [23], a further table of concept words is
built. For each concept in the ontology, one hundred words
associated with that concept are looked up in the dictionary
and stored in the database. The extracted words are reviewed by
domain experts, and the most relevant words are selected. In
order to map words to concepts, several levels of tokens are
examined with respect to the concepts in the ontology. First,
direct concepts are considered. A direct concept is a term
(token) in the document that is itself a concept in the
ontology. For each direct concept, two levels of its children
and parents are also collected; these are called type-1
indirect concepts (indirect-1). If the term is not a direct
concept, it is checked whether it is an instance. In this case,
two levels of the parents of that instance are added to the
indirect-1 concepts. If the term is neither a direct concept
nor an instance, indirect concepts of type 2 (indirect-2) are
considered. This type of indirect concept is obtained by
searching the concept-words table. Finally, there is a
collection for each paragraph. This collection includes the
direct concepts and the indirect concepts of types 1 and 2,
together with the frequency with which each is referenced in
the paragraph and the distance of the term to the main concept.
The number of times each concept is selected as a parent or as
a child is also recorded.
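
The lookup order described above (direct concept, then instance, then concept-words table) can be sketched as follows. The toy ontology, the table contents and all function names are hypothetical; in the paper these tables live in a relational database, whereas plain dictionaries are used here for brevity.

```python
# Hypothetical toy ontology: concept -> list of parent concepts.
PARENTS = {
    "accommodation": [],
    "hotel": ["accommodation"],
    "luxury_hotel": ["hotel"],
    "room": ["accommodation"],
}
INSTANCES = {"hilton": "luxury_hotel"}                  # instance -> class
CONCEPT_WORDS = {"inn": ["hotel"], "suite": ["room"]}   # word -> concepts


def parents_of(concept):
    return PARENTS.get(concept, [])


def children_of(concept):
    return [c for c, ps in PARENTS.items() if concept in ps]


def neighbours(concept, step, levels=2):
    """Concepts reachable in up to `levels` steps, mapped to their distance."""
    found, frontier = {}, [concept]
    for distance in range(1, levels + 1):
        frontier = [n for c in frontier for n in step(c)]
        for n in frontier:
            found.setdefault(n, distance)
    return found


def map_term(term):
    """Classify a term as direct, indirect-1 (parents/children) or indirect-2."""
    result = {"direct": [], "parents": {}, "children": {}, "indirect2": []}
    if term in PARENTS:                       # the term is itself a concept
        result["direct"] = [term]
        result["parents"] = neighbours(term, parents_of)
        result["children"] = neighbours(term, children_of)
    elif term in INSTANCES:                   # the term is an instance
        cls = INSTANCES[term]
        result["parents"] = {cls: 1}
        for parent in parents_of(cls):
            result["parents"].setdefault(parent, 2)
    else:                                     # fall back to the word table
        result["indirect2"] = list(CONCEPT_WORDS.get(term, []))
    return result
```

Keeping parents and children in separate dictionaries preserves the role in which each indirect concept was selected, which the edge-drawing step later depends on.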
C. Disambiguation Of Concepts
The previous section defined indirect concepts of types 1 and
2. Ambiguity may arise in relation to these concepts. A direct
concept may have several parents or children as type-1 indirect
concepts, and a word may correspond to several concepts. In all
these cases the problem of ambiguity arises. The approach
proposed in this paper resolves the ambiguity as follows:
ambiguous concepts are disambiguated with regard to the other
direct concepts and unambiguous indirect concepts in each
paragraph, and the most relevant concepts are selected. The
disambiguation method first determines the importance of the
ambiguous concept in the relevant paragraph. Equation (1)
defines the importance m of an ambiguous concept oc, where
oc.counter is the number of references to this concept, oc.tag
is the number of times this concept has been identified as
ambiguous, and oc.distance is the distance between the indirect
ambiguous concept and the direct concepts.

$$m = 0.7 \times \frac{oc.counter - oc.tag}{oc.distance} \tag{1}$$


If the concept is entirely ambiguous, m is set to 0.1. The
concept is then evaluated against the other direct concepts and
the unambiguous indirect concepts. Formula (2) gives the
evaluation against a direct concept, where oc2 is a direct
concept and oc is an indirect ambiguous concept. The function
find_distance returns the distance between two concepts in the
ontology, and rout_number returns the number of possible routes
between the two concepts. Formula (2) is calculated for all
direct concepts and the results are added.

$$mark = oc2.counter \times \frac{2}{find\_distance(oc, oc2)} \times rout\_number(oc, oc2) \times m \tag{2}$$
Formula (3) evaluates an ambiguous concept against the
unambiguous indirect concepts. Here oc2 is an unambiguous
indirect concept and oc is an indirect ambiguous concept. This
formula is calculated for all unambiguous indirect concepts and
the results are added.

$$mark = oc2.counter \times \frac{1}{oc2.distance} \times \frac{1.5}{find\_distance(oc, oc2)} \times rout\_number(oc, oc2) \times m \tag{3}$$
The results of formulas (2) and (3) are added. If the sum
exceeds a threshold, the ambiguity is removed from the concept,
that is, the concept is kept; otherwise the concept is deleted.
If the ratio of disambiguated concepts to all ambiguous
concepts of the paragraph is less than a specified threshold,
the unambiguous concepts of the previous and next paragraphs
are also examined when disambiguating the ambiguous concepts of
the current paragraph. Thus, when the current paragraph
contains few unambiguous concepts, its ambiguous concepts are
evaluated more carefully and are not removed without cause.
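
The scoring in equations (1) to (3) can be sketched as below. Several points are assumptions rather than statements from the paper: the factors in each formula are taken to be multiplied, "entirely ambiguous" is read as every reference being flagged (so that equation (1) would give a non-positive value), and `find_distance` and `rout_number` are passed in as callables because their implementation depends on the ontology store. The `Concept` class is a hypothetical container.

```python
from dataclasses import dataclass


@dataclass
class Concept:
    name: str
    counter: int        # number of references to the concept in the paragraph
    tag: int = 0        # times the concept was flagged as ambiguous
    distance: int = 1   # distance to the direct concept(s)


def importance(oc):
    """Equation (1); falls back to 0.1 for an entirely ambiguous concept."""
    m = 0.7 * (oc.counter - oc.tag) / oc.distance
    return m if m > 0 else 0.1


def disambiguation_score(oc, direct, indirect, find_distance, rout_number):
    """Sum of formula (2) over direct and formula (3) over indirect concepts."""
    m = importance(oc)
    score = 0.0
    for oc2 in direct:      # formula (2)
        score += (oc2.counter * (2 / find_distance(oc, oc2))
                  * rout_number(oc, oc2) * m)
    for oc2 in indirect:    # formula (3)
        score += (oc2.counter * (1 / oc2.distance)
                  * (1.5 / find_distance(oc, oc2))
                  * rout_number(oc, oc2) * m)
    return score


# Illustrative values only: constant distance 2 and a single route.
spa = Concept("spa", counter=3, tag=1, distance=2)
score = disambiguation_score(
    spa,
    direct=[Concept("hotel", counter=4)],
    indirect=[Concept("room", counter=2, distance=2)],
    find_distance=lambda a, b: 2,
    rout_number=lambda a, b: 1,
)
```

A caller would keep the concept when `score` exceeds the chosen threshold and drop it otherwise; the threshold value itself is not given in the text.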
D. Extraction of the Conceptual Hierarchy Structure
After the above steps, each paragraph has a set of direct and
indirect concepts, together with some information about these
concepts. The final stage is weighting the concepts and drawing
the document graph. To weight a document concept, the number of
references to each concept is divided by the total number of
references to the specified concept in the whole document.
Owing to the importance of direct concepts for the document
content, direct concepts are weighted with a coefficient of 0.9
and the other concepts in the paragraphs with 0.7. After the
concept weights are calculated, direct concepts and instances
are selected, and their relationships are extracted using the
class-class matrix and the class-instance matrix. This yields
the graph nodes and their associated weights.
To draw the graph edges and calculate their weights, all
concepts of type indirect-2 are treated as direct concepts.
Indirect-1 concepts that are homonymous with direct concepts
are also examined. If such an indirect homonymous concept was
selected as a child, the edge is directed from the direct
concept that is its parent to the target direct concept. If it
was selected as a parent, the edges are directed from the
direct children of this concept to the concept itself. The
relationship matrix is used to calculate the edge weights of
the graph. The weights of indirect homonymous concepts are
multiplied by the number of relationships
between the two concepts in question and divided by the number
of nodes associated with their parents (or children). A
coefficient proportional to the distance of the indirect
homonymous concept is also applied. Formula (4) covers parents
at distance one. If i is a direct concept that is homonymous
with an indirect parent at distance one, then for each of its
children j the weight of the edge from j to i is calculated by
(4), where $sum\_child_i$ is the number of children of concept
i at distance one, $concept_i.weight$ is the weight of the
indirect concept homonymous with the main concept, and
$matrix[i, j]$ is the number of relationships between i and j.
The coefficient W is set according to the distance of the
indirect concept. Formula (4) is written in the same way for
children.

$$weight_{ji} = concept_i.weight \times W \times \frac{matrix[i, j]}{sum\_child_i} \tag{4}$$
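
As a small worked illustration of formula (4), the sketch below computes the weights of the edges from each child j to a parent concept i. The function name, the nested-dictionary layout of the relationship matrix and the sample numbers are assumptions made for the example.

```python
def edge_weights(i, children, concept_weight, matrix, w=1.0):
    """Weights of the edges j -> i for every child j of i, as in formula (4).

    concept_weight: weight of the indirect concept homonymous with i
    matrix[i][j]:   number of relationships between i and j
    w:              coefficient chosen according to the concept's distance
    """
    sum_child = len(children)
    return {(j, i): concept_weight * w * matrix[i][j] / sum_child
            for j in children}


# Illustrative numbers only.
relations = {"hotel": {"luxury_hotel": 2, "motel": 1}}
weights = edge_weights("hotel", ["luxury_hotel", "motel"],
                       concept_weight=0.6, matrix=relations, w=0.5)
```

With these sample values the edge from `luxury_hotel` to `hotel` gets 0.6 x 0.5 x 2 / 2 = 0.3 and the edge from `motel` gets 0.15.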
Finally, after the cases of children and parents at different
distances have been calculated, a directed graph is produced
whose nodes are the direct concepts. The concept weights and
edge weights are calculated as described above and by formula
(4). The graph of each document is stored in the database as a
matrix, so that it can be retrieved easily when the similarity
matrix between documents is calculated and the documents are
mined.
As an example, Figure 2 shows a sample document from the hotel
domain (using the corresponding ontology). Figure 3 shows the
ontological graph generated for it by the proposed method.
This graph includes the concepts, their weights and the edges.
According to the concept weights, the document is about the
hotel and luxury hotel concepts. Using these weights, general,
significant and detailed concepts can be identified easily. In
other words, a weight can be interpreted as a fuzzy membership
degree, that is, the degree to which the document belongs to
the concept. The same interpretation applies to the weights and
directions of the edges. In fact, the ontological schema of a
document is a subset of the domain ontology. In the domain
ontology, the weights of nodes and edges are 0.100, but in
documents these weights take different values according to
content and context.
IV. SIMILARITY MEASURE RELATED TO ONTOLOGICAL
REPRESENTATION
The most important steps for improving document mining
procedures are the conceptual representation and the
similarity measure related to this representation. The better
a similarity measure approximates the levels of difference and
similarity between documents, the more appropriate and
practical it is.







The proposed ontological representation has four meaningful
parts that are used to determine similarities and differences
between documents: the concepts, the weight of each concept,
the edges, and the weight of each edge. The proposed criterion
is calculated separately for concepts and for edges, and its
output is a separate similarity matrix for each. In the next
steps, mining results are improved on the basis of the
calculated similarity matrices by means of a fuzzy inference
system, fuzzy rules and a document clustering algorithm.
Figure 2- A sample document from the hotel domain
Figure 3- Ontological graph of the document in Figure 2

The proposed criterion considers the membership degree,
priority and importance of each concept, and approximates the
similarity between documents based on the concepts (or edges)
common to both documents. For every concept common to two
documents, two weights $w_1$ and $w_2$ are calculated. These
weights are then used to approximate the similarity of the two
documents. Formulas 5 and 6 express the calculation of $w_1$
and $w_2$, respectively, where $w_1$ is the weight related to
differences in the priority and importance of concepts and
$w_2$ is the weight related to the different weights of common
concepts in the two documents. In Formula 5, $concept_i$ is the
common concept $concept$ in document i, and $order(concept_i)$
is the priority of $concept$ in document i. x and y are the two
documents whose similarity is to be calculated.
$max\_length(x, y)$ is defined as the maximum difference of
importance of common concepts between the two documents. In
Formula 6, $weight(concept_x)$ is the weight of the common
concept $concept$ in document x.
$$w_1 = \frac{max\_length(x, y) - \left| order(concept_x) - order(concept_y) \right|}{max\_length(x, y)} \tag{5}$$

$$w_2 = \frac{1 - \left| weight(concept_x) - weight(concept_y) \right|}{1} \tag{6}$$
Formula 7 gives the similarity criterion for measuring the
similarity between two documents x and y. In this formula, m is
the number of concepts common to both documents, $|\cdot|$
denotes the size of a collection (its number of concepts) and
$|x \cup y| = |x| + |y| - |x \cap y|$.

$$sim(x, y) = \frac{\sum_{i=1}^{m} w_1 \times w_2}{|x \cup y|} \tag{7}$$
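
A minimal sketch of formulas (5) to (7) follows, with a document represented as a dictionary from concept name to weight. Three points are assumptions, not taken from the paper: the priority `order` is taken to be the rank of a concept when the document's concepts are sorted by descending weight, `max_length` is taken to be the size of the larger document (an upper bound on any rank difference), and absolute differences are used so that the measure is symmetric.

```python
def rank_concepts(doc):
    """Rank concepts by descending weight; rank 1 is the most important."""
    ordered = sorted(doc, key=lambda c: (-doc[c], c))
    return {concept: pos for pos, concept in enumerate(ordered, start=1)}


def similarity(x, y):
    """Similarity of two documents given as {concept: weight} dictionaries."""
    common = x.keys() & y.keys()
    if not common:
        return 0.0
    order_x, order_y = rank_concepts(x), rank_concepts(y)
    max_length = max(len(x), len(y))            # assumed bound, see lead-in
    total = 0.0
    for concept in common:
        # Formula (5): penalise differences in priority.
        w1 = (max_length - abs(order_x[concept] - order_y[concept])) / max_length
        # Formula (6): penalise differences in weight.
        w2 = 1 - abs(x[concept] - y[concept])
        total += w1 * w2
    # Formula (7): normalise by the size of the union of both concept sets.
    return total / len(x.keys() | y.keys())


doc_a = {"hotel": 0.9, "room": 0.4}
doc_b = {"hotel": 0.5, "pool": 0.7}
```

Under these assumptions a document compared with itself scores exactly 1, which matches the unit diagonal of the similarity matrices described in Section V. The same function can be applied to edges by using edge identifiers as dictionary keys.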
V. FUZZY INFERENCE SYSTEM AND CLUSTERING
The similarity between each pair of documents can be computed
with the similarity measure given in (7). To this end, the main
concepts, detailed concepts and main edges of each document
graph are identified first. The final similarity between two
documents is then approximated by a fuzzy inference system.
Section A describes the fuzzy inference system in detail, and
Section B deals with document clustering.
A. Fuzzy Inference System
A fuzzy inference system consists of three parts: a fuzzifier,
a fuzzy inference engine and a defuzzifier. The fuzzifier
converts a crisp variable into a linguistic variable through
defined membership functions. The inference engine produces a
fuzzy output value through fuzzy (if-then) rules. The
defuzzifier converts the fuzzy output value into a crisp value
through defined membership functions. The process is shown in
Figure 4. The inference system designed here has three inputs:
the similarity level of main concepts, the similarity level of
detailed concepts and the similarity level of main edges in the
document graphs. The detailed and main concepts of each
document are determined relative to the weights in that
document. First, the maximum concept weight in the document is
identified. Then the detailed and main concepts of the document
are determined through formula 8.



In formula 8, max is the maximum concept weight in the document
and co.weight is the weight of the concept under consideration.


The similarity levels of main concepts, detailed concepts and
main edges are computed for each pair of documents from the
extracted concepts with the similarity measure (formula 7).
This produces three similarity matrices $S_1$, $S_2$ and $S_3$
of dimension $n \times n$, where n is the number of documents.
$s^{i}_{pq}$ is the similarity level of documents p and q in
matrix $S_i$. As shown below, $S_i$ is a symmetric matrix and
the similarity of each document with itself is one.

$$S_i = \begin{bmatrix} 1 & s^{i}_{12} & \cdots & s^{i}_{1n} \\ s^{i}_{21} & 1 & \cdots & s^{i}_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ s^{i}_{n1} & \cdots & \cdots & 1 \end{bmatrix}$$

As shown in Figures 5-7, three membership functions, Low,
Medium and High, are defined for each input of the inference
system. The horizontal axis shows the similarity level between
documents and the vertical axis shows the membership degree.
A Mamdani fuzzy inference engine is applied to the fuzzified
input values. The Mamdani fuzzy inference model uses the
min-min-max operator [24]. Figure 8 shows an example of a
Mamdani fuzzy inference system.
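$$\text{concept type} = \begin{cases} \text{main concept} & \text{if } co.weight > 0.1 \text{ and } co.weight > \dfrac{max}{2 + (max \times 0.1)} \\ \text{detail concept} & \text{if } co.weight > 0.05 \text{ and } co.weight < \dfrac{max}{2 + (max \times 0.1)} \end{cases} \tag{8}$$

Figure 4- Fuzzy Inference System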









Table 1- Fuzzy rules for the fuzzy inference engine

No   Main_Concept   Detailed_Concept   Main_Edge   Similarity
1    High           High               High        High
2    High           High               Medium      High
3    High           High               Low         High
4    High           Medium             High        Medium
5    High           Medium             Medium      High
6    High           Medium             Low         High
7    High           Low                High        Medium
8    High           Low                Medium      Medium
9    High           Low                Low         High
10   Low            High               High        Low
11   Low            High               Medium      Medium
12   Low            High               Low         Medium
13   Low            Medium             High        Low
14   Low            Medium             Medium      Low
15   Low            Medium             Low         Medium
16   Low            Low                High        Low
17   Low            Low                Medium      Low
18   Low            Low                Low         Low
19   Medium         High               High        Medium
20   Medium         High               Medium      High
21   Medium         High               Low         High
22   Medium         Medium             High        Medium
23   Medium         Medium             Medium      Medium
24   Medium         Medium             Low         Medium
25   Medium         Low                High        Low
26   Medium         Low                Medium      Medium
27   Medium         Low                Low         Medium

The designed inference engine uses the 27 fuzzy rules listed in
Table 1. Each row of the table is read as follows (Rule 1):
if main_concept is high and detail_concept is high
and main_edge is high then similarity is high
The output of the fuzzy system has three membership functions
with the similarity values Low, Medium and High. Finally, the
final similarity value between two documents is estimated
through defuzzification. The computation of similarity between
documents can therefore be summarized in the following steps:
- Compute the similarities between the main concepts, detailed
concepts and main edges of the documents
- Apply the fuzzy inference system to produce the fuzzy
output
- Defuzzify the output to obtain the final similarities
between documents
- Cluster the documents based on the final similarity
matrix

Figure 5- Low similarity membership function
Figure 6- Medium similarity membership function
Figure 7- High similarity membership function
Figure 8- Example of a Mamdani fuzzy inference system
Defuzzification of the output similarity values takes two
steps. The first step specifies the similarity level between
the documents. The second step defuzzifies this similarity
value. In accordance with the Mamdani system, the max-finding
method is used to specify the similarity level between the
documents.
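
The inference pipeline of this section can be sketched in a few lines. The rule outputs are transcribed from Table 1; everything else is an assumption made for illustration: the membership functions are simple ramps and a triangle on [0, 1] (the exact shapes in Figures 5-7 are not recoverable from the text), rule strength uses min, aggregation uses max, and the final crisp value is obtained with centroid defuzzification, a common Mamdani choice that stands in for the max-finding step described above.

```python
from itertools import product


def low(v):
    return max(0.0, min(1.0, (0.5 - v) / 0.5))


def medium(v):
    return max(0.0, 1.0 - abs(v - 0.5) / 0.5)


def high(v):
    return max(0.0, min(1.0, (v - 0.5) / 0.5))


MEMBERSHIP = {"H": high, "M": medium, "L": low}

# Table 1, keyed by (main_concept, detailed_concept, main_edge) -> similarity.
# product() enumerates the rows in the same order as rules 1-27.
RULES = dict(zip(product("HLM", "HML", "HML"),
                 "HHHMHHMMH" "LMMLLMLLL" "MHHMMMLMM"))


def infer(main_concept, detailed_concept, main_edge, steps=100):
    """Return a crisp final similarity in [0, 1] for three input similarities."""
    inputs = (main_concept, detailed_concept, main_edge)
    strength = {"H": 0.0, "M": 0.0, "L": 0.0}
    for labels, output in RULES.items():
        # AND of the three antecedents: minimum of the membership degrees.
        fire = min(MEMBERSHIP[l](v) for l, v in zip(labels, inputs))
        # Rules with the same consequent are combined with maximum.
        strength[output] = max(strength[output], fire)
    numerator = denominator = 0.0
    for k in range(steps + 1):                  # sample the output universe
        z = k / steps
        mu = max(min(strength[o], MEMBERSHIP[o](z)) for o in strength)
        numerator += z * mu
        denominator += mu
    return numerator / denominator if denominator else 0.0
```

Calling `infer(s1, s2, s3)` for every document pair, with the three values taken from $S_1$, $S_2$ and $S_3$, would fill the final similarity matrix that the clustering step consumes.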

B. Document Clustering
After the final similarity matrix between documents has been
computed, clustering is performed with a bottom-up hierarchical
clustering algorithm. The algorithm proceeds through the
following steps [22]:
- Find the maximum value $S_{ij}$ in the final similarity
matrix and group documents i and j into a new cluster.
- Calculate the relationship between the new cluster and the
other documents.
- Return to the first step, until only one cluster is left.

Figure 9 shows an example of bottom-up hierarchical
clustering.
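
The three steps above can be sketched as follows. The text does not say how the relationship between a new cluster and the remaining documents is computed, so average linkage (the mean pairwise similarity between members) is assumed here; the function name and the sample matrix are illustrative.

```python
def hierarchical_clustering(sim):
    """Bottom-up clustering on a symmetric similarity matrix (list of lists).

    Returns the merge history as (members_a, members_b, similarity) tuples.
    """
    clusters = {i: [i] for i in range(len(sim))}
    next_id = len(sim)
    merges = []

    def cluster_sim(a, b):
        # Average linkage (assumption): mean similarity over all member pairs.
        pairs = [(p, q) for p in clusters[a] for q in clusters[b]]
        return sum(sim[p][q] for p, q in pairs) / len(pairs)

    while len(clusters) > 1:
        ids = sorted(clusters)
        # Step 1: find the most similar pair of clusters.
        a, b = max(((i, j) for n, i in enumerate(ids) for j in ids[n + 1:]),
                   key=lambda pair: cluster_sim(*pair))
        merges.append((clusters[a], clusters[b], cluster_sim(a, b)))
        # Step 2: merge them; similarities to the new cluster are
        # recomputed on demand by cluster_sim.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


final_similarity = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
history = hierarchical_clustering(final_similarity)
```

Cutting the merge history at a chosen similarity level (or after a fixed number of merges) yields a flat set of clusters.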

VI. EVALUATION OF THE PROPOSED METHOD
The results and the experimental document collection of [25]
are used to evaluate the proposed method. A framework for
ontology-based document clustering is presented in [25], in
which clustering combines the vector space method with the
concepts in the ontology. Computer, wine and pizza ontologies
are used. Two hundred and fifty documents were selected from
the three domains of pizza, wine and computer: 106 documents
belong to the computer domain, 64 to the pizza domain and 80 to
the wine domain. Precision, recall, F-measure, accuracy and
error are used as evaluation criteria. The method proposed in
this research is compared with the method of [25] and with the
Naïve Bayes method.




Let FC be the number of documents that do not belong to
category $C_i$ but have been clustered in this category by
mistake, TC the number of documents that belong to $C_i$ and
have been clustered in it, MC the number of documents that
belong to $C_i$ but have been clustered in other categories by
mistake, and MM the number of documents that do not belong to
$C_i$ and have been clustered in other categories. Then the
above criteria are given by formulas 9-13, respectively:

















$$Precision = \frac{TC}{FC + TC} \tag{9}$$

$$Recall = \frac{TC}{MC + TC} \tag{10}$$

$$F_1\text{-measure} = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{11}$$

$$Accuracy = \frac{TC + MM}{TC + FC + MC + MM} \tag{12}$$

$$Error = \frac{FC + MC}{TC + FC + MC + MM} \tag{13}$$

Precision and recall express different aspects of the results.
Precision evaluates the accuracy of clustering, and recall
evaluates its completeness. To consider both criteria
simultaneously, the $F_1$-measure is used [25].

Figure 9- Bottom-up hierarchical clustering
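
Formulas 9-13 translate directly into code. The sketch below computes all five criteria for one category from the four counts defined above; the function name and the sample counts are illustrative, and the counts are assumed to be such that no denominator is zero.

```python
def clustering_metrics(tc, fc, mc, mm):
    """Evaluation criteria for one category C_i, following formulas 9-13."""
    precision = tc / (fc + tc)                            # (9)
    recall = tc / (mc + tc)                               # (10)
    f1 = 2 * precision * recall / (precision + recall)    # (11)
    total = tc + fc + mc + mm
    accuracy = (tc + mm) / total                          # (12)
    error = (fc + mc) / total                             # (13)
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": accuracy, "error": error}


# Illustrative counts only, not taken from the paper's experiments.
metrics = clustering_metrics(tc=8, fc=2, mc=2, mm=8)
```

Accuracy and error always sum to one, since together their numerators cover all four counts.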
Figures 10-14 show the results of the three clustering methods
and compare them on the criteria above. The diagrams show that
the proposed method achieves higher values than the other
methods on these criteria, and that its error rate is lower.
VII. CONCLUSION AND FUTURE WORK
This paper offers a new framework for document clustering.
Within this framework, a new method for the ontological
representation of documents and a similarity measure
appropriate for this representation are proposed. A fuzzy
inference system with three inputs and one output is used to
estimate the similarity level between documents. The evaluation
results show that this method performs better than the methods
it was compared with. Fuzzy clustering of documents and
improving the ontological representation of sentences through
conceptual analysis can be studied in the future.





















Figure 10- Evaluation with the precision criterion
Figure 11- Evaluation with the recall criterion
Figure 12- Evaluation with the F-measure criterion
Figure 13- Evaluation with the accuracy criterion
Figure 14- Evaluation with the error criterion

REFERENCES

[1] Kh. Shaban, A Semantic Graph Model for Text Representation
and Matching in Document Mining, Doctoral thesis, University of
Waterloo, Ontario, Canada, 2006.
[2] Aas, K., and Eikvil, L., Text categorisation: A survey, Technical
Report 941, Norwegian Computing Center, 1999.
[3] Berry, M. W., Dumais, S. T., and O'Brien, G. W., Using Linear
Algebra for Intelligent Information Retrieval, SIAM Review
37(4), pp. 573-595, 1995.
[4] Salton, G., and Mcgill, M. J., Introduction to Modern
Information Retrieval, McGraw-Hill, 1984.
[5] Salton, G., Wong, A., and Yang, C., A vector space model for
automatic indexing, Communications of the ACM, 18(11),
pp.613-620, 1975.
[6] Yang, Y., and Pedersen, J., A Comparative Study on Feature
Selection in Text Categorization, In Proceeding of the 14th
International Conference on Machine Learning, ICML, pp. 412-
420, Nashville, TN, 1997.
[7] Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., and Miller,
K., Introduction to WordNet: An On-line Lexical Database,
Cognitive Science Laboratory, Princeton University, 1993.
[8] Porter, M. F., An algorithm for suffix stripping, Program, 14(3),
pp. 130-137, 1980.
[9] Suen, C., N-gram statistics for natural language understanding
and text processing, IEEE Transactions on Pattern Analysis and
Machine Intelligence, 1(2), pp.164-172, 1979.
[10] Martinez, A. R., and Wegman, E. J., Text Stream
Transformation for Semantic-Based Clustering, Computing
Science and Statistics, 34, 2002 Proceedings.
[11] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer T. K.,
and Harshman, R., Indexing by Latent Semantic Analysis,
Journal of the American Society for Information Science, 1990.
[12] Hasan, M., Matsumoto, Y. Document Clustering: Before and
After the Singular Value Decomposition, Sapporo, Japan,
Information Processing Society of Japan (IPSJ-TR: 99-NL-134.)
pp. 47-55, 1999.
[13] Ljungstrand, P., and Johansson, H., Intranet indexing using
semantic document clustering, Master Thesis, Department of
Informatics, Göteborg University, 1997.
[14] Wong, YukWah and Raymond Mooney, Learning synchronous
grammars for semantic parsing with lambda calculus, In
Proceedings of the 45th Annual Meeting of the Association for
Computational Linguistics, 2007.
[15] Zettlemoyer, Luke S. and Michael Collins, Learning to map
sentences to logical form: Structured classification with
probabilistic categorial grammars, In Proceedings of UAI-05,
2005.
[16] He, Yulan and Steve Young, Spoken language understanding
using the hidden vector state model, Speech Communication
Special Issue on Spoken Language Understanding in
Conversational Systems, 48(3-4), 2006.
[17] B. André Solheim, K. Vågsnes, Ontological Representation of
Texts and its Applications in Text Analysis, Master Thesis,
Agder University College, 2003.
[18] Nirenburg, Sergei and Victor Raskin, Ontological Semantics,
MIT Press, 2004.
[19] Hovy, Eduard, Mitchell Marcus, Martha Palmer, Lance
Ramshaw, and Ralph Weischedel, Ontonotes: The 90% solution,
In Proceedings of HLT-NAACL 2006.
[20] S. Muresan, Learning to Map Text to Graph-based Meaning
Representations via Grammar Induction, Creative Commons
Attribution-Noncommercial-Share, 2008.
[21] Hammouda, K., and Kamel, M. Phrase-based document
similarity based on an index graph model, In Proceedings of the
2002 IEEE Int'l Conf. on Data Mining (ICDM'02), 2002.
[22] J. C. Trappey, Charles V. Trappey,Fu-Chiang Hsu, and David
W. Hsiao, A Fuzzy Ontological Knowledge Document
Clustering Methodology, IEEE TRANSACTIONS ON
SYSTEMS, CYBERNETICS, VOL. 39, NO. 3, JUNE 2009.
[23] http://www.onelook.com/reverse-dictionary.shtml
[24] E. H. Mamdani, Application of fuzzy algorithm for control of
simple dynamic plant, Proc. Inst. Elect. Eng., vol. 121, no. 12,
pp. 1585-1588, 1974.
[25] Yang, X.-q., Sun, N., Zhang, Y., and Kong, D.-r., General
Framework for Text Classification based on Domain Ontology,
In SMAP '08: Proceedings of the 2008 Third International
Workshop on Semantic Media Adaptation and Personalization,
IEEE Computer Society, Washington DC, USA, pp. 147-152,
2008.



