
Applied Soft Computing 53 (2017) 181–204

Contents lists available at ScienceDirect

Applied Soft Computing


journal homepage: www.elsevier.com/locate/asoc

A web page distillation strategy for efficient focused crawling based
on optimized Naïve Bayes (ONB) classifier
Ahmed I. Saleh a,∗, Arwa E. Abulwafa a, Mohammed F. Al Rahmawy b
a Dept. of Computer Eng. & Systems, Faculty of Engineering, Mansoura University, Mansoura, Egypt
b Dept. of Computer Science, Faculty of Computers, Mansoura University, Mansoura, Egypt
∗ Corresponding author. E-mail address: aisaleh@yahoo.com (A.I. Saleh).

a r t i c l e   i n f o

Article history:
Received 15 September 2015
Received in revised form 1 September 2016
Accepted 18 December 2016
Available online 3 January 2017

Keywords:
Web page classification
Focused crawling
Domain ontology
Support vector machines
Naïve Bayes
Genetic algorithm

a b s t r a c t

The target of a focused crawler (FC) is to retrieve pages related to a specific domain of interest (DOI). However, FCs may be hasted if bad links are injected into their crawling queue; hence, they will be gradually skewed away from their DOI. This paper introduces an effective modification of the behavior of FCs by adding a domain distiller. Hence, before passing the retrieved page to the indexer or embedding its links into the crawling queue, the page must pass through a domain distiller. The proposed domain distiller relies on an Optimized Naïve Bayes (ONB) classifier, which combines naïve Bayes (NB) and Support Vector Machines (SVM). Initially, a genetic algorithm (GA) is used to optimize the soft margins of SVM. Then the optimized SVM is employed to eliminate the outliers from the available training examples. Next, the pruned examples are used to train the traditional NB classifier. Moreover, ONB employs word sense disambiguation (WSD) to identify the accurate sense of each domain keyword extracted from the input page. This is accomplished by using a proposed domain ontology, which is called Disambiguation Domain Ontology (D2O). ONB has been tested against recent classification techniques. Experimental results have proven the effectiveness of ONB as it introduces the maximum classification accuracy. Also, results indicate that the proposed distiller improves the performance of focused crawling in terms of crawling harvest rate.

© 2016 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.asoc.2016.12.028

1. Introduction

Due to the explosive growth in the field of Internet and computers, millions of web pages are daily added to the Internet. Accordingly, searching the web becomes a true challenge [1]. Search engines (SEs) are information retrieval systems, which are designed mainly to help users find what they need. A crawler is one of the main components of a SE, which is used to retrieve web pages and then pass them to the indexer. In spite of their effectiveness, general-purpose search engines suffer from low precision and recall, the freshness problem, poor retrieval rate, time consumption due to the long list of results, and storage problems caused by the huge amount of expanded information. To overcome those problems, more specialized (vertical) search engines (VSE), which are called domain-specific search engines [2], have been introduced. The aim of VSEs is to reply to users' queries in a specific domain of interest (DOI), and accordingly, they have a special type of crawlers called focused crawlers [3]. A VSE offers a good solution for the general-purpose search engines' limitations, as it covers only the portion of the web related to its DOI. On the other hand, VSEs can easily provide more precise results and more customized functions. However, building an accurate VSE is also a true challenge as the web is full of noisy and volatile materials.

Focused crawling aims to download only pages related to a specific DOI. The web is then divided into a set of interconnected domains, and then one (or more) crawler(s) is (are) allowed to discover each domain. Hence, indexing the web can be done in parallel; and accordingly, more web coverage can be easily achieved. However, the area of focused crawling still has many challenges and unsolved problems. One of the effective problems that harm the focused crawling efficiency is the Haste Problem. Such a problem takes place when bad links are injected into the crawling queue of the focused crawler. After retrieving the pages behind those bad links, more bad links are added to the crawling queue, causing the focused crawler to be skewed away from its DOI. The main cause of the haste problem is that traditional focused crawlers rely on estimation. Hence, they estimate whether the page is related to the DOI or not before actually retrieving the page.


Although recent focused crawlers implement accurate estimation techniques, relevancy estimation is not always accurate. To overcome such a problem, web mining techniques can be applied to calculate a true relevancy score that is based on the actual page's contents rather than such inaccurate estimation. To accomplish such an aim, a domain distiller can be employed to calculate the relevancy of the page after actually retrieving it. Then the decision to pass the retrieved page to the search engine's indexer or to add the page's embedded links to the crawling queue is based on the distiller's decision.

Web page classification can be defined as the assignment of a web page to one (or more) predefined classes. It is often posed as a supervised learning problem. Hence, a set of training examples is used to train the classifier by setting its classification rules, which can be applied to classify future examples. Based on the number of the employed classes, classification can be binary or multi-class. Binary classification categorizes items into exactly one of two classes, while multi-class classification employs more than two classes. Web page domain distillers are binary classifiers that are designed to take a decision whether an input web page is related to a specific DOI. Several classification techniques can be employed in domain distillers such as; support vector machines (SVM) [4], k-nearest Neighbor (KNN) [5], decision trees [6], neural networks [7], and Bayesian classifiers [8]. However, binary classification is still a challenge.

In the current era, it has been found that results returned by search engines are not what is actually needed. This happens because, while querying for certain data, there is a likelihood that the query contains ambiguous (polysemous) words, which have multiple meanings. Word Sense Disambiguation (WSD) [9], which tries to assign a unique sense to a word, is an important area of NLP [10]. Generally, WSD can be employed to promote the classification performance as it can effectively solve the classifier confusion. Several techniques can be used to implement WSD. One technique is based on the collocation of other words, in which nearby words are used to provide consistent clues to the sense of a target word [9]. Another technique is the word sense based on discourse, in which the sense is consistent within any given document.

The originality of this paper is concentrated in introducing a new architecture for focused crawling by integrating evidence from Machine Learning and Web Mining. The paper introduces an effective modification of the behavior of focused crawlers by employing a domain distiller to decide whether the retrieved page is related to the crawler's DOI, and then take a decision to index the page and add its links to the crawling queue accordingly. The proposed domain distiller combines SVM and NB classifiers in a new instance called the Optimized Naïve Bayes (ONB) classifier. Initially, a genetic algorithm (GA) is used to optimize the soft margins of SVM. Then, the optimized SVM is employed to eliminate the outliers from the available training examples. Next, the pruned examples are used to train the traditional NB classifier. Furthermore, in order to guarantee an effective classification task, WSD has been implemented to create innovative features to perfectly represent the input page for the classification. With the help of WSD, a set of specially selected ambiguous domain keywords, which is called the Confusion Set (CS), is identified. Then, the sense of each ambiguous keyword is identified based on a pre-stored collection of discriminative keywords (for each ambiguous keyword), called Partners. CS is the subset of ambiguous domain keywords that most likely confuse the classifier. ONB employs a proposed domain ontology for both mapping domain keywords to the corresponding concepts as well as implementing WSD. Hence, it is called the Disambiguation Domain Ontology (D2O). ONB has been tested against recent classification techniques. Experimental results have proven the effectiveness of ONB as it introduces the maximum classification accuracy. Also, results indicate that the proposed distiller improves the performance of focused crawling in terms of crawling harvest rate. This paper is organized as follows; section 2 introduces the background and basic concepts, section 3 illustrates the effective focused crawling that combines the traditional focused crawling with the domain distiller, section 4 presents the previous efforts in the area of web page classification, section 5 introduces the employed disambiguation domain ontology (e.g., D2O), section 6 illustrates in details the proposed ONB classifier, section 7 presents the performance analysis and experimental results, while section 8 summarizes our conclusions.

2. Background and basic concepts

In this section, an explanation about traditional focused crawlers as well as the crawling haste problem will be introduced. Then, a brief introduction to word sense disambiguation (WSD) will be given.

2.1. Web search engines

Search engines are the most popular search tools for finding the required information on the web. A typical search engine consists of five basic components, which are; (i) a crawler, which may be focused or unfocused, (ii) an indexer, (iii) a database, (iv) a query manager, and (v) a user interface. Hence, the user sends his query to the search engine in the form of search keywords. Then the search engine retrieves the pages relevant to the user's query from the database. Finally, a ranked list of pages is presented to the user. Search engines rely on crawlers to traverse the web [11]. Crawlers, which may be focused or unfocused, collect pages, pass them to the indexer, and then follow links from one page to another [12]. The indexer, on the other hand, analyzes the page and stores its features in the database.

2.2. Focused crawlers and the haste problem

A focused crawler [13] is a special type of crawler that retrieves pages related to a specific topic or a domain of interest based on both content and link structures [14]. As illustrated in Fig. 1, the focused crawler operates in five steps; initially, its priority crawling queue is initialized with a number of seed pages [15], which are highly relevant pages that are manually chosen. Then, the focused crawler fetches the link located at the head of its priority queue and retrieves the corresponding page. In the third step, it analyzes the page (using parsers to extract keywords and links). Fourth, the focused crawler assigns a score for each link in the processed page based on several criteria such as the link position in the page, the link's anchor window, and/or the page's rank. Those extracted links, after scoring them, are injected into the crawling queue. Finally, the crawler sorts the links in its queue so that the links with higher scores appear at the queue head; hence, they will be processed first. A focused crawler will continue operating as long as its queue has URLs for processing. This procedure ensures that the crawler moves towards relevant pages with the assumption that relevant pages tend to be neighbors to each other. However, focused crawlers are very sensitive to the quality of the seeds initially injected into their queue. Hence, falsely chosen seeds will dramatically affect the crawler performance. Moreover, an inaccurate link scoring strategy badly impacts the crawler behavior in the future crawling cycles. Hence, more and more bad links, which are extracted from the retrieved low quality pages, will be added to the crawling queue. As a result, the crawler may be involuntarily skewed away from its main target, which is retrieving high quality pages relevant to a specific domain. We call such skew the crawling Haste Problem.

Fig. 1. The Structure of a Typical Focused Crawler (TFC).

Table 1
Types of Word Sense Disambiguation.

Type of ambiguity | Definition | Example
Homographs | Words with same spelling, same pronunciation but either same or different meaning | minute (extremely small; measure of time)
Homonyms | Words with same spelling, same pronunciation but different meaning | rose (flower; past tense of the verb rise)
Heteronyms | Words with same spelling but different pronunciation and different meaning | dove (bird; past tense of the verb dive)

2.3. Word sense disambiguation

Human language is fairly ambiguous; hence, numerous words can be portrayed in several different ways based on the context in which they occur. Most of the words in natural languages are polysemous as they have multiple possible meanings or senses. Word sense disambiguation (WSD) is defined as the task of identifying the appropriate sense of an ambiguous word in a context. In the English language, there exists a variety of ambiguity types; Table 1 illustrates the most famous ones.

WSD typically involves two main tasks; (1) determining the different possible senses (or meanings) of each word, and (2) tagging each word of a text with its appropriate sense. The former task, that is, the precise definition of a sense, is still a challenge within the Natural Language Processing (NLP) community. Recently, the most used Sense Repository is WordNet [16]. On the other hand, the second task (e.g., tagging each word with the appropriate sense) involves the development of a system capable of tagging polysemic words in running text with sense labels. The WSD community classifies these systems into two main general categories, namely; knowledge-based and corpus-based. Although both categories build a representation of the examples to be tagged using some previously collected information, they differ in the source of this collected information. Knowledge-based methods obtain the information from external knowledge sources such as Machine Readable Dictionaries (MRDs) and/or lexico-semantic ontologies. On the contrary, in corpus-based methods the information is gathered from contexts of previously annotated instances (e.g., examples) of the word.

3. Effective focused crawler

Usually, the crawling queue of the focused crawler is initially fed with high quality pages (e.g., seeds), which are highly related to the DOI. Also, the focused crawler maintains an efficient link weighting strategy. However, some of the retrieved pages may be obsolete as traditional focused crawlers suffer from the Haste Problem. This happens because the focused crawler is a blind entity. Its operation relies mainly on predictions. A focused crawler retrieves a page if it predicts that it is a good page. But what is the situation if the crawler's prediction is inaccurate? According to traditional focused crawling, the links extracted from any retrieved page are weighted, and then injected into the crawling queue. If such a retrieved page is an irrelevant one (e.g., the page is not related to the domain of interest), bad links will be injected into the crawling queue, causing the crawler to skew from its main target of retrieving high quality pages that are related to a specific domain. To go around such a problem, as illustrated in Fig. 2, a domain distiller can be used to guide the crawler operation and compensate for its blindness.

Hence, before passing the retrieved page to the indexer or adding the page's links to the crawling queue, the page is passed to a domain distiller to decide whether it is relevant to the DOI. Only good pages (e.g., pages that are highly related to the DOI) are passed to the indexer. The decision here is taken according to the page's contents; hence, it is an accurate decision.

4. Related work

The main contribution of this paper is to enhance the performance of focused crawling by using domain distillers. A distiller is a binary classifier that is initially learned with the knowledge of a specific domain in the form of classification rules. Then, it can use such pre-stored rules to discover those web pages that are relevant to the crawler's DOI. In this section, a quick review summarizing the recent work in the area of web page classification, which can be applied in binary or multi-class classification, will be introduced.

Fig. 2. The Structure of Effective Focused Crawler (EFC).
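To make the EFC behavior of Fig. 2 concrete, the following minimal sketch shows how a distiller check could be wired into a priority-queue crawling loop. It is an illustrative outline only, not the authors' implementation; fetch_page, extract_links, score_link, and distiller_accepts are hypothetical placeholders for the components described in Sections 3 and 6.

```python
import heapq

def focused_crawl(seeds, fetch_page, extract_links, score_link, distiller_accepts, max_pages=100):
    """Sketch of an Effective Focused Crawler (EFC) loop with a domain distiller."""
    # Priority queue of (negative score, url); heapq pops the smallest item,
    # so negating the score makes the best-scored link come out first.
    queue = [(-1.0, url) for url in seeds]
    heapq.heapify(queue)
    visited, accepted = set(), []

    while queue and len(accepted) < max_pages:
        _, url = heapq.heappop(queue)
        if url in visited:
            continue
        visited.add(url)
        page = fetch_page(url)                  # retrieve the page behind the link
        if page is None or not distiller_accepts(page):
            continue                            # rejected: neither indexed nor expanded
        accepted.append(url)                    # pass the page to the indexer
        for link in extract_links(page):        # expand only pages judged relevant
            if link not in visited:
                heapq.heappush(queue, (-score_link(page, link), link))
    return accepted
```

The key difference from the typical focused crawler of Fig. 1 is the distiller_accepts gate: the links of a rejected page never enter the queue, which is how the haste problem is contained.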

SVM has been applied to web page classification in [17], in which the original SVM classifier is combined with the BEV (Bagging Ensemble Variation) algorithm to create a new classifier called VOTEM. A web document is assigned to a sub-category based on voting from all category-to-category classifiers. In this work, a hierarchical classification algorithm starts from the top of the hierarchical tree downward recursively until it triggers a stop condition or reaches the leaf nodes. Because of the imbalanced data that decreases the performance of the original SVM classifier, VOTEM is used to provide an improved binary classifier to solve the problem brought by BEV. In [18], a web page classification method using an SVM based on a weighted voting schema has been proposed. The feature vectors are extracted from both the LSA (latent semantic analysis) and WPFS (web page feature selection) methods. LSA can extract common semantic relations between terms and documents. Then, LSA classifies semantically related web pages, and WPFS extracts four text features from the web page content. The category of a web page can be correctly determined by the WPFS. [19] presented an algorithm based on Cost-Sensitive Support Vector Machine (CS-SVM) to improve the classification accuracy. During the training process of CS-SVM, different cost factors are attached to the training errors to generate an optimized hyperplane. Experiments have shown that CS-SVM outperforms SVM on the standard ODP dataset.

In [20], the problem of feature selection has been highlighted; the aim is to find a subset of features for optimal classification. A critical part of feature selection is to rank features according to their importance for classification. [26] has developed a new feature scaling method, called class-dependent-feature-weighting (CDFW), using the naive Bayes (NB) classifier. A new feature scaling method, which is called CDFW-NB-RFE, combines CDFW and recursive feature elimination (RFE). In [21], an auxiliary feature method is proposed. It determines features by an existing feature selection method, and selects an auxiliary feature, which can reclassify the text space aimed at the chosen features. Then, the corresponding conditional probability is adjusted in order to improve classification accuracy.

In [22], a hybrid k-nearest neighbor (KNN) and SVM classifier for multiclass classification of gene expression data has been introduced. This hybrid classifier, which is called HKNNSVM, uses KNN to prune training samples and uses SVM to classify samples. Compared with SVM and KNN, the misclassification rate of HKNNSVM for the datasets was lower, which indicated that the classification performance of HKNNSVM was stable. [23] proposes a discriminant analysis method for categorization of text documents. It categorizes the text by finding coordinate transformations that reflect similarity from data by using generalized singular value decomposition (GSVD). However, the cost of classification is extremely high in document analysis. In [24], an effective re-examination of text categorization approaches via statistical tests using five categorization methods, namely KNN, SVM, NN, NB, and LLSF (Linear Least Squares Fit), has been introduced. Among them, SVM, KNN, and LLSF outperform NB and NN when the number of positive training examples per category is small.

In [25], a hybrid algorithm based on Variable Precision Rough Set (VPRS) is proposed, which combines the strengths of KNN and Rocchio techniques to overcome their weaknesses. Firstly, the feature space of the training data is partitioned using VPRS, and lower and upper approximations of each category are defined. Then KNN and two Rocchio classifiers are built on these new subspaces respectively. The two Rocchio classifiers are used to classify most of the new documents effectively and efficiently. KNN is used to find nearest neighbors of the new document in a subset of the training dataset, which saves time compared with finding nearest neighbors in the whole training dataset. Experimental results indicate that the proposed hybrid algorithm achieves significant performance improvement. [26] proposed a model for text categorization that concentrates on the underlying meaning of words in their context (i.e., it concentrates on learning the meaning of words, identifying and distinguishing between different contexts of word usage). This model can be summarized in the following steps: first, it maps each word in a text document to explicit concepts. Then, it learns classification rules using the newly acquired information; finally, it interleaves the two steps using a latent variable model. The proposed model combines Natural Language Processing techniques such as word sense disambiguation and part of speech tagging with statistical learning techniques such as Naïve Bayes in order to improve the classification accuracy and to achieve robustness with respect to language variations.

In [27], a new text classifier is proposed by integrating the nearest neighbor (NN) and SVM algorithms. The proposed SVM-NN approach aims to reduce the impact of parameters on classification accuracy. In the training stage, SVM is used to reduce the training samples for each of the available classes to their support vectors (SVs). The SVs from different classes are used as the training data of the nearest neighbor classification algorithm, in which the nearest centroid distance function is used to calculate the average distance instead of the Euclidean function, which reduces time consumption. [28] proposes hybrid classifiers involving various two-classifier and four-classifier combinations for two-level text categorization. It shows that the classification accuracy of the hybrid combination is better than the classification accuracies of all the corresponding single classifiers. The constituent classifiers of the hybrid combination operate on different subspaces obtained by semantic separation of data. Experiments show that dividing a document space into different semantic subspaces increases the efficiency of such hybrid classifier combinations.

[29] introduced a new crawler architecture, which is called Treasure-Crawler (TC). In TC, a new methodology that employs specific HTML elements of the input page is used to predict the target topical domain of each page that has an unvisited link inside that input page. Then, only those on-topic pages are sorted based on their relevancy to the crawler's domain of interest for further actual downloads. In TC, a hierarchical structure called T-Graph was employed, which assigns the appropriate priority score to each unvisited link. Then, these URLs will be downloaded later based on their pre-assigned priorities. In [30], an effective focused crawler has been developed, which is called OntoCrawler. It provides a semantic level solution that delivers fast, precise, and stable query results based on ontology-supported website models. Hence, OntoCrawler can benefit both user requests and domain semantics. It has been practically applied on Yahoo and Google search engines to actively search for webpages of related information. Experimental results have shown that OntoCrawler could definitely promote both precision and recall rates of webpage query.

5. Disambiguation domain ontology (D2O)

In this section, a proposed structure for the domain ontology will be presented, which will be used for mapping domain keywords to the corresponding domain concepts as well as achieving keyword disambiguation. Hence, it is called Disambiguation Domain Ontology (D2O). With the aid of WordNet, D2O organizes the considered domain keywords into groups so that each group consists of a set of synonymous keywords that are used to express a unique domain concept. For dimensionality reduction, one keyword from each group is selected to express the underlying concept, which is called the concept's Representative Term (RT). In addition to the Synonymous relation, D2O maintains several relations among the domain keywords that indicate the semantic strength between each domain keyword and the other keywords of D2O.

To the best of our knowledge, all web page classification techniques mainly rely on a bag-of-words representation of the input page. However, employing individual keywords as features may lead to feature ambiguity, which is called the polysemy effect [9]. This happens as some keywords may be shared among several domains and accordingly have several meanings. For illustration, suppose that page X includes the keywords "Java", "Programming", "Computer", while another page Y includes the keywords "Java", "Cup", "Coffee". If the keywords are used directly as the classification features, this will certainly mislead the classifier as it may give the same label to both pages (e.g., classify them to the same class). However, if "Java" is considered as an ambiguous keyword, WSD can be used to disambiguate it into its correct sense, and then classify the page accordingly.

As the nearby keywords can provide strong consistent clues to the sense of the ambiguous keyword, D2O maintains a list of discriminative keywords for each ambiguous domain keyword, which is called the Partners List (PL). It then relies on those partners to sense the correct meaning of the polysemous (ambiguous) keyword. However, selecting those partners is a true challenge, which will be clarified through the next sections. Hence, when classifying a new page P, which includes an ambiguous keyword Kamb, initially PL(Kamb) is identified with the aid of D2O; then, the correct meaning of Kamb is sensed accordingly based on the absence/existence of the keywords of PL(Kamb) in the tested page (e.g., P). Afterward, a decision can be taken whether to include Kamb as a feature during the classification process or not. The next subsections will illustrate in detail the D2O construction, the methodology used to identify the ambiguous keywords, and the procedure followed to elect the partner list for each ambiguous keyword.

5.1. D2O construction

Algorithm 1 shows the procedure for D2O construction, which consists of four steps; (i) Conceptualization, (ii) Concept Weight Calculation, (iii) Inter-Concepts Relationships Assignment, and (iv) Graph Construction.

In the first step of D2O construction (e.g., Conceptualization), the domain's keywords are collected by a domain expert from highly domain related web pages, which can be represented by the set K = {k1, k2, ..., kg}. Then, synonymous keywords are grouped into one cluster, so that each cluster represents a distinct domain concept. Then, the most popular keyword in each cluster is selected to be the concept's Representative Term (RT), while the remaining cluster members are the RT's synonyms. Finally, after conceptualization, the DOI is expressed by a group of concepts, which are represented by the set C = {c1, c2, c3, ..., cn}.

In the second step (e.g., Concept weight calculation), the weight of each domain concept is calculated, denoted as w(ci) ∀i ∈ {1, 2, ..., n}. This weight is calculated using the web pages collected from the considered domain corpus using the Odd Ratio Numerator (OddN) [31] method as given in (1).

w(ci) = OddN(ci) = tpr(ci) · [1 − fpr(ci)]    (1)

where tpr(ci) = tp(ci)/pos and fpr(ci) = fp(ci)/neg.

Here, tpr(ci) is the sample true positive rate of concept ci, and fpr(ci) is the sample false positive rate of concept ci. tp(ci) expresses the true positives of the domain given concept ci, which is the number of positive pages for the domain (e.g., pages that are already related to the DOI) containing the concept ci; fp(ci) expresses the false positives of the domain given concept ci, which is the number of negative pages for the domain (e.g., pages that are not related to the domain) containing the concept ci; pos is the number of positive pages of the domain, and neg is the number of negative pages of the domain.
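As a concrete illustration of Eq. (1), the following short sketch computes OddN concept weights from a labeled corpus. It is only an illustrative reading of the formula above; the corpus format (a list of concept sets per labeled page) is an assumption made for the example.

```python
def oddn_weights(concepts, positive_pages, negative_pages):
    """Concept weights via the Odd Ratio Numerator: w(c) = tpr(c) * (1 - fpr(c)).

    positive_pages / negative_pages: lists of sets, each set holding the
    concepts found in one labeled page of the domain corpus.
    """
    pos, neg = len(positive_pages), len(negative_pages)
    weights = {}
    for c in concepts:
        tp = sum(1 for page in positive_pages if c in page)   # positive pages containing c
        fp = sum(1 for page in negative_pages if c in page)   # negative pages containing c
        tpr = tp / pos if pos else 0.0
        fpr = fp / neg if neg else 0.0
        weights[c] = tpr * (1.0 - fpr)
    return weights

# Toy usage (made-up pages) with the six concept labels A..F used in Section 5.2
pages_pos = [{"A", "B", "E"}, {"A", "B", "C"}, {"B", "E"}]
pages_neg = [{"C", "D"}, {"D", "F"}]
print(oddn_weights(["A", "B", "C", "D", "E", "F"], pages_pos, pages_neg))
```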

In the third step (e.g., Inter-Concepts Relationships Assignment), the relation between every pair of concepts, denoted as r(cx, cy) ∀cx, cy ∈ C, is calculated using the WebOverlap coefficient [32] as illustrated in (2).

r(ci, cj) = WebOverlap(ci, cj) = N(ci ∩ cj) / min(N(ci), N(cj))    (2)

where N(c) is the number of retrieved pages from Google that contain the concept c, and N(cx ∩ cy) is the number of retrieved pages from Google that contain both concepts cx and cy. In the fourth step (e.g., Graph Construction), the domain concepts are arranged in a graph-like structure called the disambiguation domain ontology (D2O), considering the inter-concept relationships between every pair of concepts and the concepts' weights.

5.2. D2O generation: illustrative example

In this section, an illustrative example showing how to construct a simple D2O will be introduced. The task is to generate a new ontology that represents a specific domain by following the proposed procedure illustrated in Algorithm 1.

A. Conceptualization and Calculating the Concepts' Weights

As depicted in Fig. 3, the domain concepts are identified. Then, the Odd Ratio Numerator (OddN) method is used to calculate the weight of each concept w(ci) using (1). The number of positive pages used (e.g., pos) is 10, and the number of negative pages (e.g., neg) is also 10. Those negative pages are the ones that contain several domain concepts but are not related to the domain.

B. Inter-Concepts Relationships Assignment

In this step, the relationship r(cx, cy) between each concept pair is calculated using (2). For illustration, consider the concept pair (A, B). As the query of "A" + "B" appears in 642,000,000 pages in Google, and the counts for the queries of "A" and "B" are 25,270,000,000 and 10,320,000,000 pages respectively, r(A,B) = 642,000,000/min(25,270,000,000, 10,320,000,000) = 642,000,000/10,320,000,000 = 0.06. The calculations continue for the rest of the domain concepts as shown in Fig. 3.

C. Graph Construction

Graph construction is the final step, at which all domain concepts are arranged together in the form of a weighted conceptual graph that represents the domain of interest, as illustrated in the last step of Fig. 3.

5.3. Identifying ambiguous keywords (Confusion set)

In order to solve the polysemy dilemma, after setting up D2O, it is essential to identify the set of keywords that need to be disambiguated. This set of keywords, which is called the Confusion Set (CS), may cause serious problems as it confuses the classification algorithm. As manual identification of CS is a tedious and subjective issue, through this section, an automatic methodology for identifying CS will be illustrated in detail.

WordNet, as a valuable lexical resource, has attempted to model the knowledge of a native speaker of English. However, identifying the ambiguous set by directly employing WordNet has resulted in too many senses for almost all considered domain keywords.
Fig. 3. D2O Generation in the Illustrative Example. (The figure depicts the four steps of the example: (1) filtering the pages' contents, (2) calculating the concept weights, which yields w(A) = 0.3, w(B) = 0.63, w(C) = 0.18, w(D) = 0.2, w(E) = 0.42, and w(F) = 0.1, (3) calculating the inter-concept relationships from the pairwise retrieved-page counts, e.g., r(A,B) = 0.06, and (4) constructing the weighted conceptual graph.)
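A minimal sketch of the inter-concept relationship computation of Eq. (2), using the page counts quoted for the pair (A, B) above; the count values are taken from the illustrative example, while the dictionary-based interface is an assumption made for the sketch.

```python
def web_overlap(pair_count, single_counts, ci, cj):
    """WebOverlap coefficient: r(ci, cj) = N(ci AND cj) / min(N(ci), N(cj))."""
    return pair_count / min(single_counts[ci], single_counts[cj])

# Counts from the illustrative example: N(A AND B), N(A), N(B)
single = {"A": 25_270_000_000, "B": 10_320_000_000}
r_ab = web_overlap(642_000_000, single, "A", "B")
print(round(r_ab, 2))   # 0.06, matching the value reported in Fig. 3
```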

Generally, WordNet maintains a semantic relation among keyword senses by grouping keywords into the same semantic domain (Education, Sports, Medicine, etc.). However, too many senses may be considered for the same keyword, which may be harmful and time consuming for the classification task. For illustration, the keyword "bank" has ten senses in WordNet 3.1, in which three of them, namely "bank#1", "bank#3" and "bank#6", are grouped into the same domain label, which is "Economy". On the other hand, "bank#2" and "bank#7" are grouped into the labels "Geography" and "Geology" respectively. Moreover, WordNet also considers the rare senses of a word; for illustration, the word "java" has three senses in WordNet 3.1, hence, "java" is shared between three domains, which are; "Food" with the sense coffee, "Communication" with the sense object oriented programming language, and "Location" with the sense an island in Indonesia. However, the second sense is the most widely used one; hence, as WSD is expensive, it is unnecessary to disambiguate every domain keyword appearing in the tested page. Too much WSD may increase the classifier's risk of overfitting as WSD is not always accurate. From another point of view, WordNet has a subtle distinction between keyword senses, which could be harmful for the classification task. For illustration, WordNet distinguishes between bass (the lowest part in polyphonic music) and bass (the lowest part of the musical range). Hence, relying completely on WordNet for identifying CS will not be a good decision. In this section, a simple but effective methodology will be introduced for identifying the subset of keywords that most likely confuse the classifier, which is called the Fuzzy Based Ambiguity Identification (FBAI) Strategy. As illustrated in Fig. 4, FBAI consists of two sequential phases, which are; (i) Pre-Processing (PP) and (ii) Fuzzy Inference Engine (FIE).

5.3.1. Pre-Processing (PP)

During PP, the task is to calculate three different parameters for the input keyword k ∈ D2O, which are; (i) the Domain Relatedness of k, denoted as DR(k), (ii) the Keyword Popularity, denoted as KP(k), and (iii) the domain Information Content of k, denoted as IC(k). For calculating those parameters for k, initially, Google is queried to retrieve a set of pages that contain k, which are expressed by the set Pages(k). Then, for each retrieved page P ∈ Pages(k), the set of snippets that contain k are identified and combined together to be a representative for the page P. A snippet is defined as the set of keywords around k in the page P; hence, the center of the snippet is the considered keyword k. The snippet has two wings of a fixed length; hence, each wing is represented by the words on one side of k. The collected snippets of k from page P are combined together and used to classify P using the traditional NB classifier either to the domain (class D+) or not (class D−).

This process is continued until all pages ∈ Pages(k) have been classified. Then, the number of pages ∈ D+ as well as the number of pages ∈ D− are calculated, denoted as P+(k) and P−(k) respectively. Moreover, the frequency of k is calculated in all snippets collected from all pages ∈ D+, denoted as TF+(k), and also in all snippets collected from all pages ∈ D−, denoted as TF−(k).
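The snippet collection and counting described above can be sketched as follows. This is only an illustrative outline of the pre-processing step; the tokenization, the wing length of 5, and the classify_nb placeholder are assumptions, not the paper's exact settings.

```python
from collections import Counter

def snippets_around(tokens, keyword, wing=5):
    """Collect the windows of `wing` tokens on each side of every occurrence of `keyword`."""
    windows = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            windows.append(tokens[max(0, i - wing): i + wing + 1])
    return windows

def preprocess_keyword(keyword, pages_tokens, classify_nb, wing=5):
    """Return P+(k), P-(k), TF+(k), TF-(k) for one keyword.

    pages_tokens: list of token lists, one per retrieved page in Pages(k).
    classify_nb:  callable deciding whether a combined snippet list belongs
                  to the domain (True -> D+) or not (False -> D-).
    """
    p_pos = p_neg = tf_pos = tf_neg = 0
    for tokens in pages_tokens:
        snippets = snippets_around(tokens, keyword, wing)
        combined = [t for window in snippets for t in window]   # representative of the page
        freq = Counter(combined)[keyword]
        if classify_nb(combined):
            p_pos, tf_pos = p_pos + 1, tf_pos + freq
        else:
            p_neg, tf_neg = p_neg + 1, tf_neg + freq
    return p_pos, p_neg, tf_pos, tf_neg
```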

Fig. 4. Identifying Ambiguous Keywords.

Also consider the number of occurrences of k in the considered domain corpus, denoted as N(k), as well as the total number of occurrences of the other domain keywords, denoted as NT. Finally, DR(k), KP(k), and IC(k) are calculated using (1), (2), and (3) respectively.

DR(k) = TF+(k) / TF−(k)    (1)

KP(k) = P+(k) / P−(k)    (2)

IC(k) = −log P(k) = −log (N(k) / NT)    (3)

Definition 1. Keyword Domain Relevancy, denoted as DR(k), measures the degree of relevancy of a keyword k to the Domain of Interest and is defined as the ratio of occurrences of k in the positive pages (e.g., pages related to the domain) to occurrences of k in the negative ones (e.g., pages unrelated to the domain).

Definition 2. Keyword Domain Popularity, denoted as KP(k), measures the popularity of a keyword k in the considered domain against other domains, and is defined as the ratio of positive pages that contain k to the negative ones.

Definition 3. Keyword Domain Information Content, denoted as IC(k), measures k's ability to convey domain information. It measures the amount of domain knowledge gained by the keyword occurrence.

As illustrated in Definition 3, a highly recurring keyword conveys little information due to its ubiquitous use, and then it has a small IC. On the other hand, a rare keyword conveys much more domain information as it tends to be more specialized and conveys more independent meaning, and accordingly it owns a high IC value. Finally, after calculating DR(k), KP(k), and IC(k) ∀k ∈ D2O, the Confusion Set (CS) can be identified through a fuzzy inference process. Generally, the higher the domain relevancy, popularity, and information content of a keyword, the lower its ambiguity level.

5.3.2. Fuzzy inference engine (FIE)

Fuzzy inference is suitable for approximate reasoning as it can be used efficiently in decision making under incomplete or uncertain data. Hence, fuzzy inference can be successfully employed to assign an ambiguation value to a specific keyword. Generally, fuzzy inference can be applied through three sequential steps, which are; (i) fuzzification of inputs, (ii) fuzzy rule induction, and (iii) defuzzification.

a. Fuzzification of Inputs

Three different fuzzy sets, which are DR, KP, and IC, will be considered. During fuzzification, the input crisp values are mapped into grades of membership for the linguistic terms "Low" and "High" of the used fuzzy sets. The employed membership functions for the considered fuzzy sets (e.g., DR, KP, and IC) are illustrated in Fig. 4.
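A short sketch of how the three pre-processing parameters of Eqs. (1)–(3) could be computed from the counts gathered in Section 5.3.1; the small smoothing constant guarding against zero denominators is an assumption added for the sketch, not part of the paper's formulas.

```python
import math

def ambiguity_parameters(tf_pos, tf_neg, p_pos, p_neg, n_k, n_total, eps=1e-9):
    """Domain Relatedness, Keyword Popularity and Information Content of a keyword k."""
    dr = tf_pos / (tf_neg + eps)              # Eq. (1): TF+(k) / TF-(k)
    kp = p_pos / (p_neg + eps)                # Eq. (2): P+(k) / P-(k)
    ic = -math.log((n_k + eps) / n_total)     # Eq. (3): -log(N(k) / NT)
    return dr, kp, ic

# Example: a keyword seen 30 times in positive snippets vs. 10 in negative ones,
# on 12 positive vs. 4 negative pages, and 50 times in a corpus of 10,000 keyword occurrences.
print(ambiguity_parameters(30, 10, 12, 4, 50, 10_000))
```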

Table 2
The used fuzzy rules.

ID | DR | KP | IC | Rule output
1 | L | L | L | L
2 | L | L | H | L
3 | L | H | L | L
4 | L | H | H | H
5 | H | L | L | L
6 | H | L | H | H
7 | H | H | L | H
8 | H | H | H | H

Fig. 5. The used identification function.

b. Fuzzy Rule Induction

For the inference process, a set of fuzzy rules is employed in the form "if (A is X) AND (B is Y) AND (C is Z) THEN (D is M)", where A, B, and C represent the input variables (e.g., DR, KP, and IC), X, Y, and Z represent the corresponding linguistic terms (e.g., Low or High), D represents the rule output, and finally M represents a linguistic term (Low or High). Hence, the output of the fuzzification is the input for the fuzzy rule induction. There are 8 rules, which are listed in Table 2 (where L refers to Low and H refers to High). For illustration, the first rule in Table 2 indicates that IF DR(k) is Low AND KP(k) is Low AND IC(k) is Low THEN the Output is Low.

Generally, there are four fuzzy rule inference methods, namely; max-min, max-product, sum-dot, and drastic product. The max-min method is the one used in this paper. It is based on choosing the min operator for the conjunction in the premise of the rule and for the implication task, while the max operator is used for aggregation. Hence, for w input variables and q states of the output linguistic terms, the max-min inference rule can be expressed as in (4).

μout(x) = MAX x ∈ {1, 2, 3, ..., q} [ MIN ( μinp(1), ..., μinp(w) ) ]    (4)

where the MIN operator implements the implication over the w inputs and the MAX operator implements the aggregation over the q output states.

c. Defuzzification

Defuzzification is accomplished using the output membership function illustrated in Fig. 4. Hence, consider a keyword k whose input parameters are DR(k), KP(k), and IC(k). The output value of the defuzzification process is a crisp value that expresses the Ambiguation Value (AV) of the keyword k, e.g., AV(k). Finally, the decision is taken whether k is ambiguous or not based on a simple rule, which is expressed by the simple step identification function illustrated in Fig. 5. On the other hand, Fig. 6 illustrates a simple example considering a keyword k with the corresponding parameters DR(k), KP(k), and IC(k) being 3, 3.5, and 7 respectively.

5.4. Identifying partner list (PL) for each ambiguous keyword

After the domain CS has been identified, the task is to choose PL(k) ∀k ∈ CS. A partner of a keyword k is a nearby domain keyword that provides a strong consistent clue to the sense of k with respect to the domain of interest (DOI). Hence, a good way is to choose the closest keywords to the ambiguous keyword k in D2O to represent the partners of k (e.g., PL(k)). To accomplish such an aim, as depicted in Fig. 7, a circle centered at k is drawn over D2O with a given radius, which is called the Partner Circle (PC). Hence, the radius represents the number of hops that separates k from the farthest partners in PL(k). However, not all the keywords inside PC are considered as partners of k. Only those keywords that are related to k with an acceptable strength are considered as partners. Hence, we rely on the relation strengths among k and those keywords inside PC(k) that were previously calculated during D2O construction. A threshold strength value is assumed; then any keyword M is considered as a partner of k if r(k,M) is not less than this threshold. Also, as illustrated in Fig. 7, if a keyword is rejected from the partner list of a keyword, its directly connected keywords are also rejected.

6. The proposed domain distiller

As shown in Fig. 8, the input for the proposed distiller is a web page to be classified. The distiller then analyzes the page and extracts the existing domain keywords using the Disambiguation Domain Ontology (D2O), which maintains all available domain keywords. However, before mapping the extracted domain keywords to the corresponding concepts, a disambiguation process is applied to the extracted domain keywords to discard those ambiguous keywords which are not related to the domain.

As the target of the proposed distiller is to decide whether the input page is related to the domain of interest or not, the core of our distiller is a binary classifier. However, to perfectly accomplish the distillation process, as illustrated in Fig. 8, the proposed distiller is divided into two sequential modules, which are; the analysis module (AM) and the classification module (CM). During AM, the input page is expressed in a vector space model. On the other hand, CM takes the decision whether to accept or reject the page using an optimized Naïve Bayes classifier. These modules will be explained through the next subsections.

6.1. Analysis module (AM)

The target of AM is to represent the input page in the vector space model. To accomplish such an aim, the domain keywords found in the page are extracted and mapped to the corresponding domain concepts with the aid of the proposed D2O. As illustrated before, if the considered Domain Of Interest (DOI) has k keywords, these keywords are clustered using WordNet. For each cluster, synonymous keywords are grouped and represented by one domain concept (e.g., the underlying concept), which is the most frequently used term of the cluster's terms; hence, it is called the concept's Representative Term (RT), and the remaining cluster's terms are the RT's synonyms.

Considering an input page P, after extracting the page's domain keywords, expressed by the set Keywords(P), ambiguous keywords are identified with the aid of the proposed D2O. Disambiguation is done by discovering the correct sense of the word by looking for the partners of the ambiguous keyword inside the processed page (e.g., P). Considering an ambiguous keyword kamb, the existence of kamb's partners gives a good indication that kamb carries the correct domain sense. Hence, to disambiguate kamb, initially, Partners(kamb) are identified; then the set RES = Keywords(P) ∩ Partners(kamb) is constructed.
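Before continuing with the analysis module, the fuzzy ambiguity scoring of Section 5.3.2 can be sketched as below. Only the rule base of Table 2 and the max-min scheme of Eq. (4) are taken from the text; the linear ramp memberships and their break points are placeholders standing in for the curves of Fig. 4, which are not reproduced here.

```python
# Rule base of Table 2: (DR, KP, IC) linguistic terms -> output term (L = Low, H = High)
RULES = {
    ("L", "L", "L"): "L", ("L", "L", "H"): "L", ("L", "H", "L"): "L", ("L", "H", "H"): "H",
    ("H", "L", "L"): "L", ("H", "L", "H"): "H", ("H", "H", "L"): "H", ("H", "H", "H"): "H",
}

def membership(value, low_end, high_end):
    """Placeholder Low/High memberships: a linear ramp between assumed break points."""
    if value <= low_end:
        return {"L": 1.0, "H": 0.0}
    if value >= high_end:
        return {"L": 0.0, "H": 1.0}
    frac = (value - low_end) / (high_end - low_end)
    return {"L": 1.0 - frac, "H": frac}

def max_min_inference(dr, kp, ic, ranges=((0, 5), (0, 5), (0, 10))):
    """Eq. (4): MIN over each rule's premises (implication), MAX over rules (aggregation)."""
    grades = [membership(v, *rng) for v, rng in zip((dr, kp, ic), ranges)]
    aggregated = {"L": 0.0, "H": 0.0}
    for premise, out in RULES.items():
        firing = min(g[t] for g, t in zip(grades, premise))
        aggregated[out] = max(aggregated[out], firing)
    return aggregated   # to be defuzzified into AV(k) and thresholded (Figs. 4 and 5)

print(max_min_inference(3, 3.5, 7))   # the keyword parameters of the Fig. 6 example
```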

Fig. 6. Illustrative example showing how to calculate the ambiguation value of a keyword using a fuzzy inference system.

Fig. 7. Identifying the partner list of an ambiguous keyword assuming a partner-circle radius of 2 hops.

Finally, the strength of the relations among kamb and all keywords ∈ RES is obtained from D2O, and then the average of the strengths of the identified relations is calculated. Then, the decision whether the sense of kamb is in the DOI label, and hence whether kamb is considered as a domain keyword, is taken based on Rule 1.

Rule 1: Disambiguation Rule
An ambiguous keyword is considered as a domain keyword if the average of the strengths of the relations of the partners of kamb that exist in the processed page is greater than a threshold value. This can be expressed by the following expression:

If [Average(Relation Strength(kamb, k)) ∀k ∈ RES] ≥ threshold Then Sense(kamb) ∈ DOI Else Sense(kamb) ∉ DOI
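A direct reading of Rule 1 as code; the ontology is assumed to expose the pre-computed relation strengths r(kamb, k) as a nested dictionary, and the threshold value is left as a parameter since the paper does not fix it here. The example strengths are made up for illustration.

```python
def disambiguate(k_amb, page_keywords, partners, relation_strength, threshold):
    """Rule 1: keep k_amb as a domain keyword if its partners found in the page
    are, on average, strongly related to it in the D2O ontology."""
    res = page_keywords & partners[k_amb]              # RES = Keywords(P) intersect Partners(k_amb)
    if not res:
        return False                                   # no partner evidence in the page
    avg = sum(relation_strength[k_amb][k] for k in res) / len(res)
    return avg >= threshold                            # Sense(k_amb) in DOI ?

# Toy usage: "java" supported by "programming" and "computer" in the processed page
strengths = {"java": {"programming": 0.7, "computer": 0.5, "coffee": 0.1}}
partners = {"java": {"programming", "computer"}}
print(disambiguate("java", {"java", "programming", "computer"}, partners, strengths, 0.4))  # True
```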

Fig. 8. The Proposed Distiller Framework.

After removing those keywords that are not related to the DOI, the remaining domain keywords are mapped to the corresponding domain concepts using D2O. Afterward, the input page can be represented in the form of a vector of domain concepts. The dimension of that vector equals the number of the considered domain concepts. Hence, if the concept is found in the input page, the corresponding place of that concept in the vector is set to 1; otherwise, it is set to 0. At the end of this stage, an input page can be expressed as a vector of ones and zeros, which is called the Vector Space Model. As our domain has a set of n concepts C = {c1, c2, c3, ..., cn}, the vector of the input page will be of n dimensions.

6.2. Classification module (CM)

This module is a binary classifier, in which pages are classified into only two classes. The processed web page may relate to our DOI or it may not. Related pages are classified to "Class 1", while the others are classified to "Class 2". In CM, a new binary classifier is proposed by integrating an optimized Support Vector Machine (SVM), through the use of a Genetic Algorithm (GA), and Naïve Bayes (NB) algorithms. Although the optimization is done on SVM, such a process also optimizes the behavior of NB. Hence, the new binary classifier, which is the core of the proposed distiller, is called the Optimized NB (ONB) classifier. Initially, ONB rejects outliers by selecting the most informative examples using an optimized SVM (with the aid of GA). Hence, after rejecting outliers, the remaining pages (examples) are used to train the NB classifier. Finally, the decision whether an input page is related to "Class 1" or "Class 2" is taken using the NB classifier for testing. ONB operates in three phases, which are; (i) Outlier Rejection, (ii) Training, and (iii) Testing. More details about those three steps will be introduced in the next sub-sections; they are also depicted in Algorithm 2.

6.2.1. Outlier rejection

This step aims to discard the false examples that may result in constructing wrong classification rules during the next step, which is the classifier training. Surely, constructing wrong rules will badly impact the performance of the distiller, which in turn directs the focused crawler in the wrong direction when embedding such a distiller in an EFC. Rejecting outliers can be accomplished in two steps, which are; (A) SVM Optimization, and (B) Selection of informative pages.

(A) SVM Optimization

Genetic algorithm (GA) is one of the most effective, powerful, and unbiased heuristic search approaches in the area of Artificial Intelligence (AI). It guarantees an optimized solution for a given problem based on several approaches such as mutation, inheritance, and selection. GA has several advantages, which make it one of the most common search algorithms in AI, such as; (i) GA has the ability to solve any optimization problem based on the chromosome approach, (ii) it can handle multiple solution search spaces as well as perfectly solving the given problem in such an environment, (iii) GA is less complex and more straightforward compared to classical algorithms, (iv) it is easier to be transferred and applied in different platforms, thereby increasing the system's flexibility, (v) GA has the ability to support multi-objective optimization, and accordingly it is a good choice for noisy environments. Moreover, GA has the ability to find a global optimum.

On the other hand, GA differs from traditional search and optimization methods in several significant points, which make it a perfect choice to be implemented in our work, such as; (i) it searches in a parallel manner through the population; accordingly, GA avoids being trapped in a local optimal solution, unlike traditional techniques, which search from a single point, (ii) GA implements probabilistic selection rules, not deterministic ones, (iii) GAs work on the chromosome, which is an encoded clone of the potential solution's considered parameters, rather than the implemented parameters themselves, (iv) GAs employ a fitness score, which is derived from objective functions, with no derivative or auxiliary information, (v) GA requires no explicit expression of the solving model; rather, it needs the fitness function and the related variables, (vi) GA can effectively handle arbitrary types of constraints and objectives, (vii) GA's complexity is almost linearly correlated with the scale of the considered problem; hence, there is no chance of a dimension disaster.

In this step (e.g., SVM Optimization), the genetic algorithm (GA) is used to improve the performance of the traditional SVM. GA is used to optimize a specific parameter of SVM, which is the soft margin parameter C [33].
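Stepping back to the analysis module of Section 6.1, the concept-vector representation it produces can be sketched as follows; the synonym-to-RT mapping shown here is a toy stand-in for the D2O lookup, and the example values are made up.

```python
def page_to_concept_vector(page_keywords, concepts, synonym_to_rt):
    """Binary vector space model: one slot per domain concept (Section 6.1).

    page_keywords: keywords surviving disambiguation (Rule 1).
    concepts:      ordered list of the n domain concepts (their RTs).
    synonym_to_rt: D2O mapping from any domain keyword to its concept's RT.
    """
    found = {synonym_to_rt.get(k, k) for k in page_keywords}
    return [1 if c in found else 0 for c in concepts]

# Toy usage: two concepts, each reachable through a synonym mapped to its representative term
concepts = ["computer", "programming"]
syn = {"pc": "computer", "coding": "programming"}
print(page_to_concept_vector({"pc", "coding", "java"}, concepts, syn))   # [1, 1]
```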

Fig. 9. The region around SVM Hyperplane.

Parameter C is used during the training of SVM and indicates to what extent outliers are taken into account in calculating the support vectors. To implement our proposed approach, this paper uses the linear kernel function for the SVM classifier, as the linear kernel function is the most suitable one for text.

The optimization procedure can be divided into five processes, which are; (i) Population, (ii) Evaluation, (iii) Encoding, (iv) Selection, and (v) Crossover. In the population process, different values for the parameter C are assumed as chromosomes, which are the possible solutions of the problem. Then, in the evaluation process, for each chromosome that represents a parameter C, a fitness function is used to evaluate its fitness (Ft) by using the training dataset to train the SVM classifier and then using a testing dataset to calculate the accuracy, as depicted in (5).

Fti = Acci = Correct Assignments / Total Pages = (Ai + Ci) / (Ai + Bi + Ci + Di)    (5)

where Fti is the fitness value of the ith chromosome, Acci is the accuracy of the SVM classifier when using the ith chromosome, Ai is the number of pages that are assigned correctly when using the ith chromosome, Bi is the number of pages that are assigned incorrectly when using the ith chromosome, Ci is the number of documents that are rejected correctly when using the ith chromosome, and Di is the number of documents that are rejected incorrectly when using the ith chromosome. Then, in the encoding process, the solutions are encoded as binary numbers. On the other hand, during the selection process, the better solutions are selected using the Roulette Wheel Selection (RWS) technique [34], as it is the simplest selection technique to implement. Hence, the probability that each chromosome will be selected is calculated as in (6).

Pi = Fti / Σ i=1..x Fti    (6)

where Pi is the selection probability of chromosome i, x is the number of all chromosomes in the population, and Fti is the fitness value of chromosome i. Then the sum of all probabilities of all chromosomes is calculated as depicted in (7).

SumP = Σ i=1..x Pi    (7)

where SumP is the sum of all probabilities of all chromosomes, x is the number of all chromosomes in the population, and Pi is the selection probability of the ith chromosome. Then a random number (Rn) from the interval (0, SumP) is generated as expressed in (8).

Rn = Rand(0, SumP)    (8)

where Rn is a random number between 0 and SumP, and SumP is the sum of all probabilities of all chromosomes in the population. Then, return again to the considered population to accumulatively sum the probability values from 0 to SumP. While summing, if the accumulated sum reaches a value that is greater than or equal to the random number Rn, the process stops and this chromosome is selected. The crossover process is used to interchange genes between chromosomes to create offspring using the Single Point (SP) technique [35]. In this technique, firstly, again a random number Rn[i] from 0 to SumP is generated. Then, if Rn[i] is smaller than the crossover probability (Pc), the chromosome i is chosen as a parent. The crossover point within a chromosome is chosen randomly, and then the two parent chromosomes are interchanged at that point to produce two new offspring.

(B) Selection of informative pages

SVM has several salient properties that other techniques do not have, such as; (i) it guarantees accurate classification, even if the input data is non-linearly separable, (ii) the classification accuracy does not depend on the expertise of choosing the kernel function (in the case of non-linear input data), (iii) as SVM has only two free parameters, which are the kernel function and the upper bound, it can be easily controlled, (iv) SVM ensures the existence of a global, unique, and optimal solution since its training is equivalent to solving a linearly constrained Quadratic Programming problem, (v) SVM has the ability to adapt its learning characteristic using the kernel function, and it also owns a good ability to adequately classify data even in a high-dimensional feature space with little training data, (vi) SVM works well on real-world applications; moreover, it can be successfully applied to any kind of data since a kernel is available.

In this step, the optimized SVM is used to select the most informative examples from the available ones.
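The GA loop just described (fitness of Eq. (5), roulette wheel selection of Eqs. (6)–(8), and single-point crossover) can be sketched as below. It is a simplified, single-objective outline rather than the authors' implementation: the bit width of the encoding, the decoding range for C, and the use of scikit-learn's SVC as the fitness evaluator are assumptions made for the example.

```python
import random
from sklearn.svm import SVC

def fitness(c_value, X_train, y_train, X_test, y_test):
    """Eq. (5): accuracy of a linear-kernel SVM trained with soft margin C = c_value."""
    model = SVC(kernel="linear", C=c_value).fit(X_train, y_train)
    return model.score(X_test, y_test)

def roulette_select(population, fits):
    """Eqs. (6)-(8): pick one chromosome with probability proportional to its fitness."""
    probs = [f / sum(fits) for f in fits]
    rn, acc = random.uniform(0, 1), 0.0
    for chrom, p in zip(population, probs):
        acc += p
        if acc >= rn:
            return chrom
    return population[-1]

def optimize_c(data, generations=50, pop_size=6, pc=0.8, bits=8, c_max=4.0):
    """GA search for the soft-margin parameter C, encoded as a fixed-width bit string."""
    X_train, y_train, X_test, y_test = data
    decode = lambda chrom: max(chrom / (2 ** bits - 1) * c_max, 1e-3)   # bit string -> C value
    population = [random.randint(0, 2 ** bits - 1) for _ in range(pop_size)]
    for _ in range(generations):
        fits = [fitness(decode(ch), X_train, y_train, X_test, y_test) for ch in population]
        children = []
        while len(children) < pop_size:
            p1, p2 = roulette_select(population, fits), roulette_select(population, fits)
            if random.random() < pc:                       # single-point crossover
                point = random.randint(1, bits - 1)
                mask = (1 << point) - 1
                p1, p2 = (p1 & ~mask) | (p2 & mask), (p2 & ~mask) | (p1 & mask)
            children.extend([p1, p2])
        population = children[:pop_size]
    fits = [fitness(decode(ch), X_train, y_train, X_test, y_test) for ch in population]
    return decode(max(zip(fits, population))[1])           # best C found

```

In the paper's own walk-through (Section 6.3 with Tables 3 and 4), the same loop is run with a population of six chromosomes, 50 generations, and Pc = 0.8.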

Table 3
The first generation.

ID | Chromosome | Binary Encoding | Fitness Value (Ft) | Probability (P) | Cumulative probability (SumP) | Random Number (Rn) | New ID | New Population | Random Number (Rn) | New Population
1 | 0.5 | 00.1 | 0.66 | 0.18 | 0.18 | 0.84 | 5 | 2.5 | 0.28 | 0.5
2 | 1 | 01.0 | 0.71 | 0.19 | 0.37 | 0.02 | 1 | 0.5 | 0.55 | 0.5
3 | 1.5 | 01.1 | 0.69 | 0.18 | 0.55 | 0.75 | 5 | 2.5 | 0.96 | 2.5
4 | 2 | 10.0 | 0.7 | 0.19 | 0.74 | 0.42 | 3 | 1.5 | 0.97 | 1.5
5 | 2.5 | 10.1 | 0.5 | 0.13 | 0.87 | 0.35 | 2 | 1 | 0.16 | 3.5
6 | 3 | 11.0 | 0.5 | 0.13 | 1.0 | 0.88 | 6 | 3 | 0.95 | 3
Total | | | 3.76 | 1.0 | | | | | |

(Columns 1–4 belong to the Population, Evaluation and Encoding steps; columns 5–9 to Selection; the last two columns to Crossover.)

When the distribution of the available training examples between the two classes is inspected, it becomes clear that those examples that are far from the hyperplane are highly related to their classes. This is because the hyperplane separates the two classes; hence, those examples that are close to the hyperplane may be related to both classes, or their assignments may have errors, so some of them may be assigned incorrectly. Accordingly, they are not discriminative examples and may lead to obsolete classification rules during the training phase. A good behavior is to eliminate those ambiguous examples. However, to keep as many training examples as possible, such elimination should be controlled. A simple approach, which is the one followed here, is to eliminate the support vectors as they are the most ambiguous examples, as illustrated in Fig. 9.

6.2.2. Training phase

NB has proven to be one of the most effective classifiers in the area of text classification. It has several advantages such as; (i) NB has a short computational time for training with the assumption of independent features, (ii) it is easy to implement and often has superior performance, (iii) applying NB requires low resources in terms of time and memory during the testing phase, hence, it is suitable for use by focused crawlers that need to take the classification decision on time, and (iv) NB is robust in noisy environments; moreover, it requires a little amount of training data.

During the training phase, the traditional NB classifier is trained using the most informative examples that were selected during the previous phase. Hence, the simple binary NB classifier is employed, which has two target classes represented by the set CL = {cl1, cl2}, where cl1 is the class including those pages related to the DOI, while cl2 is the opposite class. During NB training, the task is to calculate the conditional probabilities P(ci|clj) ∀ci ∈ C, clj ∈ CL, and the classes' prior probabilities P(clj) ∀clj ∈ CL as illustrated in (9) and (10) respectively.

P(clj) = Pgj / Pg    (9)

P(ci|clj) = Ni,j / Nj    (10)

where Pgj is the number of pages related to class clj, Pg is the total number of pages related to all domain classes, Ni,j is the number of occurrences of concept ci in pages of class clj, and Nj is the total number of concepts in pages of class clj.

6.2.3. Testing phase

During the testing phase, the decision whether an input page is related to the DOI or not is taken. If it is related, the page is targeted to cl1. Otherwise, it is targeted to cl2, and hence it is rejected. During the testing phase, initially, the probability that pi is related to each class is calculated (also called the likelihood of being in the class) based on the domain concepts extracted from pi. If pi contains z domain concepts, the likelihood of being in class clj ∀clj ∈ CL can be calculated as expressed in (11). Then, pi is targeted to the class (cltarget) with the maximum calculated likelihood, using (12).

Likelihood(pi|clj) = P(clj) · Π x=1..z P(cx,i|clj)    (11)

Target(pi) = cltarget = arg max clj ∈ CL [ P(clj) · Π x=1..z P(cx,i|clj) ]    (12)
where cl1 is the class including those pages related to DOI, while
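The computations in (9)–(12) reduce to counting and multiplication. A minimal sketch (hypothetical structure, assuming each page is already reduced to its list of domain concepts; the floor for unseen concepts and the log-space scoring are implementation conveniences not stated in the text) could look like the following:

```python
from collections import Counter
from math import log

def train_nb(pages, labels):
    """pages: list of concept lists; labels: class names such as 'cl1'/'cl2' (eqs. 9 and 10)."""
    classes = set(labels)
    prior = {cl: sum(1 for l in labels if l == cl) / len(labels) for cl in classes}  # P(clj) = Pgj / Pg
    counts = {cl: Counter() for cl in classes}
    totals = {cl: 0 for cl in classes}
    for concepts, cl in zip(pages, labels):
        counts[cl].update(concepts)          # Nij: occurrences of each concept in class clj
        totals[cl] += len(concepts)          # Nj: total concepts seen for class clj
    cond = {cl: {c: n / totals[cl] for c, n in counts[cl].items()} for cl in classes}  # P(ci|clj)
    return prior, cond

def classify(page_concepts, prior, cond, unseen=1e-9):
    """Return the class with the maximum likelihood (eqs. 11 and 12); logs avoid underflow."""
    best, best_score = None, float("-inf")
    for cl in prior:
        score = log(prior[cl]) + sum(log(cond[cl].get(c, unseen)) for c in page_concepts)
        if score > best_score:
            best, best_score = cl, score
    return best
```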

6.3. SVM optimization: illustrative example

The optimization procedure can be divided into five steps: (i) Population, (ii) Evaluation, (iii) Encoding, (iv) Selection, and (v) Crossover. The following assumptions are made:

No. of chromosomes (size of population) = 6
No. of generations = 50
Crossover probability (Pc) = 0.8

(i) Population

In this step, different values of the parameter C are assumed as chromosomes, as shown in Table 3, column 2.

(ii) Evaluation

In this step, the fitness value (Ft) of each chromosome is calculated using the fitness function: the SVM classifier is trained on the training dataset, and the testing dataset is used to calculate the accuracy, which serves as the fitness value as in (5), as illustrated in Table 3, column 4.
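As a rough illustration of the evaluation step (hypothetical helper names; assuming scikit-learn and a pre-split train/test set), the fitness of each candidate C value can be obtained as follows:

```python
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def fitness(c_value, X_train, y_train, X_test, y_test):
    """Fitness of one chromosome = accuracy of a linear SVM trained with soft margin C."""
    svm = SVC(kernel="linear", C=c_value)
    svm.fit(X_train, y_train)
    return accuracy_score(y_test, svm.predict(X_test))

population = [0.5, 1, 1.5, 2, 2.5, 3]   # the six chromosomes of Table 3
# fitness_values = [fitness(c, X_train, y_train, X_test, y_test) for c in population]
```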


Table 4
The second generation.
(Columns 1–6: population, evaluation and encoding; columns 7–9: selection; columns 10–11: crossover.)

| ID | Chromosome | Binary Encoding | Fitness Value (Ft) | Probability (P) | Cumulative Probability (SumP) | Random Number (Rn) | New ID | New Population | Random Number (Rn) | New Population |
| 1 | 0.5 | 00.1 | 0.66 | 0.18 | 0.18 | 0.79 | 5 | 3.5 | 0.36 | 3.5 |
| 2 | 0.5 | 00.1 | 0.66 | 0.18 | 0.36 | 0.96 | 6 | 3 | 0.44 | 2 |
| 3 | 2.5 | 10.1 | 0.5 | 0.13 | 0.49 | 0.66 | 4 | 1.5 | 0.96 | 1.5 |
| 4 | 1.5 | 01.1 | 0.69 | 0.19 | 0.68 | 0.04 | 1 | 0.5 | 0.65 | 2.5 |
| 5 | 3.5 | 11.1 | 0.71 | 0.19 | 0.87 | 0.85 | 5 | 3.5 | 0.85 | 3.5 |
| 6 | 3 | 11.0 | 0.5 | 0.13 | 1 | 0.33 | 2 | 0.5 | 0.97 | 0.5 |
| Total | | | 3.72 | 1.0 | | | | | | |

Table 5
The tunable parameters used throughout the experiments.

| Parameter | Assigned value | Description |
| n | 485 | Number of domain concepts (popular terms only). |
| t | 1305 | Number of domain keywords (popular and un-popular terms). |
| CL | 2 | Number of classes: cl1 is related to the DOI, cl2 otherwise. |
| Training web pages | 10000 | From the Web Data Commons dataset. |
| Testing web pages | 500 | |
| Kernel function | Linear | K(xi, xj) = xi · xj |
| C | From 0.5 to 5 | Soft margin parameter |
| x | 100 | Number of chromosomes of the population |
| Pc | 0.8 | Crossover probability |
| No. of generations | 50 | Number of generations or iterations. |
| | 2 | Number of hops that separates keyword k from the farthest partners in its Partner List PL(k). |
| | 0.2 | Threshold relation strength between the keyword K and its partners. |
| | 100 | Number of retrieved pages that contain keyword K from Google. |
| | 50 | Length of the wing of a snippet in page P that contains keyword K. |
Fig. 10. Data structure used to implement D2O.
(iii) Encoding

In this step, the chromosomes are encoded in a binary representation, as illustrated in Table 3, column 3.

(iv) Selection

In this step, the Roulette Wheel Selection (RWS) technique is used. First, the probability (P) that each chromosome will be selected is calculated using (6), as in Table 3, column 5; then the cumulative sum of the probabilities (SumP) is calculated using (7); then a random number (Rn) in the interval (0, SumP) is generated using (8), as in Table 3, column 8. The population is then traversed again while accumulating the probabilities, and once the cumulative sum reaches the random number (SumP >= Rn), the traversal stops and that chromosome is selected. This is repeated 6 times, so the new population is as shown in Table 3, column 9.

(v) Crossover

As illustrated in Algorithm 2, a random number between 0 and 1 is first generated for each chromosome, as shown in Table 3, column 10 (Rn[i], where i = 1, 2, ..., 6). If Rn[i] is smaller than the crossover probability Pc, chromosome i is selected as a parent chromosome.
Fig. 11. Precision in all used techniques.

Fig. 12. Accuracy of all used techniques.

As the crossover probability Pc = 0.8, the parent chromosomes are chromosomes 1, 2, and 5; hence chromosome 1 is crossed with chromosome 2, chromosome 2 with chromosome 5, and chromosome 5 with chromosome 1. The crossover point is chosen randomly at bit number 2 for the three crossovers, as follows:

Chromosome1 = chromosome1 >< chromosome2 = 10.1 >< 00.1 = 00.1 = 0.5
Chromosome2 = chromosome2 >< chromosome5 = 00.1 >< 01.1 = 00.1 = 0.5
Chromosome5 = chromosome5 >< chromosome1 = 01.1 >< 10.1 = 11.1 = 3.5

Finally, the chromosome population after the crossover process is illustrated in Table 3, column 11, which becomes the population for the next generation of the genetic algorithm. In the next generation, the same steps as in the first generation are followed, and the results are illustrated in Table 4. The population after the crossover process is illustrated in Table 4, column 11, which becomes the next population; the process continues until the assumed number of generations ends. Finally, the fitness values of all chromosomes are calculated, and the one with the highest fitness is selected as the best solution.
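A compact sketch of the selection and crossover steps above (hypothetical helper names; chromosomes are written as plain bit strings, dropping the binary point used in Table 3) might be:

```python
import random

def roulette_select(population, fitness_values):
    """Roulette Wheel Selection: pick one chromosome with probability Fti / sum(Ft) (eqs. 6-8)."""
    total = sum(fitness_values)
    rn = random.uniform(0, total)          # equivalent to drawing Rn in (0, SumP) before normalization
    cumulative = 0.0
    for chrom, ft in zip(population, fitness_values):
        cumulative += ft
        if cumulative >= rn:
            return chrom
    return population[-1]

def one_point_crossover(parent_a, parent_b, point=2):
    """Swap the tails of two binary-encoded chromosomes after the given bit position."""
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]

# e.g., one_point_crossover("101", "001") crosses the encodings of 2.5 and 0.5
```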

Fig. 13. Error of all used techniques.

Table 6
All possible outcomes of a binary classifier.

| Classifier outcome | Description |
| A | Number of documents that are assigned correctly |
| B | Number of documents that are assigned incorrectly |
| C | Number of documents that are rejected correctly |
| D | Number of documents that are rejected incorrectly |

Table 7
Confusion matrix.

| Actual class | Predicted: Positive | Predicted: Negative |
| Positive | A | D |
| Negative | B | C |

Fig. 14. Crawling behavior for ideal and traditional focused crawlers.

7. Performance analysis and implementation

In this section, the proposed distiller, i.e., ONB, which is the core of CM, is evaluated against traditional classification techniques, namely SVM, NB, and KNN, as well as some recent classification techniques, namely Domain Oriented Naïve Bayes (DONB) [36] and Domain Oriented KNN (DOKNN) [36]. For each one, the accuracy, precision, and error are reported.

The Web Data Commons dataset series contains all structured data extracted from the various Common Crawl corpora [37]. Currently, the extraction process considers the data formats Microdata, RDFa, and microformats. The documents of Web Data Commons were not pre-designated as training or testing patterns; hence, some of them were chosen as training and testing subsets: 10000 web sites are randomly selected for training and 500 are used for testing. The parameters used through the next experiments, with the corresponding values, are illustrated in Table 5.

7.1. Performance metrics

Table 6 depicts the possible outcomes of a binary classifier. They are used to measure the performance of the proposed classifier, and Table 7 shows the confusion matrix. The performance metrics that are used are illustrated in (13)–(15) [38].

Precision = P = PagesAssignedCorrectly / TotalAssignedPages = A / (A + B)        (13)

Accuracy = Acc = CorrectAssignments / TotalPages = (A + C) / (A + B + C + D)        (14)

Error = E = IncorrectAssignments / TotalPages = (B + D) / (A + B + C + D)        (15)

7.2. Implementing D2O

To speed up the searching, mapping, and conceptualization processes, D2O is implemented using a database procedure in the form of three related tables, as illustrated in Fig. 10. The first table stores the representative terms used for the considered domain concepts, while the second table stores the synonym terms corresponding to each of the representative terms. The last table stores the partners of each ambiguous keyword as well as the relations among the keyword and the other domain keywords, if they exist. This table is used mainly for keyword disambiguation. The Partners field in the Relation-Partner table is considered as a flag that is set to 1 if the corresponding related keyword is a partner; otherwise its value is set to 0.
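As a rough sketch of such a three-table layout (hypothetical table and column names; the exact schema of Fig. 10 is not spelled out in the text), the structure could be created as follows:

```python
import sqlite3

# Hypothetical schema mirroring the three related tables described above.
ddl = """
CREATE TABLE Concept (
    concept_id          INTEGER PRIMARY KEY,
    representative_term TEXT NOT NULL            -- representative term of a domain concept
);
CREATE TABLE Synonym (
    synonym_term TEXT NOT NULL,                  -- synonym of a representative term
    concept_id   INTEGER REFERENCES Concept(concept_id)
);
CREATE TABLE RelationPartner (
    keyword         TEXT NOT NULL,               -- ambiguous domain keyword
    related_keyword TEXT NOT NULL,               -- other domain keyword it relates to
    partner_flag    INTEGER NOT NULL DEFAULT 0   -- 1 if the related keyword is a partner, else 0
);
"""
conn = sqlite3.connect(":memory:")
conn.executescript(ddl)
```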

Fig. 15. Harvest rate of TFC and EFC against retrieved pages using STS as a link weighting strategy.

Fig. 16. Harvest rate of TFC and EFC against retrieved pages using Bay as a link weighting strategy.

7.3. Evaluating the proposed distiller

The performance of the proposed distiller (e.g., ONB) is affected by: (i) the number of training pages (TPs), (ii) the kernel function, and (iii) the value of the soft margin parameter C used in SVM training. According to [39], the most appropriate kernel function for binary text classification is the linear kernel, especially for a large number of concepts. Based on the linear kernel function, through this section, the pre-mentioned performance metrics (e.g., P, Acc, and E) are measured for ONB as well as its competitors using different numbers of training pages.

Considering Figs. 11–13, basically, by increasing the number of TPs, the performance of all classification techniques is promoted. The reason is that as the number of TPs increases, the classifiers are better trained as they collect more domain knowledge. Hence, it is obvious that performance promotion can be accomplished by training the classifier with more TPs. The best precision, accuracy, and error are obtained at the maximum number of training pages (e.g., when TPs = 10000).

As generally depicted in Figs. 11–13, ONB outperforms all other classification techniques. When TPs = 10000, ONB's precision, accuracy, and error are 0.86, 0.89, and 0.11 respectively.

Fig. 17. Harvest rate of TFC and EFC against retrieved pages using STD as a link weighting strategy.

Fig. 18. Harvest rate of TFC and EFC against retrieved pages using TC as a link weighting strategy.

On the other hand, KNN has the worst performance, because it is highly affected by the noise due to outliers that may exist in the training pages, which have been successfully neglected in ONB by eliminating the support vectors. When TPs = 10000, the precision, accuracy, and error of KNN are 0.7, 0.75, and 0.25 respectively. DOKNN has better performance than KNN; for DOKNN, when TPs = 10000, precision, accuracy, and error are 0.72, 0.79, and 0.21 respectively. Moreover, NB and DONB have good performance as they are probabilistic classifiers. They both outperform SVM and KNN as they depend on the NB theorem. When TPs = 10000, NB's precision, accuracy, and error are 0.78, 0.83, and 0.17 respectively, while DONB's precision, accuracy, and error are 0.81, 0.86, and 0.14. On the other hand, SVM's precision, accuracy, and error are 0.74, 0.81, and 0.19 respectively.

Again, consider Figs. 11–13. Fig. 11 shows the precision against TPs. Generally, the precision of all classifiers increases gradually by increasing the number of TPs. It is noticed that ONB has the highest precision compared with the others. On the other hand, as illustrated in Fig. 12, ONB introduces higher accuracy than the other classifiers; ONB's accuracy is gradually improved by increasing the number of TPs until it reaches 0.89 when TPs = 10000. Finally, Fig. 13 depicts the error; it is concluded that ONB introduces a significant error reduction compared with its competitors. The reason is that ONB combines evidence from the NB and SVM classifiers.
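The per-classifier figures above follow directly from the outcome counts of Tables 6 and 7 through (13)–(15); a minimal helper (hypothetical naming) to compute them is:

```python
def evaluation_metrics(a, b, c, d):
    """Precision, accuracy, and error from the outcome counts of Table 6 (eqs. 13-15)."""
    precision = a / (a + b)                  # correctly assigned / all assigned
    accuracy = (a + c) / (a + b + c + d)     # correct decisions / all pages
    error = (b + d) / (a + b + c + d)        # incorrect decisions / all pages
    return precision, accuracy, error
```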

Fig. 19. Relevancy score against retrieved pages of TFC and EFC using STS as a link weighting strategy.

Fig. 20. Relevancy score against retrieved pages of TFC and EFC using Bay as a link weighting strategy.

7.4. Effects of the proposed distiller on the crawling performance

In general, focused crawlers can be evaluated by measuring their ability to retrieve good pages. A good page is one that is highly relevant to the domain of interest. As depicted in Fig. 14, all web pages are represented by the Web set, which is denoted as W, while the set of good pages (e.g., relevant to the domain of interest) is called the Relevant set and is denoted as R. On the other hand, the Crawled set is denoted as C, which represents the pages that the crawler has visited; hence, (R, C) ⊆ W. The target of the focused crawler is to maximize the set R ∩ C, which indicates a large number of good retrieved pages, while keeping the set X = C − (R ∩ C) as small as possible. An ideal focused crawler, which is unrealistic, has C ⊆ R and accordingly X = ∅. Fig. 14(A) represents the ideal focused crawler, while Fig. 14(B) represents the traditional focused crawler. The target of the focused crawler is expressed in (16).

Target(Focused Crawler) = [max(C ∩ R)] AND [min(C − (C ∩ R))]        (16)

For illustration, assume a focused crawler CRL whose domain of interest is Education. The set W, which represents all the available pages in all domains, is assumed to be 1000 pages, in which 100 pages are related to the Education domain; these represent the set R. In order to minimize the cost in terms of time and storage penalties, CRL tries to visit only those relevant pages while discarding irrelevant ones.

Fig. 21. Relevancy score against retrieved pages of TFC and EFC using STD as a link weighting strategy.

Fig. 22. Relevancy score against retrieved pages of TFC and EFC using TC as a link weighting strategy.

The set C, which indicates the pages that the crawler has already visited, is assumed to be 200 pages, in which only 50 pages are related to Education. Hence |C ∩ R| = 50 and |C − (C ∩ R)| = 150; accordingly, CRL's precision is 50/200 = 0.25. However, if CRL crawls the same number of pages (e.g., 200 pages), in which 70 pages are relevant to Education, hence |R ∩ C| = 70, then CRL's precision is 70/200 = 0.35. Thus, one of the targets of CRL is to maximize |C ∩ R|. On the other hand, if CRL has the ability to retrieve 50 pages related to Education (e.g., |C ∩ R| = 50) by visiting only 125 pages (e.g., |C| = 125, then |C − (C ∩ R)| = 75), then CRL's precision is 50/125 = 0.4. Hence, an effective focused crawler should maximize the number of crawled pages that are related to the domain (e.g., the set C ∩ R), while minimizing the retrieved irrelevant ones (e.g., the set C − (C ∩ R)).

The Education domain was selected to be the domain of interest. In this section, the focused crawling task is done in two different scenarios. The first is the Traditional Focused Crawler (TFC) as explained in section 2.2, while the second is the Effective Focused Crawler (EFC), which is illustrated in section 3. The crawling queue, in both types of crawlers, is filled with 10 seed pages. Although those seeds are highly related to the domain of interest, which is the Education domain, they also contain bad links (e.g., links pointing at irrelevant pages), such as advertisement and utility pages. Several link weighting strategies are used for implementing both crawlers (e.g., TFC and EFC), which are illustrated in Table 8. The tested crawlers are allowed to retrieve 10000 pages, and then the crawling harvest rate and average relevancy are calculated.
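The arithmetic of the CRL example above reduces to simple set operations; a tiny illustrative helper (hypothetical, using Python sets of page identifiers) is:

```python
def crawl_precision(crawled: set, relevant: set) -> float:
    """Precision of a focused crawl: |C intersect R| / |C|."""
    return len(crawled & relevant) / len(crawled)

# 200 crawled pages with 50 relevant -> 0.25; 125 crawled pages with 50 relevant -> 0.4
```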

Table 8
The link weighting strategies used for implementing both crawling ways (e.g., TFC and EFC).

Similarity To Seeds (STS): According to this strategy, the score of the currently processed page is the sum of its similarities to the considered seeds. Hence, the input page as well as the seeds are expressed in a vector space model of 485 dimensions (the number of concepts in the Education domain) {a1, a2, ..., a485}. In this representation, if the concept Ci exists in the seed, the corresponding value ai is assigned to 1 in the vector; otherwise it is 0. Let Doc be the input page; the page score is the sum of its similarity scores to all seeds. Hence, the score for a page Doc is calculated by (17) (assuming z training examples {e1, e2, ..., ez}). Finally, the page's score is assigned to all links extracted from it.

Score(Doc) = Σ(i=1 to z) sim(Doc, ei)        (17)

The used similarity measure is the cosine similarity [40], which can be calculated using (18):

Simcos(Doc, e) = Σ(k ∈ Doc ∩ e) ak(Doc) · ak(e) / [ sqrt(Σ(k=1 to z) ak(Doc)²) · sqrt(Σ(k=1 to z) ak(e)²) ]        (18)

Bayesian (Bay): According to this weighting technique, the crawler uses a Bayesian similarity score to calculate the relevancy of the processed page according to Bayes' theorem, using the seeds as training examples. In the training phase, all seeds' texts are combined, then domain keywords are extracted and mapped to the corresponding concepts, and the probability of each domain concept is calculated. When a Bayesian score needs to be assigned to an input page, the page's domain keywords are extracted and then mapped to the corresponding domain concepts. The page's relevancy to the domain is calculated (e.g., domain membership). Such a domain relevance score is assigned (as a weighting score) to all links extracted from the page.

Similarity To Domain (STD): In STD, Term Frequency (TF) is used to estimate the similarity between the currently processed page and the domain of interest. Hence, the more domain keywords found within the processed page, the higher the page-to-domain relevancy. Then, the calculated page-to-domain relevancy is assigned to all links extracted from the page.

Treasure Crawler (TC): In TC [41], specific HTML elements are extracted from the input page. These elements are then given to a domain relevancy calculator, which is supplied with the Dewey Decimal Classification (DDC) system. Then, the set of DDC entries specifies the crawler's topic. If the relevancy calculator decides that an unvisited link is on-topic, then its HTML elements are compared to T-Graph nodes and its priority score is assigned. However, if an unvisited link is off-topic, it receives the lowest priority score and is ignored by the crawler. On-topic URLs with their priority scores are then injected into the fetcher queue for future downloads.
Table 9
Harvest Rate (HR) using different link weighting strategies in both crawling ways (e.g., TFC and EFC).
Columns (left to right): Strategy; Visited pages; then for TFC: Relevant, Irrelevant, Harvest Rate, Relevancy Score; then for EFC: Relevant, Irrelevant, Harvest Rate, Relevancy Score.
Similarity To Seeds 4000 2204 1796 0.551 0.469263 2245 1755 0.56125 0.530737
(STS) 5000 2240 2760 0.448 0.362875 3320 1680 0.664 0.637125
6000 2264 3736 0.377333333 0.362202 3376 2624 0.562666667 0.637798
7000 2273 4727 0.324714286 0.291012 4478 2522 0.639714286 0.708988
8000 3360 4640 0.42 0.348836 5510 2490 0.68875 0.651164
9000 3394 5606 0.377111111 0.334054 5567 3433 0.618555556 0.665946
10000 4430 5570 0.443 0.36998 6613 3387 0.6613 0.63002
Bayesian (Bay) 4000 2279 1721 0.56975 0.363967 3302 698 0.8255 0.636033
5000 3336 1664 0.6672 0.443952 3378 1622 0.6756 0.556048
6000 3398 2602 0.566333333 0.389742 4456 1544 0.742666667 0.610258
7000 4429 2571 0.632714286 0.422513 5534 1466 0.790571429 0.577487
8000 5503 2497 0.687875 0.403021 6623 1377 0.827875 0.596979
9000 5569 3431 0.618777778 0.411013 7723 1277 0.858111111 0.588987
10000 6613 3387 0.6613 0.439517 8802 1198 0.8802 0.560483
Similarity To Domain 4000 3309 691 0.82725 0.482959 3342 658 0.8355 0.517041
(STD) 5000 3369 1631 0.6738 0.410157 4421 579 0.8842 0.589843
6000 4453 1547 0.742166667 0.426159 5512 488 0.918666667 0.573841
7000 5521 1479 0.788714286 0.405317 6623 377 0.946142857 0.594683
8000 5587 2413 0.698375 0.402777 7723 277 0.965375 0.597223
9000 6614 2386 0.734888889 0.411544 8807 193 0.978555556 0.588456
10000 6653 3347 0.6653 0.37163 9901 99 0.9901 0.62837
Treasure Crawler (TC) 4000 3699 301 0.92475 0.483875 3948 52 0.987 0.516125
5000 4060 940 0.812 0.450434 4932 68 0.9864 0.549566
6000 4997 1003 0.832833 0.455419 5902 98 0.983667 0.544581
7000 5845 1155 0.835 0.465589 6822 178 0.974571 0.534411
8000 6233 1767 0.779125 0.442433 7855 145 0.981875 0.557567
9000 7188 1812 0.798667 0.449597 8853 147 0.983667 0.550403
10000 8864 1136 0.8864 0.474608 9930 70 0.993 0.525392

7.4.1. Measuring the harvest rate

Harvest Rate (HR) is defined as the percentage of retrieved and relevant pages over the overall retrieved pages during the crawl. Therefore, if 10 relevant pages are found in the first 100 crawled pages, then a harvest rate of 10% at 100 pages is concluded. HR can be calculated using (19), where nr is the number of retrieved and relevant pages and n is the total number of retrieved pages; the results are shown in Table 9.

HR = nr / n        (19)

Figs. 15–18 show the harvest rate of the proposed EFC as well as TFC against the number of retrieved pages using several link weighting strategies, which are STS, Bay, STD, and TC respectively. As illustrated in these figures, EFC outperforms TFC regardless of the used link weighting strategy. This proves the effectiveness of adding domain distillers to traditional focused crawlers.

7.4.2. Relevancy score

In this experiment, the quality of the pages retrieved by the various crawlers (e.g., TFC and EFC) has been evaluated by employing the different considered weighting strategies illustrated in Table 8 (e.g., STS, Bay, STD, and TC). Generally, assume we have N different crawlers, which are allowed to start from the same seeds (10 manually chosen seeds related to the Education domain) and run until each crawler retrieves P pages; hence, N * P pages (from all crawlers) are collected. After removing similar pages retrieved by different crawlers, M pages are available, which are ranked by a human assessor into 5 different categories with a corresponding score according to Table 10. Then, the crawler score can be calculated using (20).

Score(Crawlerj) = Σ(i=1 to P) score(pageji) / [ Σ(j=1 to N) Σ(i=1 to P) score(pageji) ]        (20)

Where pageji is the ith page retrieved by the jth crawler. When such an evaluation strategy is followed using P = 4000, 5000, ..., 10000 and N = 2 (the number of crawling strategies, which are EFC and TFC), the results are illustrated in Figs. 19–22. Those figures show that EFC outperforms TFC as it achieves the higher score, which indicates the high quality of the retrieved pages. Surely, this indicates the effectiveness of adding domain distillers to traditional focused crawlers.

Table 10
Different categories and the corresponding scores.

| Category ID | Description | Score |
| 1 | Very relevant | 20 |
| 2 | Relevant | 15 |
| 3 | Medium | 10 |
| 4 | Related | 5 |
| 5 | Irrelevant | 0 |

8. Conclusion

Traditional focused crawlers rely on estimations; hence, they estimate whether a page is related to the DOI or not before actually retrieving it. If the estimation decides that the page is good, the page will be retrieved and passed to the indexer, and then the page's links are injected into the crawling queue. However, if the estimation is not accurate, the crawler will be skewed away from its target of retrieving pages related to a specific domain. The originality of this paper is concentrated in introducing an effective modification on the behavior of focused crawlers by employing a domain distiller to decide whether the retrieved page is related to the crawler's DOI, and then taking a decision to index the page and add its links to the crawling queue accordingly. The proposed domain distiller combines the SVM and NB classifiers in a new instance called the Optimized Naïve Bayes (ONB) classifier. Initially, a genetic algorithm (GA) is used to optimize the soft margins of SVM. Then the optimized SVM is employed to discard the outliers from the available training examples. Next, the pruned examples are used to train the traditional NB classifier. Moreover, ONB employs word sense disambiguation to identify the accurate sense of each domain keyword extracted from the input page. ONB has been tested against recent classification techniques. Experimental results have proven the effectiveness of ONB as it introduces the maximum classification accuracy. Also, the results indicate that the proposed distiller improves the performance of focused crawling in terms of crawling harvest rate.

References

[1] M. Razek, Credible mechanism for more reliable search engine results, Int. J. Inf. Technol. Comput. Sci. 3 (2015) 12–17.
[2] R. Shettar, R. Bhuptani, A vertical search engine based on domain classifier, Int. J. Comput. Sci. Secur. 2 (4) (2007) 18–27.
[3] A. Elyasir, K. Anbananthen, Focused web crawler, International Conference on Information and Knowledge Management, Vol. 45 (2012).
[4] A. Sun, E. Lim, W. Ng, Web classification using support vector machine, in: Proceedings of the 4th International Workshop on Web Information and Data Management, New York, ACM Press, 2002, pp. 96–99.
[5] O. Kwon, J. Lee, Web page classification based on k-nearest neighbor approach, in: Proceedings of the 5th International Workshop on Information Retrieval with Asian Languages, Hong Kong, China, ACM Press, 2000, pp. 9–15.
[6] J. Orallo, Extending decision trees for web categorization, Proceedings of the 2nd Annual Conference of the ICT for EU India Cross Cultural Dissemination (2005).
[7] Z. Liu, Y. Zhang, A competitive neural network approach to web-page categorization, Int. J. Uncertainty Fuzziness Knowl. Based Syst. 9 (6) (2001) 731–741.
[8] A. Saleh, A. El Desouky, S. Ali, Promoting the performance of vertical recommendation systems by applying new classification techniques, Knowl. Based Syst. 75 (2015) 192–223.
[9] R. Navigli, Word sense disambiguation: a survey, ACM Comput. Surv. 41 (2) (2009).
[10] H. Isahara, K. Kanzaki, Advances in natural language processing, in: Proceedings of the 8th International Conference on NLP, Springer, Japan, October 22–24, 2012.
[11] T. Udapure, R. Kale, R. Dharmik, Study of web crawler and its different types, IOSR J. Comput. Eng. (IOSR-JCE) 16 (1) (2014) 1–5.
[12] S. Brin, L. Page, The anatomy of a large-scale hypertextual Web search engine, Comput. Netw. ISDN Syst. 30 (7) (1998) 107–117.
[13] M. Selvakumar, A. Vijaya, Design and development of a domain specific focused crawler using support vector learning strategy, Int. J. Innov. Res. Comput. Commun. Eng. 2 (5) (2014).
[14] M. Jamali, H. Sayyadi, B. Hariri, H. Abolhassani, A method for focused crawling using combination of link structure and content similarity, in: Web Intelligence, IEEE Comput. Soc. (2006) 753–756.
[15] S. Zheng, P. Dimitriev, C.L. Giles, Graph based crawler seed selection, Proceedings of the 18th International Conference on World Wide Web (WWW) (2009) 1089–1090.
[16] G.A. Miller, WordNet: a lexical database for English, Commun. ACM 38 (11) (1995) 39–41.
[17] Y. Wang, Z. Gong, Hierarchical classification of web pages using support vector machine, Lecture Notes in Computer Science, Vol. 5362, Springer, 2008, pp. 12–21.
[18] R. Chen, C. Hsieh, Web page classification based on a support vector machine using a weighted vote schema, Expert Syst. Appl. 31 (2) (2006) 427–435.
[19] W. Liu, G. Xue, Y. Yu, H. Zeng, Importance-based web page classification using cost-sensitive SVM, Proceedings of International Conference on Web-Age Information Management (2005) 127–137.
[20] E. Youn, M. Jeong, Class dependent feature scaling method using Naive Bayes classifier for text data mining, Pattern Recognit. Lett. 30 (2009) 477–485.
[21] W. Zhanga, F. Gaoa, An improvement to naive bayes for text classification, Proc. Eng. 15 (2011) 2160–2164.
[22] Z. Mei, Q. Shen, B. Ye, Hybridized KNN and SVM for gene expression data classification, Life Sci. J. 6 (1) (2009) 61–66.
[23] T. Li, S. Zhu, M. Ogihara, Text categorization via generalized discriminant analysis, Inf. Process. Manage. 44 (5) (2008) 1684–1697.
[24] F. Li, Y. Yang, A loss function analysis for classification methods in text categorization, ICML, 2003, pp. 472–479.
[25] D. Miao, Q. Duan, H. Zhang, N. Jiao, Rough set based hybrid algorithm for text classification, Expert Syst. Appl. 36 (5) (2009) 9168–9174.

[26] G. Ifrim, A Bayesian learning approach to concept-based document classification, M.Sc. thesis, Computer Science Dept., Saarland University, Saarbrücken, Germany, February 2005.
[27] R. Vinoth, A. Jayachandran, M. Balaji, R. Srinivasan, A hybrid text classification approach using KNN and SVM, Int. J. Adv. Found. Res. Comput. (IJAFRC) 1 (3) (2014) 2348–4853.
[28] N. Tripathia, M. Oakesa, S. Wermterb, Hybrid classifiers based on semantic data subspaces for two-level text categorization, Int. J. Hybrid Intell. Syst. 10 (2013) 33–41.
[29] A. Seyfi, A. Patel, J.C. Júnior, Empirical evaluation of the link and content-based focused Treasure-Crawler, Comput. Stand. Interf. 44 (2016) 54–62.
[30] S. Yang, OntoCrawler: a focused crawler with ontology-supported website models for information agents, Expert Syst. Appl. 37 (2010) 5381–5389.
[31] G. Forman, An extensive empirical study of feature selection metrics for text classification, J. Mach. Learn. Res. 3 (2003) 1289–1305.
[32] D. Bollegala, Y. Matsuo, M. Ishizuka, Measuring semantic similarity between words using web search engines, Proceedings of International Conference on World Wide Web (2007) 757–766.
[33] V. Cherkassky, M. Yunqian, Practical selection of SVM parameters and noise estimation for SVM regression, Neural Netw. 17 (1) (2004) 113–126.
[34] T. Pencheva, K. Atanassov, A. Shannon, Modelling of a roulette wheel selection operator in genetic algorithms.
[35] F. Alabsi, R. Naoum, Comparison of selection methods and crossover operations using steady state genetic based intrusion detection system, J. Emerg. Trends Comput. Inf. Sci. 3 (7) (2012).
[36] H. Ali, A. El Desouky, A. Saleh, Studying and analysis of a vertical web page classifier based on continuous learning Naïve Bayes (CLNB) algorithm, IGI Global (2009) 210–245.
[37] R. Meusel, P. Petrovski, C. Bizer, The web data commons microdata, RDFa and microformat dataset series, in: Proceedings of the 13th International Semantic Web Conference (ISWC 2014), Italy, Springer Berlin Heidelberg, 2014, pp. 277–292.
[38] Y. Lin, J. Jiang, S. Lee, A similarity measure for text classification and clustering, IEEE Trans. Knowl. Data Eng. 26 (7) (2014) 1575–1590.
[39] C. Hsu, C. Chang, C. Lin, A Practical Guide to Support Vector Classification, Technical Report, Department of Computer Science and Information Engineering, National Taiwan University, Taipei, 2003.
[40] J. Zhao, M. Lan, J. Tian, ECNU: using traditional similarity measurements and word embedding for semantic textual similarity estimation, in: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Denver, Colorado, June 2015, pp. 117–122.
[41] A. Seyfi, A. Patel, J. Júnior, Empirical evaluation of the link and content based focused Treasure Crawler, Comput. Stand. Interf. 44 (2016) 54–62.