TOPIC ORIENTED COLLABORATIVE DISTRIBUTED CRAWLER


Abstract
A major concern in the implementation of a distributed Web crawler is the choice of a
strategy for partitioning the Web among the nodes in the system. Our goal in selecting this strategy
is to minimize the overlap between the activities of individual nodes. We propose a topic-oriented
approach, in which the Web is partitioned into general subject areas with a crawler assigned to
each. We examine design alternatives for a topic-oriented distributed crawler and analyze the
performance of the implemented crawler. The experimental evaluation demonstrates the feasibility
of the approach, addressing issues of communication overhead and duplicate content detection.
Introduction
A crawler is a program that gathers resources from the Web. Web crawlers are widely used
to gather pages for indexing by Web search engines but may also be used to gather information for
Web data mining, for question answering, and for locating pages with specific content. A crawler
operates by maintaining a pending queue of URLs that the crawler intends to visit.
At each stage of the crawling process a URL is removed from the pending queue, the
corresponding page is retrieved, URLs are extracted from this page, and some or all of these URLs
are inserted back into the pending queue for future processing. For performance, crawlers often use
asynchronous I/O to allow multiple pages to be downloaded simultaneously, or are structured as
multithreaded programs, with each thread executing the basic steps of the crawling process
concurrently with the others.
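As an illustration, a minimal single-threaded version of this loop might look like the following Java sketch (class and method names are our own, and page fetching and link extraction are stubbed out for brevity):

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// A minimal single-threaded crawl loop: remove a URL from the pending
// queue, fetch the page, extract its links, and enqueue the unseen ones.
public class BasicCrawler {
    private final Queue<String> pending = new ArrayDeque<>();
    private final Set<String> visited = new HashSet<>();

    public void crawl(String seedUrl, int maxPages) {
        pending.add(seedUrl);
        int fetched = 0;
        while (!pending.isEmpty() && fetched < maxPages) {
            String url = pending.remove();
            if (!visited.add(url)) continue;      // skip URLs seen before
            String page = fetchPage(url);         // stub: download the page body
            fetched++;
            for (String link : extractLinks(page)) {
                if (!visited.contains(link)) pending.add(link);
            }
        }
    }

    private String fetchPage(String url) { return ""; }                   // placeholder
    private List<String> extractLinks(String page) { return List.of(); }  // placeholder
}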
Existing System
In this section we outline the design of a topic-oriented collaborative crawler that uses a text
classifier to assign pages to nodes. Given the contents of a Web page, the classifier assigns the page
to one of n distinct subject categories. Each subject category is associated with a local crawler.
When the classifier assigns a page to a remote node, the local crawler transfers it to its assigned
node for further processing.
A topic-oriented collaborative crawler may be viewed as a set of broad-topic focused
crawlers that partition the Web between them. The breadth of the subject categories depends on the
value of n. For n in the range 10-20, two of the subject categories might be BUSINESS and
SPORTS. For larger n, the subject categories will be narrower, such as INVESTING, FOOTBALL,
and HOCKEY. The authors of the given paper used the Open Directory Project (ODP) to train their
system to classify the pages. According to the authors, the ODP is a self-regulated organization
maintained by volunteer experts who categorize URLs into a hierarchical class directory.
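As a rough sketch of the routing step described above, the per-node logic might resemble the following (the classifier and transfer interfaces are hypothetical placeholders, not the authors' implementation):

// Route a fetched page: classify it into one of n subject categories,
// process it locally if this node owns that category, otherwise hand
// it off to the remote node assigned to that category.
public class TopicRouter {
    private final int localCategory;          // category this node is responsible for
    private final PageClassifier classifier;  // assumed: maps page text to a category id
    private final NodeTransfer transfer;      // assumed: sends pages between nodes

    public TopicRouter(int localCategory, PageClassifier classifier, NodeTransfer transfer) {
        this.localCategory = localCategory;
        this.classifier = classifier;
        this.transfer = transfer;
    }

    public void route(String url, String pageText) {
        int category = classifier.classify(pageText); // one of n distinct categories
        if (category == localCategory) {
            processLocally(url, pageText);
        } else {
            transfer.send(category, url, pageText);   // forward to the assigned node
        }
    }

    private void processLocally(String url, String pageText) { /* index, extract links, ... */ }

    interface PageClassifier { int classify(String pageText); }
    interface NodeTransfer  { void send(int category, String url, String pageText); }
}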
Proposed System
In this project we implement a system with three crawlers, each of which is assigned a
particular topic to crawl. The three topics we currently have crawlers for are Sports, Business, and
Science. When started, each crawler is assigned its topic and given a start page from which to
begin crawling that topic. When the crawler visits a page, it first checks that the page belongs to
the category it is crawling. If the page is a good match, it is added to the good-links list and all of
its links are added to the open list. If the page is off topic, it is added to a list of bad links and none
of its links are added to the open list.
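A minimal sketch of this per-page decision is shown below (the list names are illustrative, and the on-topic test stands in for the keyword match described later):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Per-page decision: on-topic pages go to the good-links list and their
// outgoing links feed the open list; off-topic pages go to the bad-links
// list and contribute no further links to the crawl.
public class TopicCrawlerStep {
    private final List<String> goodLinks = new ArrayList<>();
    private final List<String> badLinks = new ArrayList<>();
    private final Queue<String> openList = new ArrayDeque<>();

    public void visit(String url, String pageText, List<String> outgoingLinks) {
        if (isOnTopic(pageText)) {
            goodLinks.add(url);
            openList.addAll(outgoingLinks);  // expand the frontier from good pages only
        } else {
            badLinks.add(url);               // record the miss; discard its links
        }
    }

    private boolean isOnTopic(String pageText) {
        return true; // placeholder for the keyword-matching check described below
    }
}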
Each web crawler keeps track of its own data: when a client wants information about
sports, it simply asks the sports crawler for that information. Each crawler maintains its own set of
data, which can be obtained directly from that crawler. If a system needs information from all the
crawlers, it simply asks each crawler in turn. This design avoids a single central location for all the
data, so if one crawler goes down, the data held by the others remains available to the system.
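One way to picture this decentralized lookup is the following sketch (the registry and interface names are our own invention, not part of the implemented system):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Each topic crawler owns its own result set; a client asks the crawler
// for one topic directly, or asks every crawler when it needs everything.
// With no central store, losing one crawler leaves the rest usable.
public class CrawlerRegistry {
    public interface TopicCrawler { List<String> goodLinks(); }

    private final Map<String, TopicCrawler> crawlers = new HashMap<>();

    public void register(String topic, TopicCrawler crawler) {
        crawlers.put(topic, crawler);
    }

    // Ask only the crawler responsible for one topic, e.g. "Sports".
    public List<String> query(String topic) {
        TopicCrawler c = crawlers.get(topic);
        return c == null ? List.of() : c.goodLinks();
    }

    // Gather results from all surviving crawlers, one at a time.
    public List<String> queryAll() {
        List<String> all = new ArrayList<>();
        for (TopicCrawler c : crawlers.values()) all.addAll(c.goodLinks());
        return all;
    }
}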
We also needed to be able to classify pages. Unlike the researchers, we chose a simpler
implementation than training our system to interpret page content. We simply scan each page for a
set of keywords associated with each topic. Based on how many matches appear on the page, we
decide whether the page matches the topic. Another idea we took from this paper is the absence of
a central storage location: in our system, if a client wants information about a topic, it sends a
message directly to the crawler that is crawling that topic.
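A minimal version of this keyword check might look like the following (the keyword lists and the match threshold are invented for illustration):

import java.util.List;
import java.util.Locale;
import java.util.Map;

// Count how many of each topic's keywords appear in the page text and
// declare a match when the best-scoring topic clears a fixed threshold.
public class KeywordClassifier {
    private static final Map<String, List<String>> TOPIC_KEYWORDS = Map.of(
        "Sports",   List.of("score", "team", "league", "match", "player"),
        "Business", List.of("market", "stock", "company", "revenue", "trade"),
        "Science",  List.of("research", "experiment", "theory", "data", "study"));

    private static final int THRESHOLD = 2; // illustrative minimum keyword hits

    // Returns the matched topic name, or null if no topic clears the threshold.
    public static String classify(String pageText) {
        String text = pageText.toLowerCase(Locale.ROOT);
        String bestTopic = null;
        int bestHits = 0;
        for (Map.Entry<String, List<String>> e : TOPIC_KEYWORDS.entrySet()) {
            int hits = 0;
            for (String kw : e.getValue()) {
                if (text.contains(kw)) hits++;
            }
            if (hits > bestHits) { bestHits = hits; bestTopic = e.getKey(); }
        }
        return bestHits >= THRESHOLD ? bestTopic : null;
    }
}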



INTRODUCTION

The purpose of this section is to provide the reader with general background
information about the Web Crawler software.
Purpose
This document is the Software Requirement Specification (SRS) for the Topic Oriented
Distributed Web Crawler. It describes the functions and performance requirements of
the Web Crawler. Initially, the search engine crawls and indexes web pages through the
command prompt and displays the results in Java applets.
Scope
Within a four-month time constraint, we have carried out the analysis, design,
implementation, and integration of the Web Crawler, which consists of three main
modules.
To gain insight into how existing crawlers work, a comparative study of the features
offered by several engines has been made. A survey of the existing crawlers working in
the background of various search engines has also been conducted in order to understand
the additional expectations placed on a current crawling application. The planning and
requirement-gathering stages form the groundwork for further analysis and design;
hence these stages have been allotted a period of two months.



Objective
A major concern in the implementation of a distributed Web crawler is the choice of
a strategy for partitioning the Web among the nodes in the system. Our goal in selecting
this strategy is to minimize the overlap between the activities of individual nodes. We
propose a topic-oriented approach, in which the Web is partitioned into general subject
areas with a crawler assigned to each. We examine design alternatives for a topic-oriented
distributed crawler and analyze the performance of the implemented crawler. The
experimental evaluation demonstrates the feasibility of the approach, addressing issues of
communication overhead and duplicate content detection.
INTENDED AUDIENCE AND READING SUGGESTIONS

This document is meant for users, developers, project managers, testers, and
documentation writers. The SRS aims to explain, in a straightforward manner, the basic
idea behind the BATSS Search Engine and how the developers aim to achieve their goals.
It also aims to introduce to users the main features of the BATSS Search Engine and
what makes it different from other search engines such as Google, Yahoo!, etc.
To gain insight into how existing search engines work, a study of the features offered
by several engines has been made. A survey of existing search engines has also been
conducted in order to understand the additional expectations placed on a current
search engine.

SOFTWARE AND HARDWARE REQUIREMENTS
Software Requirements
Operating System : Any 32-bit OS
Platform : Windows
Language : Java
Technology : Swing

Hardware Requirements
Processor : Pentium 4
RAM : 512 MB
Hard Disk : 256 MB
Monitor : VGA Color (256)
