
WebScrapper

A multiagent system for web crawling

Neves, João
Departamento de Informática, Universidade do Minho, Braga, Portugal
joaopmneves@gmail.com

Abstract—The present work demonstrates the use of a cooperative multiagent system for recovering properly formatted information from auction and sales portals. The JADE (Java Agent DEvelopment Framework) [1] toolkit was used to accomplish this task. We chose a multi-agent architecture because the different tasks involved in this process can be executed independently by simple, task-oriented agents in a cooperative environment.

Keywords: multiagent systems; web crawling; data warehouses; JADE; web portals.


I. INTRODUCTION

In today's Internet era, the proliferation of auction and sales portals has increased the offer to the general public, at times eliminating the intermediary entities that had previously been the primary interface between the manufacturers of goods and the consumers in the distribution chain. This has created a mixed environment where private individuals compete with companies selling the same kind of items. This is of particular importance in the business of used items: cars, houses, electronic equipment, etc. We propose a model that allows the creation of a prototype system to collect properly formatted data from online auction and sales portals, transforming the extracted data into usable and useful information. The system can act as middleware, functioning as a gateway for automatic retrieval functions. The collected information can later be used to feed other systems such as data warehouses, portals, search engines, etc.

II. STRUCTURE OF WEBSCRAPPER


A. Motivation

The project described here has some similarities with shopbots [3] as defined by Fasli. However, shopbots have a much broader usage and our objectives were considerably simpler. Our purpose was to scrape from the HTML returned in web pages the relevant information according to very precise criteria. We chose a multiagent architecture for the following reasons: to decouple the different tasks involved in the collection of data from online resources; to create simple, task-oriented autonomous agents that together make it possible to construct a unified system capable of accomplishing the fairly complex task of extracting data from online selling systems; and to be able to scale the system so as to increase the number of data-collecting entities according to the needs of the application.

The JADE toolkit was chosen because it offers a ready-made platform to develop on top of and is compliant with FIPA standards. It is funded by large corporations and based on an open-source philosophy. This was a research project that might evolve into a final application, so the cost of investment in software was also accounted for in the decision process.

(Datinfor was the main sponsor for this project: http://www.datinfor.com)
B. WebScrapper system

The system is divided into three types of agents: a master, a crawler, and a specialized extractor that is built for the particular site subject to data scraping. The master and the crawler cooperate on the job of scheduling and fetching data from a given URL. They were made separate so as to allow several fetching agents (crawlers) and to make them anonymous if necessary. This is sometimes a requirement, depending on the amount of data to be extracted and on how paranoid the external sites are about automatic, non-human browsing; based on this, they may block access. The master agent (we assume only one is necessary) acts as a maestro responsible for distributing URLs to be fetched by the crawler community.
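As an illustration, a minimal crawler agent in JADE might look like the sketch below. The paper does not publish its code, so the class and method names (CrawlerAgent, fetch) and the plain HTTP GET are our assumptions, not the original implementation.

import jade.core.Agent;
import jade.core.behaviours.CyclicBehaviour;
import jade.lang.acl.ACLMessage;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Illustrative sketch only: a crawler agent that waits for a URL from the
// master, fetches the page, and replies with the raw HTML.
public class CrawlerAgent extends Agent {
    @Override
    protected void setup() {
        addBehaviour(new CyclicBehaviour(this) {
            @Override
            public void action() {
                ACLMessage msg = receive();
                if (msg == null) { block(); return; }
                ACLMessage reply = msg.createReply();
                try {
                    reply.setContent(fetch(msg.getContent())); // raw HTML, as returned by the site
                } catch (Exception e) {
                    reply.setPerformative(ACLMessage.FAILURE);
                    reply.setContent(e.getMessage());
                }
                myAgent.send(reply);
            }
        });
    }

    // Plain HTTP GET; the fetching mechanism is not specified in the paper.
    private String fetch(String url) throws Exception {
        StringBuilder html = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new URL(url).openStream()))) {
            for (String line; (line = in.readLine()) != null; ) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }
}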

The extractor agents work on the retrieved data and, according to their definitions, create the queries to remote sites, parse the retrieved contents, and may create additional URLs to be retrieved from the web sites. They work in stages that we elaborate on below. To glue everything together we chose to save the collected data to database tables. We also use the same database to store auxiliary tables that control and manage the communication between the agents. In the prototype, the crawler agents do not need to access the database.

The master and crawler exchange messages in a FIPA-compliant protocol. The information exchanged between them is a URL in one direction and HTML in the other: the URL to be fetched is transmitted to the crawler by the master, and the crawler replies with the fetched data in the form returned by the crawled site. This design choice was made so that the agents can be distributed across many sites with minimum requirements on their communication infrastructure; the system should be able to work with existing security systems (firewalls, proxies, etc.). The prototype was conceived to work in batch mode and to have its final data stored in a data warehouse.

The interfaces of most online auction and sales sites differ from each other significantly, and only very few of them offer methods for machine-to-machine integration. In order to integrate these sites we created the extractor agent. The most intricate and complex programming job of the whole system was the extractor agent code; it is also the only agent that, because of its fairly complex structure, shows some perceived intelligence. Initially we planned a general extractor that would apply regular expressions to the retrieved data to produce the extracted information, and a small prototype was built to attempt this. After initial hands-on experience we moved away from it, because the necessary structure was too complex to maintain and follow, which conflicted with our intention to design a simple system. Moreover, restricting ourselves to regular expressions meant a great loss of the richness offered by a complete language like Java compared with regular-expression scripts for data extraction. Our final decision was to create one agent per site, responsible for that site's data collection and extraction activities.

In our analysis we identified three stages that need to be processed when retrieving data from online shopping sites: query construction, list processing, and detail processing. The first stage, query construction, identifies the input variables and syntax of the site in order to construct the proper URL for data retrieval; it is also this stage that initiates the data collection activities of the whole system. In the second stage, list processing, the agent extracts the links that give access to the details of the retrieved records, as well as the pagination links when all pages relevant to the original query need to be extracted from the site. The last stage, detail processing, extracts the final data from the site, which in our prototype is added to a data warehouse. In certain situations the last step can be skipped, when the list returned by the site already carries enough information to fulfill the requirements of the retrieval task; in these cases the last two stages merge and function as one.
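The three stages can be pictured with a skeleton like the following. This is only a sketch of the idea as we read it from the text; the identifiers (SiteExtractor, buildQueryUrl, and so on) are hypothetical and do not come from the original system.

import java.util.List;
import java.util.Map;

// Illustrative skeleton of a per-site extractor, reflecting the three
// stages named in the text; all identifiers here are hypothetical.
public abstract class SiteExtractor {

    public enum Stage { QUERY, LIST, DETAIL }

    // Stage 1 - query construction: build the site-specific search URL
    // from the input variables of the query.
    protected abstract String buildQueryUrl(Map<String, String> criteria);

    // Stage 2 - list processing: pull the detail links and the pagination
    // links out of a result-list page.
    protected abstract List<String> extractDetailLinks(String listHtml);
    protected abstract List<String> extractPaginationLinks(String listHtml);

    // Stage 3 - detail processing: parse one detail page into a record
    // destined for the data warehouse.
    protected abstract Map<String, String> extractRecord(String detailHtml);

    // When the list page already carries all required fields, stages 2
    // and 3 merge and extractRecord is applied to the list HTML directly.
    protected boolean listHasFullDetails() { return false; }
}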

The communication between the master (the agent that supervises the actual external data extraction) and the extractors is done via a state table. In this table we maintain the type of extractor record (to distinguish between the different collecting sites), a URL, the actual retrieved data, the agents that executed the jobs (the fetching crawler and the processing extractor), the extraction stage, and several flag indicators. New entries are always created by extractor agents and are updated in turn by the master and the extractors as the request goes through the processing phases.

The communication between master and crawler uses standard FIPA-ACL performatives. A PROPOSE is sent to all crawler agents with a URL. An INFORM is sent by a crawler telling the master that it will fetch the URL. The master sends a CONFIRM to that crawler acknowledging that it accepts the job, and the crawler replies with the retrieved data under an AGREE performative.

One issue we needed to account for in this project was the coordination of the communication between the different agents acting on the environment. We separate this item in two: coordination between master and crawlers, and coordination between master and extractors. For the first relationship, master/crawler, the master was defined as the entity responsible for the distribution of work among the crawlers. The communication protocol described above takes care of assigning a task to only one agent at a time. The master agent collects jobs from the requests table that are flagged as "to be crawled" and hands them to the community of registered crawlers. When the task of fetching data is finished, the master writes the results in the requests table and updates the flag accordingly.

The relationship and interaction between extractors and master is as follows: when an extractor needs data from the Internet, it creates an entry in the requests table indicating the URL it wants extracted, with a raised flag for crawling. The interaction of the extractor with the requests table is a bit more complex, for it is this table that controls the course of action to be followed by the extractor agent. The extractor takes two roles: one creates new requests in the table; the other analyses the requests and acts according to the stage each request is in. It is also responsible for updating the flag that indicates a request has been processed. As the tasks are performed, the request is continuously updated by the executing agent.

In this prototype the input to the system is given through a user window that initiates the query and starts the whole system interaction. The output of the system is placed in a database table that precisely defines the collected data according to a pre-defined data structure.
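From the master's side, the handshake above could be sketched as follows in JADE. Again, this is only an illustrative sketch: the paper names the performative sequence but not the code, so the behaviour class, the conversation-id scheme, and the first-reply-wins policy are our assumptions.

import jade.core.AID;
import jade.core.Agent;
import jade.core.behaviours.OneShotBehaviour;
import jade.lang.acl.ACLMessage;
import jade.lang.acl.MessageTemplate;
import java.util.List;

// Illustrative master-side sketch of the PROPOSE/INFORM/CONFIRM/AGREE
// exchange described in the text; identifiers are hypothetical.
public class AssignUrlBehaviour extends OneShotBehaviour {
    private final List<AID> crawlers;
    private final String url;

    public AssignUrlBehaviour(Agent master, List<AID> crawlers, String url) {
        super(master);
        this.crawlers = crawlers;
        this.url = url;
    }

    @Override
    public void action() {
        // 1. PROPOSE the URL to the whole crawler community.
        ACLMessage propose = new ACLMessage(ACLMessage.PROPOSE);
        crawlers.forEach(propose::addReceiver);
        propose.setContent(url);
        propose.setConversationId("fetch-" + url.hashCode()); // assumed scheme
        myAgent.send(propose);

        // 2. The first crawler to answer with INFORM gets the job.
        MessageTemplate mt = MessageTemplate.and(
                MessageTemplate.MatchConversationId(propose.getConversationId()),
                MessageTemplate.MatchPerformative(ACLMessage.INFORM));
        ACLMessage inform = myAgent.blockingReceive(mt);

        // 3. CONFIRM the assignment to that crawler only.
        ACLMessage confirm = inform.createReply();
        confirm.setPerformative(ACLMessage.CONFIRM);
        myAgent.send(confirm);

        // 4. The crawler later replies with AGREE carrying the fetched HTML;
        //    the master then writes it back to the requests table (not shown).
    }
}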

III. CONCLUSIONS

With the present work it was possible to create a simple and functional system that permits the extraction of data from external auction and online sales sites. The multiagent architecture allows the system to scale easily as the number of requests grows. The designed system does not provide a final solution per se, but it can be used as middleware for a wide range of solutions: federated search systems, data collection crawlers, online coaching systems, etc.

IV. FUTURE WORK

An ontology will be defined for data extraction, creating a standard method to define the query expressions according to a specific domain of knowledge. We will also explore the Web Services interface provided by the JADE platform so as to create a standardized interface for online services. This interface could then be used as a transducer [4] to online auction and sales sites, exposing the retrieved information to general search engines using standard search protocols (SRU, Z39.50, OpenSearch).

REFERENCES

[1] Java Agent DEvelopment Framework. http://jade.cselt.it
[2] Bellifemine, F., Caire, G., Greenwood, D., "Developing Multi-Agent Systems with JADE". Wiley: Sussex, 2007, pp. 19-20.
[3] Fasli, M., "Shopbots: A Syntactic Present, a Semantic Future", IEEE Internet Computing, 10:6 (IEEE Press), 2006.
[4] Nikraz, M., Caire, G., Bahri, P. A., "A Methodology for the Analysis and Design of Multi-Agent Systems Using JADE". Murdoch University: Rockingham, 2006, pp. 10-11.
[5] Wooldridge, M., "An Introduction to Multiagent Systems", John Wiley & Sons, ISBN 0-471-49691-X, 2002.


