Academic documents
Professional documents
Cultural documents
Web pages
- A few thousand characters long
- Served over the Internet using the HyperText Transfer Protocol (HTTP)
- Viewed at the client end using browsers
HTML
- HyperText Markup Language: lets the author specify layout and typeface, embed diagrams, and create hyperlinks
- A hyperlink is expressed as an anchor tag with an HREF attribute
- HREF names another page using a Uniform Resource Locator (URL)
- URL = protocol field (HTTP) + server hostname (www.cse.iitb.ac.in) + file path (/, the `root' of the published file system)
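As an illustration, Python's standard `urllib.parse` module splits a URL into exactly these three components (a sketch added here, not part of the original slides):

```python
from urllib.parse import urlparse

# Split the example URL from the text into its three components.
parts = urlparse("http://www.cse.iitb.ac.in/")
protocol = parts.scheme   # protocol field: "http"
hostname = parts.netloc   # server hostname: "www.cse.iitb.ac.in"
path = parts.path         # file path: "/" (the root)
```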
Mining the Web, Chakrabarti and Ramakrishnan
Fetching a page
- Use the Domain Name Service (DNS), a distributed database of name-to-IP mappings maintained at a set of known servers
- Connect to the default HTTP port (80) on the server
- Send the HTTP request header (e.g., GET)
- Fetch the response
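The fetch sequence can be sketched at the socket level as below; `build_get_request` and `fetch` are illustrative names, and a production crawler would add error handling and timeouts:

```python
import socket

def build_get_request(host: str, path: str = "/") -> bytes:
    # Minimal HTTP/1.0 request header, terminated by a blank line.
    return f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")

def fetch(host: str, path: str = "/") -> bytes:
    ip = socket.gethostbyname(host)                # DNS: name -> IP mapping
    with socket.create_connection((ip, 80)) as s:  # default HTTP port
        s.sendall(build_get_request(host, path))   # send the request header
        chunks = []
        while (data := s.recv(4096)):              # read until the server closes
            chunks.append(data)
    return b"".join(chunks)
```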
MIME (Multipurpose Internet Mail Extensions) A meta-data standard for email and Web content transfer
Crawling procedure
- Simple in principle, but a great deal of engineering goes into industry-strength crawlers
- Industrial crawlers crawl a substantial fraction of the Web (e.g., AltaVista, Northern Light, Inktomi)
- No guarantee that all accessible Web pages will be located in this fashion
- The crawler may never halt: pages will be added continually even as it is running
Crawling overheads
Delays involved in:
- Resolving the host name in the URL to an IP address using DNS
- Connecting a socket to the server and sending the request
- Receiving the requested page in response
Solution: Overlap the above delays by fetching many pages at the same time
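One minimal way to overlap these delays is a standard-library thread pool; `fetch_all` and its parameters are illustrative, not from the original:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch_one, workers=16):
    # While one thread waits on DNS, connect, or transfer latency,
    # the other threads keep the network busy with their own fetches.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_one, urls))
```

`pool.map` preserves input order, so results line up with the URL list.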
Anatomy of a crawler
- Page-fetching threads: start with DNS resolution, finish when the entire page has been fetched
- Each page is stored in compressed form to disk/tape and scanned for outlinks
- A work pool of outlinks is maintained to keep the network utilized without overloading it
- Need to fetch many pages at the same time to utilize the network bandwidth: a single page fetch may involve several seconds of network latency
- Highly concurrent and parallelized DNS lookups
- Use of asynchronous sockets: the state of a fetch context is explicitly encoded in a data structure, and sockets are polled to check for completion of network transfers (multi-processing or multi-threading: impractical)
- Care in URL extraction: eliminating duplicates to reduce redundant fetches, avoiding spider traps
A customized DNS component with:
1. A custom client for address resolution
2. A caching server
3. A prefetching client
Custom client for address resolution
- Tailored for concurrent handling of multiple outstanding requests
- Allows issuing many resolution requests together, polling at a later time for completion of individual requests
Caching server
- A large cache, persistent across DNS restarts
- Residing largely in memory if possible
Prefetching client
Steps:
1. Parse a page that has just been fetched
2. Extract host names from HREF targets
3. Make DNS resolution requests to the caching server
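Steps 1 and 2 can be sketched as follows; the regular expression and the name `hosts_to_prefetch` are illustrative (a real crawler would use a full HTML parser):

```python
import re
from urllib.parse import urlparse

HREF_RE = re.compile(r'href\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def hosts_to_prefetch(html: str) -> set:
    # Extract host names from HREF targets; these are then handed to the
    # caching DNS server so answers are warm before the pages are fetched.
    hosts = set()
    for target in HREF_RE.findall(html):
        host = urlparse(target).netloc
        if host:                      # skip relative links with no host part
            hosts.add(host)
    return hosts
```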
Multi-threading
- Logical threads: physical threads of control provided by the operating system (e.g., pthreads), or concurrent processes
- A fixed number of threads is allocated in advance
- Programming paradigm: create a client socket, connect the socket to the HTTP service on a server, send the HTTP request header, then read the socket (recv) until no more bytes are available
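A sketch of the fixed-pool paradigm with a shared frontier; the lock illustrates the mutual exclusion whose cost the next slide discusses (all names here are illustrative):

```python
import queue
import threading

def crawl_workers(frontier, fetch_one, n_threads=4):
    # A fixed number of threads, allocated in advance, drain a shared frontier.
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                url = frontier.get_nowait()
            except queue.Empty:
                return                      # frontier drained: thread exits
            page = fetch_one(url)
            with lock:                      # mutual exclusion on shared state
                results.append((url, page))

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```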
Multi-threading: Problems
- Performance penalty of mutual exclusion for concurrent access to shared data structures
- Slow disk seeks: a great deal of interleaved, random input-output on disk, due to concurrent modification of the document repository by multiple threads
Non-blocking sockets and event handlers
- The select system call lets the application suspend until more data can be read from or written to a socket, timing out after a pre-specified deadline
- A monitor polls several sockets at the same time
- More efficient memory management; code that completes processing of a page is not interrupted by other completions
- No need for locks and semaphores on the work pool: only append complete pages to the log
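The polling step can be sketched with `select` directly; `poll_ready` is an illustrative helper, which a monitor would call in a loop, handing each readable socket to its fetch context:

```python
import select

def poll_ready(socks, timeout=1.0):
    # Suspend until at least one socket has data to read, or until the
    # pre-specified deadline passes (then an empty list is returned).
    readable, _, _ = select.select(socks, [], [], timeout)
    return readable
```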
Proxy pass
- Mapping of different host names to a single IP address: needed to publish many logical sites
- Relative URLs: must be interpreted with respect to a base URL
Canonical URL
Formed by:
- Using a standard string for the protocol
- Canonicalizing the host name
- Adding an explicit port number
- Normalizing and cleaning up the path
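A sketch of these four steps using the standard library (`canonicalize` is an illustrative name; real crawlers handle more cases, such as default documents and trailing slashes):

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    scheme, netloc, path, query, _ = urlsplit(url)
    scheme = scheme.lower() or "http"        # standard string for the protocol
    host = netloc.lower()                    # canonical (lower-case) host name
    if ":" not in host:
        host += ":80"                        # explicit port number
    path = posixpath.normpath(path or "/")   # normalize and clean up the path
    if path == ".":
        path = "/"
    return urlunsplit((scheme, host, path, query, ""))
```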
Robot exclusion
- Check whether the server prohibits crawling a normalized URL
- The robots.txt file in the HTTP root directory of the server specifies a list of path prefixes that crawlers should not attempt to fetch
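The standard library's `urllib.robotparser` implements this check; the helper below is an illustrative wrapper that parses a robots.txt body directly:

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, agent: str, url: str) -> bool:
    # True if the named crawler may fetch the URL under these rules.
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)
```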
- Qualifying URLs are added to the frontier of the crawl; their hash values are added to a B-tree
Spider traps
- Protection from crashing on ill-formed HTML
- Misleading sites: an indefinite number of pages dynamically generated by CGI scripts, and paths of arbitrary depth created using soft directory links and path-remapping features in the HTTP server
Duplicate detection
- Reduce redundancy in crawls: mirrored Web pages and sites
- Detecting exact duplicates: checking against MD5 digests of stored URLs
- Representing a relative link v (relative to aliases u1 and u2) as tuples (h(u1), v) and (h(u2), v)
- Solution for near-duplicates: shingling
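The exact-duplicate bookkeeping can be sketched as follows, with h realized as an MD5 digest (illustrative helper names):

```python
import hashlib

def h(url: str) -> str:
    # MD5 digest used as a compact, fixed-width fingerprint of a URL.
    return hashlib.md5(url.encode("utf-8")).hexdigest()

def link_tuples(v: str, aliases):
    # A relative link v on aliased pages u1, u2, ... becomes
    # (h(u1), v), (h(u2), v), ... so duplicates are cheap to spot.
    return [(h(u), v) for u in aliases]
```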
Mining the Web
Load monitor
Keeps track of various system statistics
Thread manager
Responsible for:
- Choosing units of work from the frontier
- Scheduling the issue of network resources
- Distributing these requests over multiple ISPs if appropriate
Per-server work queues
- Limit the number of active requests to a given server IP address at any time
- Maintain a queue of requests for each server
- Use the HTTP/1.1 persistent socket capability
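Grouping the frontier into one queue per server might look like this (an illustrative sketch):

```python
from collections import defaultdict
from urllib.parse import urlparse

def queues_by_server(urls):
    # One request queue per server host, so no single server is hit by
    # many simultaneous connections from the crawler.
    queues = defaultdict(list)
    for u in urls:
        queues[urlparse(u).netloc].append(u)
    return dict(queues)
```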
Text repository
- The crawler's last task: dumping fetched pages into a repository
- Decoupling the crawler from other functions is preferred, for efficiency and reliability
- Page-related information is stored in two parts: meta-data and page contents
Page Storage
- Small-scale systems: repository fits within the disks of a single machine
- Use a storage manager (e.g., Berkeley DB) to manage disk-based databases within a single file
- Configure it as a hash-table keyed by URL, or as a B-tree to handle ordered access of pages
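A sketch of the idea with the standard library's `dbm.dumb` as a stand-in for Berkeley DB (a pure-Python, disk-based key-value store, here keyed by URL):

```python
import dbm.dumb
import os
import tempfile

# Hypothetical repository location; a real deployment would use a fixed path.
repo_path = os.path.join(tempfile.mkdtemp(), "repository")

with dbm.dumb.open(repo_path, "c") as db:   # create the disk-based database
    db["http://www.example.com/"] = b"<html>...</html>"   # page keyed by URL

with dbm.dumb.open(repo_path, "r") as db:   # reopen read-only
    stored = db["http://www.example.com/"]
```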
Page Storage
- Large-scale systems: repository distributed over a number of storage servers
- Storage servers are connected to the crawler through a fast local network (e.g., Ethernet)
- Pages are assigned to storage servers by hashing their URLs
Large-scale crawlers often use multiple ISPs and a bank of local storage servers to store the pages crawled.
Refreshing crawled pages
- Prerequisite: the average interval at which the crawler checks a page for changes
- Small-scale intermediate crawler runs to monitor fast-changing sites
The Crawler's core task: to copy bytes from network sockets to storage media
Three methods express the Crawler's contract with the user:
1. Pushing a URL to be fetched to the Crawler (fetchPush)
2. A termination callback handler (fetchDone), called with the same URL
3. A method (start) which starts the Crawler's event loop
Implementation of the Crawler class needs two helper classes, called DNS and Fetch
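The contract can be sketched as a minimal class; this is an illustrative, synchronous stand-in for the event loop described above, not the book's actual implementation:

```python
class Crawler:
    """Sketch of the three-method contract: fetchPush, fetchDone, start."""

    def __init__(self, fetch_done):
        self._frontier = []            # URLs pushed but not yet fetched
        self._fetch_done = fetch_done  # termination callback handler

    def fetchPush(self, url):
        # Push a URL to be fetched to the Crawler.
        self._frontier.append(url)

    def start(self, fetch_one):
        # The Crawler's event loop: fetch each URL, then invoke the
        # fetchDone callback with the same URL.
        while self._frontier:
            url = self._frontier.pop(0)
            self._fetch_done(url, fetch_one(url))
```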