
Crawling the Web

Web pages
• A few thousand characters long
• Served through the Internet using the hypertext transport protocol (HTTP)
• Viewed at the client end using browsers

Crawler
• Fetches the pages to a computer
• At the computer, automatic programs can analyze hypertext documents

HTML
• HyperText Markup Language
• Lets the author specify layout and typeface, embed diagrams, and create hyperlinks
• A hyperlink is expressed as an anchor tag with an HREF attribute
• HREF names another page using a Uniform Resource Locator (URL)

URL
• protocol field (HTTP) + server hostname (www.cse.iitb.ac.in) + file path (/, the `root' of the published file system)
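As a quick illustration (not part of the original slides), the three URL fields can be pulled apart with Python's standard urllib.parse module; the URL shown is only an example.

```python
from urllib.parse import urlsplit

# Split an example URL into the fields described above.
parts = urlsplit("http://www.cse.iitb.ac.in/")
print(parts.scheme)    # 'http'               -> protocol field
print(parts.hostname)  # 'www.cse.iitb.ac.in' -> server hostname
print(parts.path)      # '/'                  -> file path, the `root' of the published file system
```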

HTTP (hypertext transport protocol)
• Built on top of the Transport Control Protocol (TCP)
• Steps (from the client end):
  - Resolve the server host name to an Internet address (IP)
    - Use the Domain Name System (DNS)
    - DNS is a distributed database of name-to-IP mappings maintained at a set of known servers
  - Contact the server using TCP
    - Connect to the default HTTP port (80) on the server
    - Send the HTTP request header (e.g., GET)
    - Fetch the response header
    - Fetch the HTML page
• MIME (Multipurpose Internet Mail Extensions)
  - A meta-data standard for email and Web content transfer
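A minimal sketch of these steps in Python, using only the standard library; the host name is an illustrative assumption, and a production crawler would add timeouts and error handling.

```python
import socket

host = "example.com"                      # hypothetical target host
ip = socket.gethostbyname(host)           # step 1: resolve the host name to an IP address
sock = socket.create_connection((ip, 80)) # step 2: TCP connection to the default HTTP port

# Step 3: send the HTTP request header (a simple GET), then read the response,
# which contains the response header followed by the HTML page.
request = f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n"
sock.sendall(request.encode("ascii"))

response = b""
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break
    response += chunk
sock.close()

header, _, body = response.partition(b"\r\n\r\n")
print(header.decode("iso-8859-1"))
```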

Crawl all Web pages?


• Problem: no catalog of all accessible URLs on the Web
• Solution: start from a given set of URLs
  - Progressively fetch and scan them for new outlinking URLs
  - Fetch these pages in turn, and so on
  - Submit the text in each page to a text indexing system
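A minimal sketch of this fetch-and-scan loop (not from the slides); the fetching, link extraction, and indexing hook are deliberately crude stand-ins.

```python
import re
import urllib.request
from urllib.parse import urljoin
from collections import deque

def index_text(url, html):
    # stand-in for a real text indexing system
    print(f"indexed {url} ({len(html)} characters)")

def crawl(seed_urls, max_pages=10):
    """Start from a given set of URLs, progressively fetch and scan them
    for new outlinking URLs, and hand the text to an indexer."""
    frontier = deque(seed_urls)      # work pool of URLs still to fetch
    seen = set(seed_urls)
    while frontier and max_pages > 0:
        url = frontier.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                 # skip pages that cannot be fetched
        max_pages -= 1
        index_text(url, html)        # submit the text to a text indexing system
        # crude HREF extraction; a real crawler would use an HTML parser
        for href in re.findall(r'href="([^"#]+)"', html, re.IGNORECASE):
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)

crawl(["http://example.com/"])       # hypothetical seed URL
```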

Crawling procedure
• Simple procedure, but a great deal of engineering goes into industry-strength crawlers
• Industry crawlers crawl a substantial fraction of the Web
  - E.g., AltaVista, Northern Light, Inktomi
• No guarantee that all accessible Web pages will be located in this fashion
• Crawler may never halt: pages are added continually even as it is running

Crawling overheads
• Delays involved in:
  - Resolving the host name in the URL to an IP address using DNS
  - Connecting a socket to the server and sending the request
  - Receiving the requested page in response
• Solution: overlap the above delays by fetching many pages at the same time


Anatomy of a crawler.
• Page fetching threads
  - Start with DNS resolution
  - Finish when the entire page has been fetched
• Each page
  - Stored in compressed form to disk/tape
  - Scanned for outlinks
• Work pool of outlinks
  - Maintains network utilization without overloading it
  - Dealt with by the load manager
• Continue till the crawler has collected a sufficient number of pages

Typical anatomy of a large-scale crawler.



Large-scale crawlers: performance and reliability considerations
• Need to fetch many pages at the same time
  - To utilize the network bandwidth
  - A single page fetch may involve several seconds of network latency
• Highly concurrent and parallelized DNS lookups
• Use of asynchronous sockets
  - Multi-processing or multi-threading: impractical at this scale
  - Explicit encoding of the state of a fetch context in a data structure
  - Polling sockets to check for completion of network transfers
• Care in URL extraction
  - Eliminating duplicates to reduce redundant fetches
  - Avoiding spider traps

DNS caching, pre-fetching and resolution
• A customized DNS component with:
  1. Custom client for address resolution
  2. Caching server
  3. Prefetching client


Custom client for address resolution
• Tailored for concurrent handling of multiple outstanding requests
• Allows issuing of many resolution requests together
  - Polling at a later time for completion of individual requests
• Facilitates load distribution among many DNS servers


Caching server
• With a large cache, persistent across DNS restarts
• Residing largely in memory if possible


Prefetching client
• Steps:
  1. Parse a page that has just been fetched
  2. Extract host names from HREF targets
  3. Make DNS resolution requests to the caching server
• Usually implemented using UDP
  - User Datagram Protocol: a connectionless, packet-based communication protocol
  - Does not guarantee packet delivery
• Does not wait for resolution to be completed


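A toy sketch of the caching and prefetching idea, assuming a thread-pool resolver in place of a true asynchronous UDP client; the host names are illustrative and the real component described above is more elaborate.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

class CachingResolver:
    """Cache name-to-IP mappings and allow prefetching of host names
    extracted from HREF targets, without waiting for completion."""

    def __init__(self, workers=10):
        self.cache = {}                       # host name -> IP address
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def _resolve(self, host):
        try:
            self.cache[host] = socket.gethostbyname(host)
        except OSError:
            self.cache[host] = None           # remember failures too

    def prefetch(self, hosts):
        # issue many resolution requests together; do not wait for results
        for host in hosts:
            if host not in self.cache:
                self.pool.submit(self._resolve, host)

    def lookup(self, host):
        # served from the cache if a prefetch already completed
        if host not in self.cache:
            self._resolve(host)
        return self.cache[host]

resolver = CachingResolver()
resolver.prefetch(["example.com", "example.org"])   # hypothetical hosts
print(resolver.lookup("example.com"))
```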

Multiple concurrent fetches


• Managing multiple concurrent connections
  - A single download may take several seconds
  - Open many socket connections to different HTTP servers simultaneously
• Multi-CPU machines not very useful
  - Crawling performance is limited by network and disk
• Two approaches:
  1. Using multi-threading
  2. Using non-blocking sockets with event handlers



Multi-threading
• Logical threads
  - Physical threads of control provided by the operating system (e.g., pthreads), OR
  - Concurrent processes
• A fixed number of threads is allocated in advance
• Programming paradigm (per thread, using blocking system calls):
  - Create a client socket
  - Connect the socket to the HTTP service on a server
  - Send the HTTP request header
  - Read the socket (recv) until no more characters are available
  - Close the socket


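A minimal sketch of this per-thread paradigm with blocking calls, using Python threads in place of pthreads; the URL list is illustrative only.

```python
import socket
import threading
from urllib.parse import urlsplit

def fetch(url):
    # one logical thread of control per fetch, using blocking system calls
    parts = urlsplit(url)
    sock = socket.create_connection((parts.hostname, parts.port or 80))   # connect
    sock.sendall(f"GET {parts.path or '/'} HTTP/1.0\r\nHost: {parts.hostname}\r\n\r\n".encode())
    chunks = []
    while True:
        data = sock.recv(4096)          # blocks until data arrives
        if not data:                    # until no more characters are available
            break
        chunks.append(data)
    sock.close()
    print(url, len(b"".join(chunks)), "bytes")

# a fixed pool of threads (here simply one per URL, for brevity)
urls = ["http://example.com/", "http://example.org/"]   # hypothetical URLs
threads = [threading.Thread(target=fetch, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
```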

Multi-threading: Problems
• Performance penalties
  - Mutual exclusion for concurrent access to shared data structures
  - Slow disk seeks: a great deal of interleaved, random input-output on disk, due to concurrent modification of the document repository by multiple threads


Non-blocking sockets and event handlers
• Non-blocking sockets
  - connect, send, or recv calls return immediately without waiting for the network operation to complete
  - Poll the status of the network operation separately
• select system call
  - Lets the application suspend until more data can be read from or written to the socket, timing out after a pre-specified deadline
  - A monitor polls several sockets at the same time
• More efficient memory management
• Code that completes processing of a page is not interrupted by other completions
• No need for locks and semaphores on the pool: only append complete pages to the log

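A condensed sketch of the event-handler approach with non-blocking sockets and select, polling several transfers at once; the hosts are illustrative and error handling is simplified.

```python
import select
import socket

hosts = ["example.com", "example.org"]     # hypothetical hosts
socks, buffers = {}, {}

for host in hosts:
    s = socket.socket()
    s.setblocking(False)                   # connect returns immediately
    try:
        s.connect((host, 80))
    except BlockingIOError:
        pass                               # completion is polled via select
    socks[s] = host
    buffers[s] = b""

pending_send = set(socks)
while socks:
    readable, writable, _ = select.select(list(socks), list(pending_send), [], 5.0)
    if not readable and not writable:
        break                              # nothing progressed before the deadline
    for s in writable:                     # connection established: send the request
        try:
            s.sendall(f"GET / HTTP/1.0\r\nHost: {socks[s]}\r\n\r\n".encode())
        except OSError:
            pass
        pending_send.discard(s)
    for s in readable:
        try:
            data = s.recv(4096)
        except OSError:
            data = b""
        if data:
            buffers[s] += data
        else:                              # transfer complete: append the whole page
            print(socks[s], len(buffers[s]), "bytes")
            s.close()
            del socks[s]
            pending_send.discard(s)
```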

Link extraction and normalization


• Goal: obtaining a canonical form of a URL
• URL processing and filtering
  - Avoid multiple fetches of pages known by different URLs
• Many IP addresses for one host name
  - For load balancing on large sites
  - Mirrored contents / contents on the same file system
• Proxy pass
  - Mapping of different host names to a single IP address
  - Need to publish many logical sites
• Relative URLs
  - Need to be interpreted w.r.t. a base URL



Canonical URL
Formed by:
• Using a standard string for the protocol
• Canonicalizing the host name
• Adding an explicit port number
• Normalizing and cleaning up the path

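A sketch of such a canonicalizer using only the standard library; the exact cleanup rules (default ports, trailing-slash handling) vary across crawlers, so these choices are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit
import posixpath

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                               # standard string for the protocol
    host = parts.hostname.lower() if parts.hostname else ""     # canonical host name
    port = parts.port or DEFAULT_PORTS.get(scheme, 80)          # explicit port number
    path = posixpath.normpath(parts.path or "/")                # normalize and clean up the path
    if parts.path.endswith("/") and path != "/":
        path += "/"
    return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))

print(canonicalize("HTTP://www.CSE.iitb.ac.in/a/b/../c"))
# -> http://www.cse.iitb.ac.in:80/a/c
```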

Robot exclusion
• Check whether the server prohibits crawling a normalized URL
  - In the robots.txt file in the HTTP root directory of the server
  - Specifies a list of path prefixes which crawlers should not attempt to fetch
  - Meant for crawlers only

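Python's standard library ships a parser for this file; a small sketch (the site, user-agent name, and path are illustrative).

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")   # robots.txt lives in the HTTP root directory
rp.read()                                     # fetch and parse the path-prefix rules

# Ask before fetching: user agent first, then the normalized URL
if rp.can_fetch("MyCrawler", "http://example.com/private/page.html"):
    print("allowed to fetch")
else:
    print("disallowed by robots.txt")
```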

Eliminating already-visited URLs


• Checking if a URL has already been fetched
  - Before adding a new URL to the work pool
  - Needs to be very quick
  - Achieved by computing an MD5 hash function on the URL
• Exploiting spatio-temporal locality of access
  - Two-level hash function:
    - Most significant bits (say, 24) derived by hashing the host name plus port
    - Lower-order bits (say, 40) derived by hashing the path
  - Concatenated bits used as a key in a B-tree
• Qualifying URLs added to the frontier of the crawl
• Hash values added to the B-tree
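A sketch of the two-level key, assuming MD5 via hashlib and the 24/40-bit split mentioned above; a plain set stands in for the B-tree for brevity.

```python
import hashlib
from urllib.parse import urlsplit

def md5_bits(text, bits):
    """Top `bits` bits of the MD5 digest of `text`, as an integer."""
    digest = int(hashlib.md5(text.encode("utf-8")).hexdigest(), 16)
    return digest >> (128 - bits)

def url_key(url):
    parts = urlsplit(url)
    host_part = md5_bits(f"{parts.hostname}:{parts.port or 80}", 24)   # high 24 bits: host + port
    path_part = md5_bits(parts.path or "/", 40)                        # low 40 bits: path
    return (host_part << 40) | path_part   # keys of one server stay close together

seen = set()                # stand-in for the B-tree of already-visited keys
def is_new(url):
    key = url_key(url)
    if key in seen:
        return False
    seen.add(key)
    return True

print(is_new("http://example.com/a"))   # True on first sight
print(is_new("http://example.com/a"))   # False thereafter
```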

Spider traps
• Protecting from crashing on ill-formed HTML
  - E.g., a page with 68 kB of null characters
• Misleading sites
  - An indefinite number of pages dynamically generated by CGI scripts
  - Paths of arbitrary depth created using soft directory links and path remapping features in the HTTP server


Spider Traps: Solutions


• No automatic technique can be foolproof
• Check for URL length
• Guards
  - Prepare regular crawl statistics
  - Add sites that dominate the crawl to the guard module
• Disable crawling of active content such as CGI form queries
• Eliminate URLs with non-textual data types

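A small illustrative guard filter along these lines; the length limit and extension list are arbitrary assumptions, not values from the slides.

```python
from urllib.parse import urlsplit

MAX_URL_LENGTH = 256                     # assumed limit, tune from crawl statistics
NON_TEXTUAL = {".jpg", ".png", ".gif", ".zip", ".exe", ".mp3"}

def passes_guard(url):
    if len(url) > MAX_URL_LENGTH:        # overly long URLs often indicate a trap
        return False
    parts = urlsplit(url)
    if parts.query or "cgi-bin" in parts.path:   # active content such as CGI form queries
        return False
    if any(parts.path.lower().endswith(ext) for ext in NON_TEXTUAL):
        return False                     # non-textual data types
    return True

print(passes_guard("http://example.com/index.html"))        # True
print(passes_guard("http://example.com/cgi-bin/form?q=1"))  # False
```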

Avoiding repeated expansion of links on duplicate pages
• Reduce redundancy in crawls
• Duplicate detection
  - Mirrored Web pages and sites
• Detecting exact duplicates
  - Checking against MD5 digests of stored URLs
  - Representing a relative link v (relative to aliases u1 and u2) as tuples (h(u1), v) and (h(u2), v)
• Detecting near-duplicates
  - Even a single altered character will completely change the digest!
  - E.g., date of update / name and email of the site administrator
  - Solution: shingling
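Shingling is only named here, not described; as a reminder of the idea (not from the slides), one can compare the sets of overlapping word k-grams of two documents with the Jaccard coefficient.

```python
def shingles(text, k=4):
    """Set of overlapping k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def resemblance(doc_a, doc_b, k=4):
    """Jaccard overlap of the two shingle sets; close to 1.0 for near-duplicates."""
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    return len(a & b) / len(a | b)

page1 = "Crawlers fetch pages and scan them for outlinks. Last updated 2003-01-01."
page2 = "Crawlers fetch pages and scan them for outlinks. Last updated 2003-02-15."
print(round(resemblance(page1, page2), 2))   # high despite the changed date
```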

Load monitor

Keeps track of various system statistics:
• Recent performance of the wide area network (WAN) connection
  - E.g., latency and bandwidth estimates
• Operator-provided/estimated upper bound on open sockets for a crawler
• Current number of active sockets


Thread manager
• Responsible for:
  - Choosing units of work from the frontier
  - Scheduling the issue of network resources
  - Distribution of these requests over multiple ISPs, if appropriate
• Uses statistics from the load monitor


Per-server work queues


• Denial of service (DoS) attacks
  - Servers limit the speed or frequency of responses to any fixed client IP address
• Avoiding (being mistaken for) DoS
  - Limit the number of active requests to a given server IP address at any time
  - Maintain a queue of requests for each server
  - Use the HTTP/1.1 persistent socket capability
  - Distribute attention relatively evenly between a large number of sites
• Access locality vs. politeness dilemma


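A small sketch of per-server queues with a politeness limit; the single-request-at-a-time policy and delay value are illustrative assumptions.

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

class PerServerQueues:
    """Keep one FIFO queue per server and space out requests to the same
    server, so the crawler is not mistaken for a DoS attack."""

    def __init__(self, min_delay=2.0):          # assumed politeness delay (seconds)
        self.queues = defaultdict(deque)        # server -> pending URLs
        self.next_allowed = defaultdict(float)  # server -> earliest next fetch time
        self.min_delay = min_delay

    def add(self, url):
        self.queues[urlsplit(url).hostname].append(url)

    def next_url(self):
        """Return a URL whose server may currently be contacted."""
        now = time.monotonic()
        for server, queue in self.queues.items():
            if queue and now >= self.next_allowed[server]:
                self.next_allowed[server] = now + self.min_delay
                return queue.popleft()
        return None                             # every ready server is in its delay window

q = PerServerQueues()
q.add("http://example.com/a")
q.add("http://example.com/b")
q.add("http://example.org/x")
print(q.next_url())   # http://example.com/a
print(q.next_url())   # http://example.org/x (example.com is still in its delay window)
```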

Text repository
• Crawler's last task
  - Dumping fetched pages into a repository
• Decoupling the crawler from other functions is preferred for efficiency and reliability
• Page-related information stored in two parts
  - Meta-data
  - Page contents


Storage of page-related information


• Meta-data
  - Relational in nature
  - Usually managed by custom software to avoid relational database system overheads
    - The text index involves bulk updates
  - Includes fields like content-type, last-modified date, content-length, HTTP status code, etc.


Page contents storage


• A typical HTML Web page compresses to 2-4 kB (using zlib)
• File systems have a 4-8 kB file block size
  - Too large!
• Page storage managed by a custom storage manager
  - Simple access methods for the crawler to add pages
  - Subsequent programs (indexer, etc.) to retrieve documents

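A quick illustration of the compression step with zlib and a length-prefixed record; the record layout is an assumption for the sketch, not the storage manager's actual format.

```python
import zlib
import struct

def pack_record(url, html):
    """Compress the page and prepend simple length-prefixed fields."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"), 6)
    return struct.pack("!II", len(url_bytes), len(body)) + url_bytes + body

def unpack_record(record):
    url_len, body_len = struct.unpack("!II", record[:8])
    url = record[8:8 + url_len].decode("utf-8")
    html = zlib.decompress(record[8 + url_len:8 + url_len + body_len]).decode("utf-8")
    return url, html

page = "<html><body>" + "Mining the Web. " * 500 + "</body></html>"
rec = pack_record("http://example.com/", page)         # hypothetical URL
print(len(page), "->", len(rec), "bytes")              # compressed record is far smaller
print(unpack_record(rec)[0])
```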

Page Storage
• Small-scale systems
  - Repository fits within the disks of a single machine
  - Use a storage manager (e.g., Berkeley DB)
    - Manages disk-based databases within a single file
    - Configured as a hash table or B-tree if pages must be accessed by URL key
    - Configured as a sequential log of page records for ordered access
      - Suffices, since the indexer can handle pages in any order


Page Storage
• Large-scale systems
  - Repository distributed over a number of storage servers
  - Storage servers
    - Connected to the crawler through a fast local network (e.g., Ethernet)
    - Pages hashed by URL onto the storage servers
  - To handle 10 million pages (40 GB) per hour: `T3'-grade leased lines


Large-scale crawlers often use multiple ISPs and a bank of local storage servers to store the pages crawled.


Refreshing crawled pages


• A search engine's index should be fresh
• A Web-scale crawler never `completes' its job
• High variance in the rate of page changes
• "If-modified-since" request header in the HTTP protocol
  - Impractical for a crawler
• Solution
  - At the commencement of a new crawling round, estimate which pages have changed

Determining page changes


• "Expires" HTTP response header
  - For pages that come with an expiry date
• Otherwise, need to guess whether revisiting the page will yield a modified version
  - Maintain a score reflecting the probability that the page has been modified
  - Crawler fetches URLs in decreasing order of score
  - Assumption: the recent past predicts the future

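A toy sketch of fetching in decreasing order of an estimated change probability; the scoring rule (smoothed fraction of past visits on which the page had changed) is an assumption standing in for the estimators discussed next.

```python
import heapq

# (url, times_changed, times_checked) observed over previous crawl rounds
history = [
    ("http://example.com/news", 9, 10),    # changed on 9 of 10 past visits
    ("http://example.com/about", 1, 10),
    ("http://example.com/blog", 5, 10),
]

def change_score(times_changed, times_checked):
    # crude estimate of Pr(page modified); the recent past predicts the future
    return (times_changed + 1) / (times_checked + 2)

# fetch URLs in decreasing order of score (max-heap via negated scores)
heap = [(-change_score(c, n), url) for url, c, n in history]
heapq.heapify(heap)
while heap:
    neg_score, url = heapq.heappop(heap)
    print(f"{url}  score={-neg_score:.2f}")
```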

Estimating page change rates


• Brewington and Cybenko; Cho
  - Algorithms for maintaining a crawl in which most pages are fresher than a specified epoch
• Prerequisite
  - The average interval at which the crawler checks for changes is smaller than the inter-modification time of a page
• Small-scale intermediate crawler runs
  - To monitor fast-changing sites (e.g., current news, weather, etc.)
  - Intermediate indices patched into the master index

Putting together a crawler


• Reference implementation of the HTTP client protocol
  - w3c-libwww package
  - From the World Wide Web Consortium (http://www.w3c.org/)


Design of the core components: Crawler class
• To copy bytes from network sockets to storage media
• Three methods express the Crawler's contract with the user:
  1. Pushing a URL to be fetched to the Crawler (fetchPush)
  2. A termination callback handler (fetchDone), called with the same URL
  3. A method (start) which starts the Crawler's event loop
• Implementation of the Crawler class
  - Needs two helper classes called DNS and Fetch

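A much-simplified sketch of this contract in Python; the method names fetchPush, fetchDone, and start come from the slides, but the synchronous single-threaded body and the callback signature are assumptions standing in for the real event-driven implementation built on the DNS and Fetch helpers.

```python
import urllib.request
from collections import deque

class Crawler:
    """Toy crawler exposing the three-method contract described above."""

    def __init__(self):
        self.frontier = deque()

    def fetchPush(self, url):
        """Push a URL to be fetched to the Crawler."""
        self.frontier.append(url)

    def fetchDone(self, url, page, success):
        """Termination callback, called with the same URL once its fetch ends.
        Users override this to store the page and push newly found URLs."""
        print(url, "ok" if success else "failed", len(page), "bytes")

    def start(self):
        """Start the Crawler's event loop (here: a plain synchronous loop)."""
        while self.frontier:
            url = self.frontier.popleft()
            try:
                page = urllib.request.urlopen(url, timeout=10).read()
                self.fetchDone(url, page, success=True)
            except OSError:
                self.fetchDone(url, b"", success=False)

crawler = Crawler()
crawler.fetchPush("http://example.com/")   # hypothetical seed URL
crawler.start()
```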
