A Summary of The Anatomy of a Large-Scale Hypertextual Web Search Engine

Nitin M. Bhanu
This paper can be considered an important milestone in the evolution of the web as we see it today. It describes a prototype implementation of a large-scale search engine that makes heavy use of the link structure present in hypertext. The paper was published in the infancy of the modern search engines, most of which suffered from problems of reliability and efficiency, and Google was aimed at addressing those problems. The major design goals of Google were: improved search quality (at the time, only one of the top four commercial search engines could even find itself, i.e. return its own page in the top ten results for a query on its own name); bringing the engine into use by a reasonable number of people; building an architecture that can support novel research activities, since Google stores all of the web documents it crawls in compressed form; and setting up a Spacelab-like environment in which researchers can propose and perform experiments.
The system's features were designed to deliver high-precision results. Google uses the link structure of the web to compute a PageRank for each web page, and it makes extensive use of anchor text. PageRank proved to be an excellent way to prioritise the results of web keyword searches. PageRanks are determined by a simple iterative algorithm, and a page gets a high PageRank if many pages point to it.
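
As an illustration (this is not the paper's code), the iteration can be sketched in a few lines of Python; the tiny link graph and the fixed iteration count below are choices made for the example, while the damping factor of 0.85 is the value the paper says is usually used.

    # Minimal sketch of the iterative PageRank computation described in the
    # paper: PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A,
    # where C(T) is the number of links going out of T.
    def pagerank(links, d=0.85, iterations=50):
        # links maps each page to the list of pages it points to
        pages = set(links) | {p for targets in links.values() for p in targets}
        pr = {page: 1.0 for page in pages}          # start from a uniform guess
        for _ in range(iterations):
            new_pr = {}
            for page in pages:
                # rank contributed by every page that links to `page`
                incoming = sum(pr[src] / len(targets)
                               for src, targets in links.items() if page in targets)
                new_pr[page] = (1 - d) + d * incoming
            pr = new_pr
        return pr

    example_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank(example_graph))   # C, which two pages point to, ranks highest
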
Anchor text in Google is associated not only with the page the link appears on, but also with the page the link points to (a short illustrative sketch follows this paragraph). This feature enhances the quality of results and extends search coverage to non-text information. Google was coded entirely in C and C++ for efficiency; as an aside, recent estimates put Google's present-day code repository at around two billion lines of code with a staggering size of 86 TB. The engine was built to run on either the Solaris or Linux operating system.
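
To make the anchor-text idea above concrete, here is a small illustrative sketch (not the paper's code) of crediting link text to the page it points to rather than the page it appears on; the URLs, anchor words and data structure are invented for the example.

    # Anchor text is indexed under the *target* of the link, so a page can be
    # found by words it never contains itself (useful for images and other
    # non-text documents). The pages and anchors below are invented examples.
    from collections import defaultdict

    anchor_index = defaultdict(list)   # target URL -> anchor words pointing at it

    def record_link(source_url, target_url, anchor_text):
        anchor_index[target_url].extend(anchor_text.lower().split())

    record_link("http://example.org/reviews", "http://example.org/camera.jpg",
                "best digital camera photo")

    # The image page now matches the query "digital camera" even though the
    # image file itself contains no indexable text.
    print(anchor_index["http://example.org/camera.jpg"])
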
Web crawling is one of the most important and most fragile applications in Google. The crawler and the URLserver are written in Python. Each crawler was designed to keep around 300 connections open at once and maintains its own DNS cache to reduce the performance strain of name lookups. The URLserver sends out lists of URLs to be fetched; the crawlers fetch them and pass the pages to the store server, which compresses them and stores them in the repository. zlib was chosen as the compression technique as an appropriate trade-off between speed and compression ratio.
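
The following is only a rough, single-connection sketch of that fetch-compress-store loop, written in Python (the language the paper says the crawler and URLserver use); the URL list, repository file name and record format are invented for the example and ignore the real system's hundreds of concurrent connections and DNS caching.

    # Simplified sketch of the crawl loop: fetch each URL handed out by the
    # URLserver, compress the page with zlib (the speed/ratio trade-off the
    # paper mentions), and append it to a repository file.
    import zlib
    import urllib.request

    def crawl(urls, repository_path="repository.bin"):
        with open(repository_path, "ab") as repo:
            for url in urls:
                try:
                    with urllib.request.urlopen(url, timeout=10) as response:
                        page = response.read()
                except OSError:
                    continue                      # skip unreachable pages
                compressed = zlib.compress(page)
                header = f"{url}\t{len(compressed)}\n".encode()
                repo.write(header + compressed)   # simple length-prefixed record

    crawl(["http://example.com/"])
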
Indexing the web is the next stage, which involves parsing the documents, indexing them into barrels, and then sorting. Hits are central to the architecture: a hit records an occurrence of a word in a document, and hits are classified as fancy hits and plain hits. Fancy hits are occurrences in URLs, meta tags, titles and anchor text, whereas all other occurrences are plain hits. Plain hits keep track of capitalisation, font size and position, which can be important when judging the relative relevance of the searched word in a particular web document.
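
As a purely illustrative sketch of how compactly such hits can be stored, the following packs a plain hit into two bytes: one capitalisation bit, a three-bit relative font size and a twelve-bit word position, following the layout the paper describes; the function and parameter names are mine, not Google's.

    # Pack a plain hit into 16 bits: [capitalisation:1][font size:3][position:12].
    # Font size 7 is reserved to flag a fancy hit, so plain hits use 0-6.
    def encode_plain_hit(capitalised, font_size, position):
        assert 0 <= font_size < 7          # 7 would mark a fancy hit
        position = min(position, 4095)     # clamp positions that overflow 12 bits
        return (int(capitalised) << 15) | (font_size << 12) | position

    hit = encode_plain_hit(capitalised=True, font_size=3, position=42)
    print(f"{hit:#06x}")                   # the hit fits in two bytes
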
Searching, as far as Google is concerned, should compromise neither efficiency nor quality. The search procedure starts by parsing the query and converting its words into WordIDs. The engine then seeks to the start of the doclist for each word and scans the doclists until it finds a document that matches all of the search terms. Each matching document is given an IR score using a weighting scheme based on hit type and hit count. If the engine is in the short barrels and reaches the end of a doclist, it seeks to the start of that word's doclist in the full barrel and continues scanning; otherwise it simply keeps scanning the doclist. Finally, once these steps are over, the matched documents are sorted by rank. In multi-word searches, the proximity of the words is also taken into account.
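
As a highly simplified, single-word illustration of that scan-and-rank loop (the barrels, type weights and function names below are invented for the example and only mirror the structure summarised above; in the real system the IR score is further combined with PageRank):

    # Look the word up in the short (title/anchor) barrel first, fall back to
    # the full barrel, then rank each matching document by a weighted count of
    # its hits. Weights and barrel contents are invented for illustration.
    TYPE_WEIGHT = {"title": 10, "anchor": 8, "url": 6, "plain": 1}   # assumed

    def ir_score(hits):
        # hits maps a hit type to how many times it occurred in the document
        return sum(TYPE_WEIGHT.get(hit_type, 1) * count
                   for hit_type, count in hits.items())

    def search(word, lexicon, short_barrel, full_barrel):
        word_id = lexicon.get(word)                    # query word -> WordID
        if word_id is None:
            return []
        doclist = short_barrel.get(word_id) or full_barrel.get(word_id, {})
        ranked = [(ir_score(hits), doc_id) for doc_id, hits in doclist.items()]
        return sorted(ranked, reverse=True)            # best IR score first

    lexicon = {"hypertext": 7}
    short_barrel = {7: {"doc1": {"title": 1, "anchor": 2}}}
    full_barrel = {7: {"doc1": {"plain": 5}, "doc2": {"plain": 12}}}
    print(search("hypertext", lexicon, short_barrel, full_barrel))
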
The results and performance were markedly better than those of the existing search engines. PageRank, anchor text and word proximity (in the case of multi-word searches) proved fruitful in building a highly reliable and efficient search engine that could answer queries in under ten seconds. The authors predicted that, with some effort, their solutions would scale to commercial volumes, a prediction that has proved true over the following decade and a half.
