
Preface

This document provides an overview of the most important issues in
information retrieval using search engines. Not every topic is covered at the
same level of detail, and the topics that are covered are presented as simply
and understandably as possible. We focused mainly on what we considered to be
the most basic and important topics.

Web search engines are obviously a major topic, and we base our
coverage primarily on the technology we all use on the Web, and rather less on
the search engines that run on local area networks or on personal computers.

That being said, this document is divided into four major topics. The first
topic is the introduction, definition and types of search engines. The next topic
deals with the history of search engines. The third portion talks about how search
engines actually work to store and retrieve data. And finally the last portion
concludes the topic with the issues surrounding search engines.

Table of Contents

Preface
Search Engines and Information Retrieval
What is a Search Engine?
Types of search engines
History Of Search Engines
How Do Web Search Engines Work?

Search Engines and Information Retrieval
Information retrieval is a field concerned with the structure, analysis, organization,
storage, searching, and retrieval of information. Despite the huge advances in the
understanding and technology of search in the past 40 years, this definition is still
appropriate and accurate. Information retrieval is often abbreviated as IR. The primary
focus of the field since the 1950s has been on text and text documents. Web pages,
email, scholarly papers, books, and news stories are just a few of the many examples of
documents.

In addition to a range of media, information retrieval involves a range of tasks


and applications. The usual search scenario involves someone typing in a query to a
search engine and receiving answers from the search engine in the form of a ranked list of documents.

What is a Search Engine?


The good news about the Internet and its most visible component, the World
Wide Web, is that there are hundreds of millions of pages available, waiting to present
information on an amazing variety of topics. The bad news about the Internet is that
there are hundreds of millions of pages available, most of them titled according to the
whim of their author, almost all of them sitting on servers with cryptic names. When you
need to know about a particular subject, how do you know which pages to read? If
you're like most people, you visit an Internet search engine.

It’s possible to think of the internet as the world’s biggest library but instead of
books, its shelves contain billions of individual web pages. Imagine being in such a vast
library. It would take forever to find what you were looking for. Every library has an index
to help you track down the book you want. The internet has something similar in the
form of ‘search engines’.

Search engines are special websites that have indexed billions of pages - and
make it easy for you to find a website or page in an instant. Popular search engines
include Google, Yahoo!, Bing and Ask.

To get to a search engine you just need to go to your browser’s address bar and
type in the address of the search engine website, or you can use the search box that’s
usually found in the top right-hand corner of a browser.

Each search engine works in a similar way. If you go to a search engine’s


homepage, you’ll find a single box. You simply type whatever you want to search for
into that box.

The illustration below shows the top 5 search engines in the US from January 2014 to
January 2015.

As the illustration shows, Google holds the majority share at about 78%, while the
other 'major' search engines account for the remaining 22% or so of search engine usage.

So when we talk about search engines today, we can almost use the words
‘Google’ and ‘search engine’ interchangeably. In fact, the word ‘search’ in common
language has been replaced with 'Google': "I'll just Google the answer" or "I just
Googled it".

Types of search engines


The term "search engine" is often used generically to describe both crawler-
based search engines and human-powered directories. These two types of search
engines gather their listings in radically different ways.

Crawler-Based Search Engines


Crawler-based search engines, such as Google, create their listings
automatically. They "crawl" or "spider" the web, then people search through what they
have found.
If you change your web pages, crawler-based search engines eventually find these
changes, and that can affect how you are listed. Page titles, body copy and other
elements all play a role.
Human-Powered Directories
A human-powered directory, such as the Open Directory, depends on humans
for its listings. You submit a short description to the directory for your entire site, or editors
write one for sites they review. A search looks for matches only in the descriptions
submitted.
Changing your web pages has no effect on your listing. A good site, with good content,
might be more likely to get reviewed for free than a poor site.

"Hybrid Search Engines" Or Mixed Results


In the web's early days, it used to be that a search engine either presented
crawler-based results or human-powered listings. Today, it is extremely common for both
types of results to be presented. Usually, a hybrid search engine will favor one type of
listings over another. For example, MSN Search is more likely to present human-powered
listings from LookSmart. However, it does also present crawler-based results (as provided
by Inktomi), especially for more obscure queries.

Are search engines only concerned with the web?


The answer to that is simply no. As was pointed out in the preface, the term "search
engine" nowadays is almost always used to mean a web search engine. There are,
however, "non-web" search engines, such as:
• Enterprise search, which involves finding the required information in the huge
variety of computer files scattered across a corporate intranet.
• Desktop search, the personal version of enterprise search, where the
information sources are the files stored on an individual computer, including
email messages and web pages that have recently been browsed.
• Peer-to-peer search, which involves finding information in networks of nodes or
computers without any centralized control.
These are some of the other types of search that are not related to the Web.

History Of Search Engines
The history of search engines can be said to have started in 1990. The very first tool
used for searching on the Internet was Archie, created in 1990 by Alan Emtage.
The Archie database was made up of the file directories from hundreds of systems.
When you searched this database on the basis of a file's name, Archie could tell
you which directory paths on which systems held a copy of the file you wanted. Archie did
not index the contents of these sites. The Archie software periodically reached out to
all known openly available FTP sites, listed their files, and built a searchable index. The
commands to search Archie were UNIX commands, and it took some knowledge of
UNIX to use it to its full capability.

Later in A.D. 1991 Gopher came into the scene. Gopher was a menu system that
simplified locating and using Internet resources. Gopher was designed for distributing,
searching, and retrieving documents over the Internet. The rise of Gopher led to two
new search programs, Veronica and Jughead. Like Archie, they searched the file
names and titles stored in Gopher index systems.

Then came W3Catalog in 1993, one of the first search engines that
attempted to provide a general searchable catalog for WWW resources. Unlike later
search engines, such as Aliweb, which attempted to index the web by crawling over the
accessible content of web sites, W3Catalog exploited the fact that many high-quality,
manually maintained lists of web resources were already available.

It should be noted that many other search engines emerged during and after the 1990s.
Some of them are only mentioned by name below, while the major ones are described in
more detail in the sections that follow.
• Aliweb - (1993)
• JumpStation - (1993)
• Excite - (1995)
• Dogpile
• HotBot
• Teoma, Vivisimo - (1999-2000)
• Wikiseek, Guruji, Sproose and Blackle - (2006-2007)
• Powerset, Picollator, Viewzi - (2008)
• Cuil, LeapFish, Forestle, Valdo - (2008)
• Sperse, Yebol, Goby - (2009-2010)
• Exalead - (2011)

WebCrawler - (1994)
Brian Pinkerton, a CSE student at the University of Washington, started WebCrawler
in his spare time. At first, WebCrawler was a desktop application, not a Web service as it
is today. WebCrawler went live on the Web on April 20, 1994, with a database containing
pages from just over 4000 different Web sites, and it was the first Web search engine to
provide full text search.

The WebCrawler was unique in that it was the first web robot that was capable
of indexing every word on a web page, while other bots were storing a URL, a title and
at most 100 words.

MetaCrawler - (1995)
The concept of the meta-search engine came into existence, in which a single
interface provided search results generated by multiple search engines rather than by
a single search engine algorithm. Daniel Dreilinger at Colorado State University
developed SavvySearch, which let users search up to 20 different search engines at
once, as well as a number of directories.

AltaVista - (1995)
AltaVista was once one of the most popular search engines but its popularity
waned with the rise of Google. The two key participants who created the engine were
Louis Monier, who wrote the crawler, and Michael Burrows, who wrote the indexer.
AltaVista was backed by the most powerful computing server available. AltaVista was
the fastest search engine and could handle millions of hits a day without any
degradation.

One key change that came with AltaVista was the inclusion of natural
language search. Users could type in a phrase or a question and get an intelligent
response; for instance, "Where is London?" without getting a million-plus pages
referring to "where" and "is".

Ask Jeeves & Northern Light - (1996-1997)


Ask Jeeves (Ask) was a search engine founded in 1996 by Garrett Gruener and
David Warthen in Berkeley, California. The original idea behind AskJeeves was to allow
users to get answers to questions posed in everyday, natural language, as well as
traditional keyword searching. The current Ask.com still supports this, with added
support for math, dictionary, and conversion questions.

Google - (1998)
Google had its rise to success in large part due to a patented algorithm called
PageRank that helps rank web pages that match a given search string. Previous
keyword-based methods of ranking search results, used by many search engines, would
rank pages by how often the search terms occurred in the page, or how strongly
associated the search terms were within each resulting page. The PageRank algorithm
used by Google instead analyses human-generated links, assuming that web pages
linked from many important pages are themselves likely to be important.

Google's algorithm computes a recursive score for pages, based on the weighted
sum of the PageRanks of the pages linking to them. PageRank is thought to correlate
well with human concepts of importance. In addition to PageRank, Google over the
years has added many other secret criteria for determining the ranking of pages on
result lists, reported to be over 200 different indicators. The exact percentage of the
total of web pages that Google indexes is not known, as it is very hard to
actually calculate. Google not only indexes and caches web pages but also
takes "snapshots" of other file types, including PDF, Word documents, Excel
spreadsheets, Flash SWF, plain text files, and so on.

Yahoo! Search - (2004)


Yahoo! Search is a web search engine, owned by Yahoo! Inc. Originally, Yahoo!
Search started as a web directory of other websites, organized in a hierarchy, as
opposed to a searchable index of pages. In the late 1990s, Yahoo! evolved into a full-
fledged portal with a search interface.

In 2003, Yahoo! purchased Overture Services, Inc., which owned the AlltheWeb
and AltaVista search engines. Initially, even though Yahoo! owned multiple search
engines, it didn't use them on the main yahoo.com website, but kept using Google's
search engine for its results. Starting in 2003, Yahoo! Search became its own web
crawler-based search engine, with a reinvented crawler called Yahoo! Slurp. Yahoo!
Search combined the capabilities of all the search engine companies it had
acquired, together with its existing research, and put them into a single search engine.

Sogou, a Chinese search engine that can search text, images, music, and maps, was
launched on 4 August 2004.

Bing - (2009)
Bing (formerly Live Search, Windows Live Search, and MSN Search) is a web
search engine (advertised as a "decision engine") from Microsoft. Bing was unveiled by
Microsoft CEO Steve Ballmer on May 28, 2009 at the All Things Digital conference in San
Diego. It went fully online on June 3, 2009, with a preview version released on June 1,
2009. Notable changes include the listing of search suggestions as queries are entered
and a list of related searches (called "Explorer pane") based on semantic technology
from Powerset that Microsoft purchased in 2008.

How Do Web Search Engines Work?
Web crawling
There are differences in the ways various search engines work, but they all
perform three basic tasks:
• They search the Internet -- or select pieces of the Internet -- based on important
words.
• They keep an index of the words they find, and where they find them.
• They allow users to look for words or combinations of words found in that index.
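
To make these three tasks concrete, here is a minimal, hypothetical Python sketch: it "gathers" a few invented pages, indexes the words it finds, and lets a user look up combinations of words. The page contents and file names are made up purely for illustration.

# Minimal sketch of the three basic tasks: gather pages, index the words,
# and let users search that index. The pages are invented stand-ins for
# documents a real spider would fetch from the live Web.
from collections import defaultdict

pages = {
    "page1.html": "search engines index the words found on web pages",
    "page2.html": "a spider crawls the web by following links between pages",
    "page3.html": "users type a query and receive a ranked list of pages",
}

# Task 2: keep an index of the words and where they were found.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

# Task 3: allow users to look for words or combinations of words in that index.
def search(query):
    results = set(pages)
    for term in query.lower().split():
        results &= index.get(term, set())   # keep only pages containing every term
    return sorted(results)

print(search("web pages"))   # -> ['page1.html', 'page2.html']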

Since the most prominent search engine nowadays is Google, we are going to look at
how it and other similar search engines work.

Before a search engine can tell you where a file or document is, it must be
found. To find information on the hundreds of millions of Web pages that exist, a search
engine employs special software robots, called spiders, to build lists of the words found
on Web sites. When a spider is building its lists, the process is called Web crawling.

Search engines for the general web do not really search the World Wide Web
directly. Each one searches a database of web pages held on its own servers. Search
engine databases are selected and built by computer robots, the spiders mentioned above.

These "crawl" the web, finding pages for potential inclusion by following the links
in the pages they already have in their database (i.e., already "know about"). They
cannot think or type a URL or use judgment to "decide" to go look something up and
see what's on the web about it.

How does any spider start its travels over the Web? The usual starting points are
lists of heavily used servers and very popular pages. The spider will begin with a popular
site, indexing the words on its pages and following every link found within the site. In this
way, the spidering system quickly begins to travel, spreading out across the most widely
used portions of the Web.
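
As a rough illustration of this crawling loop, here is a minimal, hypothetical Python sketch. The seed URL and the page limit are placeholders, and a real spider would additionally respect robots.txt, throttle its requests, detect duplicate content, and fetch many pages in parallel.

# Minimal sketch of a breadth-first web crawler, assuming a single seed page.
# Real spiders obey robots.txt, throttle requests, and handle many failure modes.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    frontier = deque([seed])          # pages we still have to visit
    seen = {seed}                     # pages already queued, to avoid loops
    pages = {}                        # url -> raw HTML, the crawler's "database"
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue                  # skip pages that fail to download
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:     # follow every link found within the page
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages

# Example (hypothetical seed): crawl("https://example.com", max_pages=5)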

Google.com began as an academic search engine. In the paper that describes


how the system was built, Sergey Brin and Lawrence Page give an example of how
quickly their spiders can work. They built their initial system to use multiple spiders, usually
three at one time. Each spider could keep about 300 connections to Web pages open
at a time. At its peak performance, using four spiders, their system could crawl over 100
pages per second, generating around 600 kilobytes of data each second.
When the Google spider looked at a page, it took note of two things:
• The words within the page
• Where the words were found

The spider returns to the site on a regular basis, such as every month or two, to look for
changes.

Everything the spider finds goes into the second part of the search engine, the
index. The index, sometimes called the catalog, is like a giant book containing a copy
of every web page that the spider finds. If a web page changes, then this book is
updated with new information.
All crawler-based search engines have the same basic parts, but there are
differences in how these parts are tuned. That is why the same search on different
search engines often produces different results.

This diagram shows the basic representation of web crawling.

So, how do crawler-based search engines go about determining relevancy, when


confronted with hundreds of millions of web pages to sort through? They follow a set of
rules, known as an algorithm. Exactly how a particular search engine's algorithm works is
a closely-kept trade secret. However, all major search engines follow the general rules
below.

Location, Location, Location...and Frequency

One of the main rules in a ranking algorithm involves the location and frequency
of keywords on a web page. Call it the location/frequency method, for short.
Pages with the search terms appearing in the HTML title tag are often assumed to be
more relevant than others to the topic. Search engines will also check to see if the
search keywords appear near the top of a web page, such as in the headline or in the
first few paragraphs of text. They assume that any page relevant to the topic will
mention those words right from the beginning.
Frequency is the other major factor in how search engines determine relevancy. A
search engine will analyze how often keywords appear in relation to other words in a
web page. Those with a higher frequency are often deemed more relevant than other
web pages.
Search engines may also penalize pages or exclude them from the index, if they
detect search engine "spamming." An example is when a word is repeated hundreds of
times on a page, to increase the frequency and propel the page higher in the listings.
Search engines watch for common spamming methods in a variety of ways, including
following up on complaints from their users.
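
A toy version of this location/frequency method might look like the following Python sketch; the individual weights and the "near the top" cutoff are made-up values chosen purely for illustration, not what any real engine uses.

# Toy location/frequency scorer: terms in the title or near the top of the
# body count for more, and overall frequency also raises the score.
# The weights and the top_words cutoff are arbitrary, for illustration only.
def location_frequency_score(query, title, body, top_words=50):
    terms = query.lower().split()
    title_words = title.lower().split()
    body_words = body.lower().split()
    score = 0.0
    for term in terms:
        if term in title_words:
            score += 3.0                                # title matches weigh most
        if term in body_words[:top_words]:
            score += 2.0                                # matches near the top of the page
        if body_words:
            score += body_words.count(term) / len(body_words)   # plain frequency
    return score

# Example with made-up page content:
print(location_frequency_score("search engine",
                               "How a search engine works",
                               "A search engine builds an index of the web ..."))
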
Building the Organic Index
● For each page retrieved, the web crawlers extract the text.
– For each term in the text, add the page's ID (and optionally, its positions) to the list of
docs for that term.

In this way, the search engine records which pages each word appears in, and how often
and where it is mentioned, so that relevancy can later be measured from that information.
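
A minimal sketch of such an organic (inverted) index, assuming the crawler has already extracted plain text for each page, could look like this in Python; the two sample pages are invented.

# Build a simple positional inverted index: term -> {page_id: [positions]}.
# Assumes `pages` maps a page ID to the plain text the crawler extracted.
from collections import defaultdict

def build_index(pages):
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        for position, term in enumerate(text.lower().split()):
            index[term][page_id].append(position)     # record the page and the position
    return index

pages = {
    1: "search engines build an index of the web",
    2: "the index lists which pages contain each word",
}
index = build_index(pages)
print(dict(index["index"]))   # -> {1: [4], 2: [1]}
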
Meta Tags
Meta tags allow the owner of a page to specify key words and concepts under
which the page will be indexed. This can be helpful, especially in cases in which the
words on the page might have double or triple meanings. The meta tags can guide the
search engine in choosing which of the several possible meanings for these words is
correct.

Meta tags go in the page's "head" area, which starts at the opening HEAD tag and ends
at the closing /HEAD tag. A typical head contains a TITLE tag, then a META DESCRIPTION
tag, then a META KEYWORDS tag.
The meta keywords tag allows you to provide additional text for crawler-based search
engines to index along with your body copy.
The meta description tag is supported by all the major crawlers, to some degree.
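
As an illustration, the following hypothetical Python sketch pulls the title, meta description and meta keywords out of a page's head section, roughly the way a crawler might; the sample HTML (using the word "jaguar", which could mean either the animal or the car) is invented for the example.

# Extract the title, meta description and meta keywords from a page's <head>.
# The sample HTML below is invented purely for illustration.
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.meta = {}
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"].lower()] = attrs.get("content", "")
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

sample = """<html><head>
<title>Jaguar facts</title>
<meta name="description" content="Facts about the jaguar, the big cat.">
<meta name="keywords" content="jaguar, big cat, animal">
</head><body>...</body></html>"""

parser = MetaTagParser()
parser.feed(sample)
print(parser.title)                     # Jaguar facts
print(parser.meta.get("description"))   # Facts about the jaguar, the big cat.
print(parser.meta.get("keywords"))      # jaguar, big cat, animal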

ALT Text / Comments


Whether the ALT text associated with images, or the text placed in comment tags, gets
indexed varies from one search engine to another.

Matching the Search Query


The search query is everything that the user types to get results. It is made up of
one or more search terms, plus optional special characters.
Analyzing the Query
Query analysis generally has three phases:
1. Stop word removal - removing unnecessary words such as "the", "a", "what", "in", etc.
2. Stemming - reducing each word to its root form; for example, "playing" is
stemmed to "play".
3. Ranking.
After the analysis is done, the search engine takes the remaining query terms and tries
to match them against the indexed pages.
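
The sketch below shows how the first two phases might look in Python; the stop-word list and the crude suffix-stripping rules are simplified stand-ins for what real engines use, and ranking, the third phase, is discussed in the next section.

# Toy query analysis: stop-word removal followed by crude suffix stemming.
# The stop-word list and suffix rules are simplified stand-ins, not a real stemmer.
STOP_WORDS = {"the", "a", "an", "what", "in", "is", "of", "to"}
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]      # strip the first matching suffix
    return word

def analyze_query(query):
    terms = [w for w in query.lower().split() if w not in STOP_WORDS]
    return [stem(w) for w in terms]

print(analyze_query("what is the best playing strategy"))   # ['best', 'play', 'strategy']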

Ranking Organic Matches


This is a complex, active research area.
– The goal is to sort matching results from 'best' to 'worst'.
– Many factors contribute to different rankings in the various engines.
– Ranking functions are under continuous change.

Primary factors
– Text analysis: keyword density and prominence
• Also known as keyword weight.
• Generally refers to the relative frequency of a term on the page.
• Higher keyword density generally means that a document is more 'about' that
keyword.
• Multi-term queries also target keyword proximity: pages with the same terms adjacent,
in the same order, benefit most.
• Good places include:
– Title
– Headings
– Start of body
Terms in such places can get extra weight.

– Link analysis: page and site authority estimates

• A typical short query matches millions of pages, and many of them could even have
the same textual (relevance) weight from keyword density and prominence.
• Link analysis estimates the importance of each page, based on the link
structure around it.
• The more respected a site is, the more links point to it.

The best-known link analysis algorithm is Google's PageRank algorithm, published in 1998.
It is very well studied, and improvements are still being made to it today. It is designed so
that the authoritativeness of a page grows as more pages link to it, and as the pages that
link to it increase their own authority. The original algorithm is, however, no longer a
significant component of Google's ranking approach today.
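
To make the link-analysis idea concrete, here is a minimal, hypothetical PageRank sketch using the standard power-iteration formulation; the tiny link graph and the damping factor of 0.85 are common illustrative choices, not Google's actual implementation.

# Minimal PageRank by power iteration on a tiny, invented link graph.
# links[p] is the set of pages that p links out to.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with a uniform score
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p, outgoing in links.items():
            if not outgoing:                     # dangling page: share rank with everyone
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
            else:
                share = damping * rank[p] / len(outgoing)
                for q in outgoing:               # each outgoing link passes on a share
                    new_rank[q] += share
        rank = new_rank
    return rank

links = {
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"A"},
    "D": {"C"},     # D links to C, but nothing links to D
}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))    # C should come out as the most authoritative page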

