
Internet research

M. Vladoiu, C. Negoita
Why should anyone use Internet research?

• opportunity to gain an important advantage
over competitors
• a wealth of information on countless topics
• access to a wide variety of services:
vast information sources, electronic mail,
file transfer, interest group membership,
interactive collaboration, multimedia
displays, and more…
Issues to deal with
o While doing research on the Internet,
the searcher has to deal with:
• a large number of entries found
• the trustworthiness of information on the web
• the deep web
Large number of entries found
• the ability to reduce the number of entries
found and to locate the needed information
on the Internet depends on how precise
the queries are and how effectively one
uses search services
• poor queries return poor results;
good queries return great results
• there are very effective ways to "structure" a query
and to use special operators to target the
results you seek
Guidelines to good queries (1)
• use nouns and objects as query keywords –
actions (verbs), modifiers (adjectives, adverbs,
predicate subjects), and conjunctions are either
"thrown away" by the search engines or too
variable to be useful (e.g. planet or planets);
• use 6 to 8 keywords in a query - more
keywords, chosen at an appropriate level, can
reduce the universe of possible documents
returned by 99% or more;
• truncate words to pick up singular and plural
versions – use asterisk wildcard (e.g. planet*).
The wildcard tells the search engine to match all
characters after it, preserving keyword slots and
increasing coverage by 50% or more;
Guidelines to good queries (2)
• use synonyms via the OR operator - cover the likely
different ways a concept can be described; generally
avoid OR in other cases;
• combine keywords into phrases where possible - use
quotes to denote phrases ("solar system"). Phrases
restrict results to EXACT matches; when the terms combine
naturally, a phrase narrows and targets results many times over;
• combine 2 to 3 concepts in query - triangulating on
multiple query concepts narrows and targets results,
generally by more than 100-to-1 ("solar system", "new
planet*", discover* OR find);
• distinguish concepts with parentheses - nest each single query
"concept" in parentheses. This is a simple way to ensure that the
search engine evaluates your query the way you want,
from left to right - e.g. ("solar system") ("new planet*")
(discover* OR find);
Guidelines to good queries (3)
• order concepts with subject first - put main subject
first. Engines tend to rank documents more highly
that match first terms or phrases evaluated ("new
planet*") (discover* OR find) ("solar system");
• link concepts with the AND operator - AND glues the
query together. The resulting query is neither overly
complicated nor overly nested, and proper left-to-right
evaluation order is ensured: ("new planet*") AND
(discover* OR find) AND ("solar system");
• issue query to full Boolean search engine or
metasearcher - full-Boolean engines give you this
control; metasearchers increase Web coverage by
3- to 4-fold ("new planet*") AND (discover* OR find)
AND ("solar system")
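The guidelines above can be sketched as a small query builder. This is only an illustration: the helper names `concept` and `build_query` are hypothetical, and whether a given search engine honors quotes, the asterisk wildcard, parentheses, OR, and AND depends on that engine.

```python
def concept(*terms):
    """Group the synonyms for one concept.

    Multi-word terms are quoted as exact phrases, synonyms are joined
    with OR, and the whole concept is wrapped in parentheses.
    """
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(*concepts):
    """Link the concepts with AND, in the order given (main subject first)."""
    return " AND ".join(concepts)


query = build_query(
    concept("new planet*"),        # main subject first, truncated with a wildcard
    concept("discover*", "find"),  # synonyms for the same concept
    concept("solar system"),       # exact phrase
)
print(query)
# prints: ("new planet*") AND (discover* OR find) AND ("solar system")
```

The result reproduces the final example query from the guidelines: three concepts, subject first, synonyms grouped with OR, and everything glued together with AND.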
Trustworthy information on web (C)
• Credibility: trustworthy source, author's
credentials, evidence of quality control,
known or respected authority,
organizational support.
→ Goal: an authoritative source, a source
that supplies some good evidence that
allows you to trust it.
Trustworthy information on web (A)
• Accuracy: up to date, factual, detailed,
exact, comprehensive, audience and
purpose reflect intentions of completeness
and accuracy.
→ Goal: a source that is correct today (not
yesterday), a source that gives the whole
truth.
Trustworthy information on web (R)
• Reasonableness: fair, balanced,
objective, reasoned, no conflict of interest,
absence of fallacies or slanted tone.
→ Goal: a source that engages the subject
thoughtfully and reasonably, concerned
with the truth.
Trustworthy information on web (S)
• Support: listed sources, contact
information, available corroboration,
claims supported, documentation.
→ Goal: a source that provides convincing
evidence for the claims made, a source
you can triangulate (find at least two other
sources that support it).
Deep web
• searching on the Internet today can be
compared to dragging a net across the
surface of the ocean;
• while a great deal may be caught in the
net, there is still a wealth of information
that is deep, and therefore, missed;
• the reason is simple: most of the Web's
information is buried far down on
dynamically generated sites, and standard
search engines never find it.
• traditional search engines create their
indices by spidering/crawling surface Web pages;
• to be discovered, a page must be static
and linked to other pages;
• traditional search engines cannot "see" or
retrieve content in the deep Web - those
pages do not exist until they are created
dynamically as the result of a specific search.
• The Deep Web is qualitatively different from the
surface Web. Deep Web sources store their
content in searchable databases that only
produce results dynamically in response to a
direct request;
• public information on the deep Web is currently
400 to 600 times larger than the commonly
defined World Wide Web. The deep Web contains
9,500 terabytes of information compared to
around twenty terabytes of information in the
surface Web. More than half of the deep Web
content resides in topic-specific databases.
• a full 95% of the deep Web is publicly accessible
information - not subject to fees or subscriptions.
Total quality content of the deep Web is 1,000 to
2,000 times greater than that of the surface Web;
• a direct query is a "one at a time" laborious way to
search. BrightPlanet's search technology
automates the process of making dozens of direct
queries simultaneously using multiple-thread
technology and thus is the only search technology,
so far, that is capable of identifying, retrieving,
qualifying, classifying, and organizing both "deep"
and "surface" content.
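The idea of issuing many direct queries at once, rather than one at a time, can be sketched with a thread pool. This is a minimal illustration of the general technique, not BrightPlanet's implementation: the `SOURCES` dictionary and the functions `direct_query` and `search_all` are hypothetical stand-ins, and each "database" is a small in-memory list so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for searchable databases. A real source would be
# queried over the network through its own search form or interface.
SOURCES = {
    "patents": ["Planet detection apparatus", "Solar panel mount"],
    "library": ["Planets of the solar system", "A history of telescopes"],
    "news": ["Astronomers find new planet", "Local election results"],
}


def direct_query(source, keyword):
    """Send one direct query to one source and return (source, hits)."""
    hits = [doc for doc in SOURCES[source] if keyword.lower() in doc.lower()]
    return source, hits


def search_all(keyword):
    """Query every source concurrently, one worker thread per source."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = [pool.submit(direct_query, name, keyword) for name in SOURCES]
        return dict(future.result() for future in futures)


results = search_all("planet")
for source, hits in results.items():
    print(source, hits)
```

Each source receives its own query in its own thread, and the results are collected into one dictionary keyed by source. With real network sources the threads would spend most of their time waiting on responses, which is where the concurrency saves time.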
The searchable databases on the web can be classified in 12 categories:
1. Topic Databases - subject-specific aggregations of information, such as SEC corporate filings,
medical databases, patent records etc. (54% of deep Web sites are topic databases);
e.g. http://www.10kwizard.com/, http://www.uspto.gov/
2. Internal site - searchable databases for the internal pages of large sites that are dynamically
created, such as the knowledge base on the Microsoft site (13%); e.g.
3. Publications - searchable databases for current and archived articles (11%); e.g. http://www.
4. Shopping/Auction (5%);
e.g. http://www.flowerweb.nl/, http://www.locateaflowershop.com/
5. Classifieds (5%) e.g. www.canadaeast.com/
6. Portals - broader sites that include more than one of these other categories in searchable
databases (3%); e.g. www.searchindia.com
7. Library - searchable internal holdings, mostly for university libraries (2%);
e.g. www.lib.clemson.edu
8. Yellow and White Pages - people and business finders (2%); e.g. www.anywho.com
9. Calculators - while not strictly databases, many do include an internal data component for
calculating results. Mortgage calculators, dictionary look-ups, and translators between
languages are examples (2%); e.g. www.russiantranslation.ru
10. Jobs - job and resume postings (1%); e.g. http://www.medicsolve.com/
11. Message or Chat (1%); e.g. www.multidbexpress.com
12. General Search - searchable databases most often relevant to Internet search topics and
information (1%); e.g. www.cyndislist.com
• Deep Web sites tend to be narrower, with deeper
content, than conventional surface sites.
• To put these findings in perspective, consider that
even the search engines with the largest number of Web
pages indexed (such as Google) index no more than
sixteen percent of the surface Web.
• Since such search engines also miss the deep Web,
Internet searchers who rely on them are
searching only 0.03% - or one in 3,000 - of the pages
available to them today.
• Clearly, simultaneous searching of multiple surface and
deep Web sources is necessary when comprehensive
information retrieval is needed.
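The 0.03% figure can be checked with a few lines of arithmetic. The sketch below assumes a deep-to-surface ratio of 500, the midpoint of the 400 to 600 range quoted earlier, and assumes that this ratio applies to pages as well as to information volume; the variable names are illustrative.

```python
# Figures taken from the slides
surface_indexed = 0.16        # largest engines index at most 16% of the surface Web
deep_terabytes = 9500         # stated size of the deep Web
surface_terabytes = 20        # stated (approximate) size of the surface Web

# Assumption: midpoint of the quoted 400-600 range
deep_to_surface = 500

# Ratio implied by the terabyte figures; it falls inside the 400-600 range
implied_ratio = deep_terabytes / surface_terabytes

# Share of all content (surface + deep) reached through one large engine
share = surface_indexed / (1 + deep_to_surface)

print(f"implied deep/surface ratio: {implied_ratio:.0f}")
print(f"share searched: {share:.2%}")
print(f"roughly one page in {1 / share:,.0f}")
```

With these inputs the share comes out at about 0.03%, or roughly one page in three thousand, which agrees with the figure on the slide.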