
For many, Google is the internet.

It’s the starting point for finding new sites, and is arguably the
most important invention since the internet itself. Without search engines, new web content
would be inaccessible to the masses.

But do you know how search engines work? Every search engine has three main functions:
crawling (to discover content), indexing (to track and store content), and retrieval (to fetch
relevant content when users query the search engine).

Crawling
Crawling is where it all begins: the acquisition of data about a website.

This involves scanning sites and collecting details about each page: titles, images,
keywords, other linked pages, etc. Different crawlers may also look for different details, like
page layouts, where advertisements are placed, whether links are crammed in, etc.

But how is a website crawled? An automated bot (called a “spider”) visits page after page as
quickly as possible, using page links to find where to go next. Even in the earliest days, Google’s
spiders could read several hundred pages per second. Nowadays, it’s in the thousands.

When a web crawler visits a page, it collects every link on the page and adds them to its list of
next pages to visit. It goes to the next page in its list, collects the links on that page, and repeats.
Web crawlers also revisit past pages once in a while to see if any changes happened.

This means any site that’s linked from an indexed site will eventually be crawled. Some sites are
crawled more frequently, and some are crawled to greater depths, but sometimes a crawler may
give up if a site’s page hierarchy is too complex.
One way to understand how a web crawler works is to build one yourself. We’ve written a
tutorial on creating a basic web crawler in PHP, so check that out if you have any programming
experience.
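
That tutorial uses PHP, but the core loop is the same in any language. Here's a rough sketch of the visit, collect links, repeat cycle in Python, using the popular requests and beautifulsoup4 libraries. It's purely illustrative; a real crawler would also respect robots.txt, rate limits, and much more.

from collections import deque
from urllib.parse import urljoin

import requests                    # third-party: pip install requests
from bs4 import BeautifulSoup      # third-party: pip install beautifulsoup4

def crawl(start_url, max_pages=50):
    queue = deque([start_url])     # pages waiting to be visited
    seen = {start_url}             # URLs already discovered, so none is queued twice
    crawled = 0
    while queue and crawled < max_pages:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=5)
        except requests.RequestException:
            continue               # skip pages that fail to load
        crawled += 1
        soup = BeautifulSoup(response.text, "html.parser")
        title = soup.title.string if soup.title else "(no title)"
        print(f"Crawled {url}: {title}")
        # Collect every link on the page and queue the ones we haven't seen yet.
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)

crawl("https://example.com")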

Note that pages can be marked as “noindex,” which is like asking search engines to skip indexing them. Non-indexed parts of the internet are known as the “deep web,” and some sites, like those hosted on the TOR network, can’t be indexed by search engines at all. (What is TOR and onion routing?)
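
A crawler that honors this convention checks the page’s robots meta tag before handing it off for indexing. Here’s a minimal sketch of that check; the meta tag shown is the standard one, while the surrounding logic is just an illustration.

from bs4 import BeautifulSoup    # third-party: pip install beautifulsoup4

def should_index(html):
    # Look for the standard robots meta tag and skip pages that opt out.
    soup = BeautifulSoup(html, "html.parser")
    robots = soup.find("meta", attrs={"name": "robots"})
    if robots and "noindex" in robots.get("content", "").lower():
        return False             # the page asked search engines not to index it
    return True

page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(should_index(page))        # False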

Indexing
Indexing is when the data from a crawl is processed and placed in a database.

Imagine making a list of all the books you own, their publishers, their authors, their genres, their
page counts, etc. Crawling is when you comb through each book, while indexing is when you log
them in your list.

Now imagine it’s not just a room full of books, but every library in the world. That’s a
small-scale version of what Google does: it stores all of this data in vast data centers
with thousands of petabytes’ worth of drives.


[Image: a peek inside one of Google’s search data centers]
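
At its simplest, the database produced by indexing is an “inverted index”: a map from each word to the pages that contain it. Here’s a toy version in Python; real indexes also store word positions, frequencies, link data, and far more.

from collections import defaultdict

# Toy crawl results: each URL mapped to the text found on that page.
pages = {
    "https://example.com/cookies": "gluten free cookie recipe with oat flour",
    "https://example.com/bread":   "classic bread recipe with wheat flour",
}

# Build the inverted index: word -> set of pages containing that word.
index = defaultdict(set)
for url, text in pages.items():
    for word in text.lower().split():
        index[word].add(url)

print(index["flour"])    # both pages
print(index["cookie"])   # only the cookie page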

Retrieval and Ranking


Retrieval is when the search engine processes your search query and returns the most relevant
pages that match your query.
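
In its crudest form, retrieval is just a lookup against that inverted index: find the pages containing every word of the query, then rank them. A toy sketch, with a tiny hand-built index standing in for a real one:

# A tiny hand-built inverted index (word -> pages), standing in for the real thing.
index = {
    "cookie": {"https://example.com/cookies"},
    "recipe": {"https://example.com/cookies", "https://example.com/bread"},
    "flour":  {"https://example.com/cookies", "https://example.com/bread"},
}

def retrieve(query, index):
    # Return the pages that contain every word of the query.
    words = query.lower().split()
    if not words:
        return set()
    results = index.get(words[0], set()).copy()
    for word in words[1:]:
        results &= index.get(word, set())   # keep only pages matching every word
    return results

print(retrieve("cookie recipe", index))     # only the cookie page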

Most search engines differentiate themselves through their retrieval methods: they use different
criteria to pick and choose which pages fit best with what you want to find. That’s why search
results vary between Google and Bing, and why Wolfram Alpha is so uniquely useful.


Ranking algorithms check your search query against billions of pages to determine each one’s
relevance. Companies guard their ranking algorithms as closely held trade secrets, because a
better algorithm translates to a better search experience.

They also don’t want web creators to game the system and unfairly climb to the tops of search
results. If the internal methodology of a search engine ever got out, all kinds of people would
surely exploit that knowledge to the detriment of searchers like you and me.
Search engine exploitation is possible, of course, but isn’t so easy anymore.

Originally, search engines ranked sites by how often keywords appeared on a page, which led to
“keyword stuffing” — filling pages with keyword-heavy nonsense.
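
That early approach was about as naive as it sounds: count how often the query terms appear on a page and sort by the count, which is exactly why stuffing in extra keywords worked. A tiny illustration:

# Naive keyword-frequency ranking: score a page by how many times the query
# terms appear in it. Trivially gamed by keyword stuffing.
def keyword_score(query, page_text):
    words = page_text.lower().split()
    return sum(words.count(term) for term in query.lower().split())

honest_page  = "a simple gluten free cookie recipe"
stuffed_page = "cookie cookie cookie gluten free gluten free cookie recipe cookie"

print(keyword_score("gluten free cookie", honest_page))    # 3
print(keyword_score("gluten free cookie", stuffed_page))   # 9, so the junk page "wins"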

Then came the concept of link importance: search engines valued sites with lots of incoming
links because they interpreted site popularity as relevance. But this led to link spamming all over
the web. Nowadays, search engines weight links depending on the “authority” of the linking site.
Search engines put more value on links from a government agency than links from a link
directory.
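
The best-known formalization of this idea is PageRank, which Google’s founders published back in 1998: a page’s authority depends on the authority of the pages linking to it, computed iteratively over the link graph. Here’s a heavily simplified sketch on a made-up four-page web; it is not Google’s actual algorithm, which has evolved far beyond this.

# Simplified PageRank-style iteration: a page's score is fed by the scores of
# the pages linking to it, so a link from an "important" page counts for more.
links = {                      # page -> pages it links to (a made-up web graph)
    "gov-agency": ["news-site"],
    "news-site":  ["blog", "gov-agency"],
    "blog":       ["news-site"],
    "link-farm":  ["blog", "news-site", "gov-agency"],
}

damping = 0.85
pages = list(links)
rank = {page: 1.0 / len(pages) for page in pages}   # start everyone equal

for _ in range(50):                                  # iterate until scores settle
    new_rank = {}
    for page in pages:
        incoming = sum(rank[src] / len(links[src])
                       for src in pages if page in links[src])
        new_rank[page] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")

Notice that the link farm, which nobody links to, ends up with the lowest score no matter how many links it sends out.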

Today, ranking algorithms are shrouded in more mystery than ever before, and old-school
“search engine optimization” tricks matter far less. Good search engine rankings now come from
high-quality content and a great user experience.

What’s next for Search Engines?


Ah, now there’s an interesting question. The answer is “semantics”: the meaning of the page’s
content. You can read more about it in our overview of semantic markup and its future impact.

But here’s the gist of it.

Right now, you can search for “gluten-free cookies” but the results may not actually return
gluten-free cookie recipes. Instead, you might find regular cookie recipes that say “This recipe is
not gluten-free.” The page has the right keywords, but the wrong meaning.

With semantics, you can search for cookie recipes and then remove certain ingredients: flour,
nuts, etc. You can also narrow down results to only recipes with prep times less than 30 minutes
and review scores of 4/5 or greater. That would be cool, right? That’s where we’re heading.
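
In code terms, that kind of semantic search looks less like keyword matching and more like filtering structured data about each recipe. A hypothetical sketch, with recipes and fields invented purely for illustration:

# Hypothetical structured ("semantic") search: filter recipes by what they mean
# (ingredients, prep time, rating) rather than by matching keywords.
recipes = [
    {"name": "Oat flour cookies",  "ingredients": {"oat flour", "sugar", "butter"},
     "prep_minutes": 25, "rating": 4.6},
    {"name": "Classic cookies",    "ingredients": {"wheat flour", "sugar", "butter"},
     "prep_minutes": 20, "rating": 4.8},
    {"name": "Nut butter cookies", "ingredients": {"peanuts", "sugar", "eggs"},
     "prep_minutes": 35, "rating": 4.2},
]

excluded = {"wheat flour", "peanuts"}          # e.g. avoiding gluten and nuts

results = [r for r in recipes
           if not (r["ingredients"] & excluded)
           and r["prep_minutes"] < 30
           and r["rating"] >= 4.0]

for r in results:
    print(r["name"])                           # only "Oat flour cookies" qualifies

Making that work at web scale depends on pages publishing structured, machine-readable data about their content, which is exactly what semantic markup is for.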
