The “whole web crawl” breaks that down to its constituent steps; here’s one I did:
nutch inject crawl/crawldb seed
nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
nutch fetch2 $s1
nutch updatedb crawl/crawldb $s1
nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
nutch fetch2 $s2
nutch updatedb crawl/crawldb $s2
nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
nutch fetch2 $s3
nutch updatedb crawl/crawldb $s3
nutch invertlinks crawl/linkdb -dir crawl/segments
nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
[That's the same as the recrawl script from the wiki. --Kai]
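The three hand-unrolled rounds can also be wrapped in a small helper; here’s a sketch (the function name crawl_round is mine, and it assumes the crawldb has already been injected and that nutch is on the PATH):

```shell
#!/bin/sh
# One generate/fetch/update round, as in the manual steps above.
# crawl_round CRAWLDB SEGMENTS_DIR TOPN -- echoes the segment it processed.
crawl_round() {
  crawldb=$1; segdir=$2; topn=$3
  nutch generate "$crawldb" "$segdir" -topN "$topn" || return 1
  seg=`ls -d "$segdir"/2* | tail -1`     # newest segment, by timestamp name
  nutch fetch2 "$seg" || return 1
  nutch updatedb "$crawldb" "$seg" || return 1
  echo "$seg"
}
```

Three rounds then collapse to `for i in 1 2 3; do crawl_round crawl/crawldb crawl/segments 1000; done`, followed by the same invertlinks and index steps.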
Q: I have another question. I did what you suggested… It injects the
new URLs and “recrawls,” but unlike the first crawl it doesn’t
actually download and crawl the web pages… perhaps I’m going wrong
somewhere… Any idea?
Q: But the websites just added haven’t been crawled yet… And they’re not
crawled during the recrawl…
Will “bin/nutch purge” restart everything?
A: The command “bin/nutch purge” doesn’t exist. Well, I can’t tell you what is
happening. Send me the output from when you run the recrawl.
The above is the usage printed for “nutch inject” on the command line. And now from nutch-default.xml:
<property>
<name>db.default.fetch.interval</name>
<value>30</value>
<description>(DEPRECATED) The default number of days between re-fetches of a page.
</description>
</property>
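To change the interval, you would override the property in conf/nutch-site.xml rather than editing nutch-default.xml. The value below (10 days) is illustrative only; check the nutch-default.xml of your release for the property’s current name and unit, since this one is marked deprecated:

```xml
<!-- conf/nutch-site.xml: refetch pages every 10 days instead of 30.
     The unit is days for this (deprecated) property. -->
<property>
  <name>db.default.fetch.interval</name>
  <value>10</value>
  <description>Refetch pages every 10 days.</description>
</property>
```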
I crawl my intranet with a depth of 2. Later, I recrawl using the script found below:
http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03
[the standard script –Kai]
In my recrawl, I also specify a depth of 2. It reindexes each of the previously fetched pages and, if
they have changed, updates their content. If they have changed and contain new links, those links are
followed to a maximum depth of 2.
This is how I think a typical recrawl should work. However, when I recrawl using the script linked
to above, tons of new pages are indexed, whether they have changed or not. It seems as if I crawl
the content with a depth of 2, and then come back and recrawl with a depth of 2, it really adds a
couple of crawl depth levels and the outcome is that I have done a crawl with a depth of 4 (instead
of crawl with a depth of 2 and then just a recrawl to catch any new pages).
• invertlinks
• index
• dedup
• merge
Basically, what made me wonder is that the original crawl took 2 minutes, while the recrawl (same
depth specified) has taken over 3 hours and is still going. After I recrawl once, I believe it then
speeds up.
I don’t know if that guy ever fixed his problem. He was doing the same thing as I except that he started initially
with an “intranet crawl” and built on it (I deleted my initial “intranet crawl” and recrawled incrementally).
I’m not sure if repetition will help me, but here’s another description of how crawl works – “Re: How to recrawl
urls” (Dec 2005):
The scheme of intranet crawling is like this: first, you create a webdb using WebDBAdminTool.
After that, you inject a seed URL using WebDBInjector. The seed URL is inserted into your webdb,
stamped with the current date and time. Then you create a fetch list using FetchListTool.
FetchListTool reads all URLs in the webdb that are due to be crawled and puts them on the fetchlist.
Next, the Fetcher crawls all URLs in the fetchlist. Finally, once crawling is finished,
UpdateDatabaseTool extracts all outlinks and puts them into the webdb. Newly extracted outlinks
get their date and time set to the current date and time, while the just-crawled URLs have theirs
set 30 days into the future (these things actually happen in FetchListTool). So all extracted links
will be crawled the next time around, but not the just-crawled URLs. And so on and so forth.
Therefore, as long as the crawler is still alive after 30 days (or whatever threshold you set), all
“just-crawled” urls will be taken out and recrawled. That’s why we need to maintain a live crawler
over that period. This could be done using a cron job, I think.
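The cron job suggested there could be as simple as the entry below (the script path, schedule, and log file are placeholders, not anything from the thread):

```
# m h dom mon dow  command -- run the recrawl script nightly at 02:00
0 2 * * * /opt/nutch/bin/recrawl.sh >> /var/log/nutch-recrawl.log 2>&1
```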
Slightly further into the above thread, Stefan Groschupf suggests: “do the steps manually as described here:
SimpleMapReduceTutorial“; that tutorial, written by Earl Cahill in Oct 2005, has these steps (plus explanation):
cd nutch/branches/mapred
mkdir urls
echo "http://lucene.apache.org/nutch/" > urls/urls
perl -pi -e 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' conf/crawl-urlfilter.txt
./bin/nutch crawl urls
CRAWLDB=`find crawl-2* -name crawldb`
SEGMENTS_DIR=`find crawl-2* -maxdepth 1 -name segments`
./bin/nutch generate $CRAWLDB $SEGMENTS_DIR
SEGMENT=`find crawl-2*/segments/2* -maxdepth 0 | tail -1`
./bin/nutch fetch $SEGMENT
./bin/nutch updatedb $CRAWLDB $SEGMENT
LINKDB=`find crawl-2* -name linkdb -maxdepth 1`
SEGMENTS=`find crawl-2* -name segments -maxdepth 1`
./bin/nutch invertlinks $LINKDB $SEGMENTS
mkdir myindex
ls -alR myindex
Here’s a somewhat basic discussion on merging: “Problem with merge-output” (Jun 2007)
Q: After recrawling several times, I have a problem with the merge-output directory. I have dug into
the mail archive and found a clue: you should use a new directory name for the new merge, e.g. merge-
output_new, then mv merge-output_new to merge-output.
You might want to replace the second statement with a ‘mv’ statement to back up the segments.
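That rename dance can be captured in a small helper (safe_swap is my name for it; it keeps the previous directory around as a .old backup):

```shell
#!/bin/sh
# safe_swap NEWDIR CURDIR -- replace CURDIR with NEWDIR, keeping CURDIR.old
safe_swap() {
  new=$1; cur=$2
  [ -d "$new" ] || return 1             # merge produced nothing: bail out
  rm -rf "$cur.old"
  if [ -d "$cur" ]; then mv "$cur" "$cur.old"; fi
  mv "$new" "$cur"
}
```

For example (commands illustrative, adapt to your layout): merge into merge-output_new, then `safe_swap merge-output_new merge-output`.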
Here’s another: “Simple question about the merge tool” (Jul 2005):
Q: I have a simple question about how to use the merge tool. I’ve done three small crawls resulting
in three small segment directories. How can I merge these into one directory with one index? I
notice the merge command options:
I don’t really understand what it’s doing with the outputIndex and the segments. Will this
automatically delete segments after merging them into the output?
Each of the merge commands prints its usage when run without arguments; I was curious about them, so
here’s a console session detailing all four:
$ nutch | grep merg
mergedb merge crawldb-s, with optional filtering
mergesegs merge several segments, with optional filtering and slicing
mergelinkdb merge linkdb-s, with optional filtering
merge merge several segment indexes
$ nutch mergedb
Usage: CrawlDbMerger <output_crawldb> <crawldb1> [<crawldb2> <crawldb3> ...] [-normalize] [-filter]
output_crawldb output CrawlDb
crawldb1 ... input CrawlDb-s (single input CrawlDb is ok)
-normalize use URLNormalizer on urls in the crawldb(s) (usually not
needed)
-filter use URLFilters on urls in the crawldb(s)
$ nutch mergesegs
SegmentMerger output_dir (-dir segments | seg1 seg2 ...) [-filter] [-slice NNNN]
output_dir name of the parent dir for output segment slice(s)
-dir segments parent dir containing several segments
seg1 seg2 ... list of segment dirs
-filter filter out URL-s prohibited by current URLFilters
-slice NNNN create many output segments, each containing NNNN URLs
$ nutch mergelinkdb
Usage: LinkDbMerger <output_linkdb> <linkdb1> [<linkdb2> <linkdb3> ...] [-normalize] [-filter]
output_linkdb output LinkDb
linkdb1 ... input LinkDb-s (single input LinkDb is ok)
-normalize use URLNormalizer on both fromUrls and toUrls in
linkdb(s) (usually not needed)
-filter use URLFilters on both fromUrls and toUrls in linkdb(s)
$ nutch merge
Usage: IndexMerger [-workingdir <workingdir>] outputIndex indexesDir...
Ah: the nutch javadoc has some comments on each of the above classes:
CrawlDbMerger - “nutch mergedb” – see also mergedb wiki
org.apache.nutch.crawl.CrawlDbMerger (extends org.apache.hadoop.util.ToolBase; implements Configurable, Tool)
This tool merges several CrawlDb-s into one, optionally filtering URLs through the current
URLFilters, to skip prohibited pages.
It’s possible to use this tool just for filtering – in that case only one CrawlDb should be specified in
arguments.
If more than one CrawlDb contains information about the same URL, only the most recent version
is retained, as determined by the value of CrawlDatum.getFetchTime(). However, all
metadata information from all versions is accumulated, with newer values taking precedence over
older values.
Author:
Andrzej Bialecki
SegmentMerger – “nutch mergesegs”:
This tool takes several segments and merges their data together. Only the latest version of the data
is retained.
Also, it’s possible to slice the resulting segment into chunks of fixed size.
Important Notes
It doesn’t make sense to merge data from segments which are at different stages of processing (e.g.
one unfetched segment, one fetched but not parsed, and one fetched and parsed). Therefore, prior to
merging, the tool will determine the lowest common set of input data, and only this data will be
merged. This may have some unintended consequences: e.g. if the majority of input segments are
fetched and parsed, but one of them is unfetched, the tool will fall back to just merging fetchlists,
and it will skip all other data from all segments.
Merging fetchlists
Merging segments which contain just fetchlists (i.e. prior to fetching) is not recommended, because
this tool (unlike the Generator) doesn’t ensure that fetchlist parts for each map task are disjoint.
Duplicate content
Merging segments removes older content whenever possible (see below). However, this is NOT the
same as de-duplication, which in addition removes identical content found at different URL-s. In
other words, running DeleteDuplicates is still necessary.
For some types of data (especially ParseText) it’s not possible to determine which version is really
older. Therefore the tool always uses segment names as timestamps, for all types of input data.
Segment names are compared in forward lexicographic order (0-9a-zA-Z), and data from segments
with “higher” names will prevail. It follows then that it is extremely important that segments be
named in an increasing lexicographic order as their creation time increases.
Merged segment gets a different name. Since Indexer embeds segment names in indexes, any
indexes originally created for the input segments will NOT work with the merged segment. Newly
created merged segment(s) need to be indexed afresh. This tool doesn’t use existing indexes in any
way, so if you plan to merge segments you don’t have to index them prior to merging.
Author:
Andrzej Bialecki
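Putting that reindex-after-merge note into practice, the sequence might look like this sketch (the crawl/ layout and MERGEDsegments/NEWindexes names are assumptions of mine; nutch must be on the PATH):

```shell
#!/bin/sh
# Merge all segments into one, swap it in, then rebuild linkdb and index
# from scratch -- per the javadoc, the old indexes are useless afterwards.
merge_and_reindex() {
  base=$1                                    # e.g. "crawl"
  nutch mergesegs "$base/MERGEDsegments" -dir "$base/segments" || return 1
  rm -rf "$base/segments"
  mv "$base/MERGEDsegments" "$base/segments"
  nutch invertlinks "$base/linkdb" -dir "$base/segments" || return 1
  nutch index "$base/NEWindexes" "$base/crawldb" "$base/linkdb" \
        "$base"/segments/* || return 1
}
```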
LinkDbMerger – “nutch mergelinkdb”:
org.apache.nutch.crawl.LinkDbMerger (extends org.apache.hadoop.util.ToolBase; implements Reducer)
This tool merges several LinkDb-s into one, optionally filtering URLs through the current
URLFilters, to skip prohibited URLs and links.
It’s possible to use this tool just for filtering – in that case only one LinkDb should be specified in
arguments.
If more than one LinkDb contains information about the same URL, all inlinks are accumulated, but
only at most db.max.inlinks inlinks will ever be added.
If activated, URLFilters will be applied to both the target URLs and to any incoming link URL. If a
target URL is prohibited, all inlinks to that target will be removed, including the target URL. If
some of incoming links are prohibited, only they will be removed, and they won’t count when
checking the above-mentioned maximum limit.
Author:
Andrzej Bialecki
I wrote a post asking for clarification about the above four merge commands: “four nutch merge commands:
mergedb, mergesegs, mergelinkdb, merge” (Jul 2007).
Q: Naively: why are there four merge commands? Are some subsets of the others? Are they used in
conjunction? What are the usage scenarios of each?
A: mergedb: contrary to what its name suggests, it is used to merge crawldb-s. Think
of it as mergecrawldb.
merge: merges Lucene indexes. After an index job, you end up with an indexes directory
containing a bunch of part-<num> directories. The merge command takes such a directory
and produces a single index. A single index has better performance (I think). You could
say that merge is poorly named; it should have been called mergeindexes or something.
So none of them is a subset of another. They all have different purposes. It is kind of
confusing to have a “merge” command that only merges indexes, so perhaps we could add
a mergeindexes command, keep merge for some time (noting that it has been
deprecated), then remove it.
Q: It seems most of the nutch-user discussions I’ve seen so far relate to the simple merge command.
Are the first three “advanced commands”?
A: They serve different purposes – let’s assume that somehow you’ve got two crawldb-s,
e.g. you ran two crawls with different seed lists and different filters. Now you want to
take these collections of urls and create one big crawl. Then you would use mergedb to
merge the crawldb-s, mergelinkdb to merge the linkdb-s, and mergesegs to merge the segments.
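That two-crawls-into-one scenario could be scripted roughly like this (crawl1/, crawl2/, and merged/ are hypothetical directory names of mine, not from the thread):

```shell
#!/bin/sh
# merge_crawls OUT A B -- combine the crawldb-s, linkdb-s, and segments of
# two independent crawl directories A and B into one layout under OUT.
merge_crawls() {
  out=$1; a=$2; b=$3
  nutch mergedb "$out/crawldb" "$a/crawldb" "$b/crawldb" || return 1
  nutch mergelinkdb "$out/linkdb" "$a/linkdb" "$b/linkdb" || return 1
  nutch mergesegs "$out/segments" "$a"/segments/* "$b"/segments/* || return 1
}
```

Usage would be `merge_crawls merged crawl1 crawl2`, after which merged/ gets indexed afresh.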
“Incremental indexing” (June 2007) discusses the complex aspects of recrawling/merging rather clearly. It’s too
bad nobody on nutch-user replied to it.
As the size of my data keeps growing, and the indexing time grows even faster, I’m trying to switch
from a “reindex all at every crawl” model to an incremental indexing one. I intend to keep the
segments separate, but I want to index only the segment fetched during the last cycle, and then
merge indexes and perhaps linkdb. I have a few questions:
1. In an incremental scenario, how do I remove from the indexes references to segments that have
expired?
But when I do that, the merged data are left in a subdirectory called $index_dir/merge_output .
Shouldn’t I instead create a new empty destination directory, do the merge, and then replace the
original with the newly merged directory:
3. Regarding linkdb, does running “$nutch_dir/nutch invertlinks” on the latest segment only, and
then merging the newly obtained linkdb with the current one with “$nutch_dir/nutch mergelinkdb”,
make sense rather than recreating linkdb afresh from the whole set of segments every time? In other
words, can invertlinks work incrementally, or does it need to have a view of all segments in order to
work correctly?
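Mechanically, the incremental linkdb idea in question 3 would look something like the sketch below; whether it is semantically safe is exactly what the question asks, so treat this purely as the mechanics (the .new/.merged naming is mine):

```shell
#!/bin/sh
# incremental_linkdb LINKDB SEGMENT -- invert links for one new segment,
# merge the result into the existing linkdb, then swap the merged db in.
incremental_linkdb() {
  linkdb=$1; seg=$2
  nutch invertlinks "$linkdb.new" "$seg" || return 1
  nutch mergelinkdb "$linkdb.merged" "$linkdb" "$linkdb.new" || return 1
  rm -rf "$linkdb" "$linkdb.new"
  mv "$linkdb.merged" "$linkdb"
}
```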
Here’s a very current and rather complex question, with replies, titled “incremental growing index” (Jul 2007):
Q: Our crawler generates and fetches segments continuously. We’d like to index and merge each
new segment immediately (or with a small delay) such that our index grows incrementally. This is
unlike the normal situation where one would create a linkdb and an index of all segments at once,
after the crawl has finished. The problem we have is that Nutch currently needs the complete linkdb
and crawldb each time we want to index a single segment.
A: The reason for wanting the linkdb is the anchor information. If you don’t need any
anchor information, you can provide an empty linkdb.
The reason why crawldb is needed is to get the current page status information (which
may have changed in the meantime due to subsequent crawldb updates from newer
segments). If you don’t need this information, you can modify the Indexer.reduce()
method (~line 212) to allow for this, and then remove the line in Indexer.index() that
adds crawldb to the list of input paths.
Q: The Indexer map task processes all keys (urls) from the input files (linkdb, crawldb and
segment). This includes all data from the linkdb and crawldb that we actually don’t need since we
are only interested in the data that corresponds to the keys (urls) in our segment (this is filtered out
in the Indexer reduce task). Obviously, as the linkdb and crawldb grow, this becomes more and more
of a problem.
A: Is this really a problem for you now? Unless your segments are tiny, the indexing
process will be dominated by I/O from the processing of parseText / parseData and
Lucene operations.
Q: Any ideas on how to tackle this issue? Is it feasible to lookup the corresponding linkdb and
crawldb data for each key (url) in the segment before or during indexing?
A: It would be probably too slow, unless you made a copy of linkdb/crawldb on the
local FS-es of each node. But at this point the benefit of this change would be doubtful,
because of all the I/O you would need to do to prepare each task’s environment …
Q: Thanks Andrzej. Perhaps these numbers make our issue more clear:
- after a week of (internet) crawling, the crawldb contains about 22M documents.
- 6M documents are fetched, in 257 segments (topN = 25,000)
- size of the crawldb = 4,399 MB (22M docs, 0.2 kB/doc)
- size of the linkdb = 75,955 MB (22M docs, 3.5 kB/doc)
- size of a segment = somewhere between 100 and 500 MB (25K docs, 20 kB/doc (max))
As you can see, for a segment of 500 MB, more than 99% of the I/O during indexing is due to the
linkdb and crawldb. We could increase the size of our segments, but in the end this only delays the
problem. We are now indexing without the linkdb, which reduces the time needed by a factor of 10.
But we would really like to have the link texts back in again in the future.
Here’s a thread I started a couple weeks back: “Interrupting a nutch crawl — or use topN?” (Jun 2007):
I am running a nutch crawl of 19 sites. I wish to let this crawl go for about two days and then
gracefully stop it (I don’t expect it to complete by then). Is there a way to do this? I want it to
stop crawling and then build the lucene index. Note that I used a simple nutch crawl command, rather
than the “whole web” crawling methodology:
“You can limit the number of pages by using the -topN parameter. This limits the
number of pages fetched in each round. Pages are prioritized by how well-linked they
are. The maximum number of pages that can be
fetched is topN*depth.”
-topN N determines the maximum number of pages that will be retrieved at each
level up to the depth.
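So the upper bound is simple arithmetic; for example (the numbers are illustrative):

```shell
#!/bin/sh
# topN pages per round, one round per depth level:
depth=3
topN=1000
max_pages=`expr $depth \* $topN`
echo "at most $max_pages pages can be fetched"   # prints: at most 3000 pages can be fetched
# the corresponding one-shot crawl would be something like:
#   bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
```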
An excerpt:
There might be times when you would like to integrate Apache Nutch crawling with a
single Apache Solr index server – for example when your collection size is limited to the
number of documents that can be served by a single Solr instance, or you want to do your
updates on a “live” index. Using Solr as your indexing server might even ease your
maintenance burden quite a bit – you would get rid of manual index life-cycle
management in Nutch and let Solr handle your index.
I then issue a crawl of 10,000 URLs at a time, and just repeat the process for as long as the window
is available, because I use Solr to store the crawl results; it makes the index available during the
crawl window. But I’m a relative newbie as well, so I look forward to what the experts say.
I looked at Sami Siren’s script; it’s pretty much the same as what I did at the top of this blog, except his script
“will execute one iteration of fetching and indexing.” The script’s only real difference is that it
uses ’SolrIndexer’ (which you write yourself) rather than the normal Indexer class,
org.apache.nutch.indexer.Indexer (here’s the Indexer javadoc). I think I’m guessing correctly that Indexer is what
runs when you do “nutch index” from the command line. Just to beat a dead horse a bit more, here’s an excerpt
from Sami’s script:
bin/nutch inject $BASEDIR/crawldb urls
checkStatus
bin/nutch generate $BASEDIR/crawldb $BASEDIR/segments -topN $NUMDOCS
checkStatus
SEGMENT=`bin/hadoop dfs -ls $BASEDIR/segments|grep $BASEDIR|cut -f1|sort|tail -1`
echo processing segment $SEGMENT
bin/nutch fetch $SEGMENT -threads 20
checkStatus
bin/nutch updatedb $BASEDIR/crawldb $SEGMENT -filter
checkStatus
bin/nutch invertlinks $BASEDIR/linkdb $SEGMENT
checkStatus
bin/nutch org.apache.nutch.indexer.SolrIndexer $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT
checkStatus
checkStatus is just a short function in the script that looks to see if any errors were generated by
whatever command ran last. I also note that Sami is using a hadoop command that I don’t
understand; the NutchHadoopTutorial mentions ‘hadoop dfs’ … but I think I may be drifting off
topic.
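I don’t have Sami’s actual checkStatus, but from his description it is presumably something along these lines (my guess at its shape, not his code):

```shell
#!/bin/sh
# Abort the whole run if the previous command exited non-zero.
checkStatus() {
  status=$?
  if [ $status -ne 0 ]; then
    echo "last step failed with exit code $status, aborting" >&2
    exit $status
  fi
}
```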
Here is the other response I got to my post:
In the past Andrzej put some stuff related to your issue in the Jira. Try to
look it up there.
Found it http://issues.apache.org/jira/browse/NUTCH-368
In some cases it would be useful to be able to “signal” a job and its tasks
about some external condition, or to broadcast a specific message to all tasks
in a job. Currently we can only send a single pseudo-signal, that is to kill a
job.
This patch uses the message queueing framework to implement the following
functionality in Fetcher:
* ability to gracefully stop fetching the current segment. This is different from simply
killing the job in that the partial results (partially fetched segment) are available and can
be further processed. This is especially useful for fetching large segments with long
“tails”, i.e. pages which are fetched very slowly, either because of politeness settings or
the target site’s bandwidth limitations.
* ability to dynamically adjust the number of fetcher threads. For a long-running fetch
job it makes sense to decrease the number of fetcher threads during the day, and increase
it during the night. This can be done now with a cron script, using the MsgQueueTool
command-line.
It’s worthwhile to note that the patch itself is trivial, and most of the work is done by the
MQ framework.
After you apply this patch you can start a long-running fetcher job, check its <jobId>,
and control the fetcher this way:
This adjusts the number of threads to 50 (starting more threads or stopping some threads
as necessary).
Then run:
This will gracefully shut down all threads after they finish fetching their current url, and
finish the job, keeping the partial segment data intact.
Susam Pal has posted (Aug 2007) a new script to crawl with nutch 0.9:
#!/bin/sh
depth=2
threads=50
adddays=5
topN=2 # Comment this statement if you don't want to set topN value
# Parse arguments
if [ "$1" == "safe" ]
then
safe=yes
fi
if [ -z "$NUTCH_HOME" ]
then
NUTCH_HOME=.
echo runbot: $0 could not find environment variable NUTCH_HOME
echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi
if [ -z "$CATALINA_HOME" ]
then
CATALINA_HOME=/opt/apache-tomcat-6.0.10
echo runbot: $0 could not find environment variable CATALINA_HOME
echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi
if [ -n "$topN" ]
then
topN="-topN $topN"
else
topN=""
fi
steps=8
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls
if [ "$safe" != "yes" ]
then
rm -rf crawl/NEWindexes
fi
echo "----- Reloading index on the search site (Step 8 of $steps) -----"
if [ "$safe" != "yes" ]
then
touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
echo Done!
else
echo runbot: Can not reload index in safe mode.
echo runbot: Please reload it manually using the following command:
echo runbot: touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
fi
I am writing this blog in order to publicly document my exploration of the nutch crawler and get feedback about
what other folks have tried or discovered. I’ve already been using nutch for a few weeks so this blog doesn’t
start completely at the beginning for me, but I’ll try to be explanatory in how I write here. Like many open
source projects, nutch is poorly documented. This means that in order to find answers one has to make extensive
use of google plus comb the nutch forums: nutch-user and nutch-dev. (Those links are hosted at www.mail-archive.com;
they’re also hosted by www.nabble.com in a different format here: nutch-user and nutch-dev.) I’ve
found that people are pretty responsive on nutch-user. The nutch to-do list, bugs, and enhancements are
listed using JIRA software at issues.apache.org/jira/browse/Nutch.
Backdrop: I had latitude in making a choice of crawler/indexer, so in the beginning I read some general
literature such as “Crawling the Web” by Gautam Pant, Padmini Srinivasan, and Filippo Menczer. On
approaches to search the entertaining “Psychosomatic addict insane” (2007) discusses latent semantic indexing
and contextual network graphs. And let’s not forget spreading activation networks. Writing a crawler is not
easy so I looked at some java-based open source crawlers and started examining Heritrix. In a conversation
with Gordon Mohr of the internet archive I decided to go with nutch as he said Heritrix was more focused on
storing precise renditions of web pages and on storing multiple versions of the same page as it changes over
time. On the other hand, nutch just stores text, and it directly creates and accesses Lucene indexes whereas the
internet archive also has to use NutchWax to interact with Lucene.
The current version of nutch is 0.9; but rather than the main release I’m using one of the nightly builds that fixes
a bug I ran into (see the NUTCH-505 JIRA). The nightly build also has a more advanced RSS feed handler. But
I’m getting ahead of myself.
The best overall introductory article to nutch I’ve found so far is the following two-parter written by Tom White
in January of 2006. It has a brief overall description of nutch’s architecture, then delves into the specifics of
crawling a small example site; it tells how to set up nutch as well as tomcat, and what kind of sanity checks to
do on the results you get back.
• Introduction to Nutch, Part 1: Crawling
• Introduction to Nutch, Part 2: Searching.
On the architecture:
Nutch divides naturally into two pieces: the crawler and the searcher. The crawler fetches pages and
turns them into an inverted index, which the searcher uses to answer users’ search queries. The
interface between the two pieces is the index, so apart from an agreement about the fields in the
index, the two are highly decoupled. (Actually, it is a little more complicated than this, since the
page content is not stored in the index, so the searcher needs access to the segments [a collection of
pages fetched and indexed by the crawler in a single run] below in order to produce page summaries
and to provide access to cached pages.)
Another slide show (PDF) by Doug Cutting, ”Nutch, Open-Source Web Search“ shows the architecture:
Searcher: Given a query, it must quickly find a small relevant subset of a corpus of documents, then
present them. Finding the relevant subset is normally done with an inverted index of the corpus;
ranking within that set produces the most relevant documents, which then must be summarized for
display.
Indexer: Creates the inverted index from which the searcher extracts results. It uses Lucene for
storing indexes.
Web DB: Stores the document contents for indexing and later summarization by the searcher, along
with information such as the link structure of the document space and the time each document was
last fetched.
Fetcher: Requests web pages, parses them, and extracts links from them. Nutch’s robot has been
written entirely from scratch.
There is a lengthy video presentation (71 minutes) with Doug Cutting, sponsored by IIIS in Helsinki, 2006. It
has an associated PDF slide show entitled “Open Source Platforms for Search“. The introduction has a
philosophical discourse on open source software then gets down to a meaty technical discussion after about
eight minutes. For instance, Doug discusses that with a single person as administrator, nutch scales well up to
about 100 million documents. Beyond that, billions of pages are “operationally onerous”.
One of the more widely linked articles by Doug Cutting and Mike Cafarella is “Building Nutch: Open
Source Search” (printer friendly version). On page 3 they outline nutch’s operational costs–note that these $
estimates were done in early 2004:
A typical back-end machine is a single-processor box with 1 gigabyte of RAM, a RAID controller,
and eight hard drives. The filesystem is mirrored (RAID level 1) and provides 1 terabyte of reliable
storage. Such a machine can be assembled for a cost of about $3,000…. A typical front-end machine
is a single-processor box with 4 gigabytes of RAM and a single hard drive. Such a machine can be
assembled for about $1,000…. Note that as traffic increases, front-end hardware quickly becomes
the dominant hardware cost.
A 2007 paper from IBM Research entitled “Scalability of the Nutch Search Engine” explores some blade server
configurations and uses mathematical models to conclude that nutch can scale well past the base cases they
actually run. Note that the paper is about the index/search aspect of nutch rather than the crawling.
Search workloads behave well in a scale-out environment. The highly parallel nature of this
workload, combined with a fairly predictable behavior in terms of processor, network and storage
scalability, makes search a perfect candidate for scale-out. Scalability to thousands of nodes is well
within reach, based on our evaluation that combines measurement data and modeling.
Lucene is the searching/indexing component of nutch; one of the things that attracted me to nutch was that I
would be able to have an end-to-end, customizable package to implement search. And either lucene or nutch
can be used for the query processing; nutch just has a simpler query syntax: it is optimized for the most
common web queries so it doesn’t support OR queries, for instance. There are other crawlers, such as Heritrix
which is very robust and is used by the internet archive, and other indexers like Xapian, which is very
performant. ‘Archiving “Katrina” Lessons Learned‘ was a project that chose to use Heritrix and NutchWax. For
now I’m happy with nutch+lucene. The one book I found that has much to say about Lucene (and even it has
only minimal coverage of nutch) is Lucene in Action by Erik Hatcher and Otis Gospodnetic. I should also
mention that the book has thorough coverage of Luke, a tool that is useful for playing with lucene indexes. The
apache lucene mailing lists in searchable form are java-user and java-dev. The lucene FAQ is frequently
updated.