
Recrawling and merging

July 13, 2007 — nutch


As I mentioned in my introductory blog entry, I have already set up a working nutch installation and
crawled/indexed some documents.
Now I have a different question: how can I evolve a corpus over time? Basically I want to start with a group of seed URLs and do a nutch crawl. There are two methodologies I know of so far, and I’m not sure which one I want: an “intranet crawl” or a “whole web crawl”. The first uses the “nutch crawl” command:
Usage: Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]
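For concreteness, a single invocation of that first style might look like this, where “seed” is the directory of seed-URL files (the same one I inject below); the particular parameter values are just ones I might pick for a small test, not recommendations:

nutch crawl seed -dir crawl -threads 10 -depth 3 -topN 1000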

The “whole web crawl” breaks that down to its constituent steps; here’s one I did:
nutch inject crawl/crawldb seed
nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
nutch fetch2 $s1
nutch updatedb crawl/crawldb $s1
nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
nutch fetch2 $s2
nutch updatedb crawl/crawldb $s2
nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
nutch fetch2 $s3
nutch updatedb crawl/crawldb $s3
nutch invertlinks crawl/linkdb -dir crawl/segments
nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

The essence of the above is:


1. inject
2. loop on these:
   1. generate
   2. fetch2
   3. updatedb
3. invertlinks
4. index
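Rolled up, that cycle is just a loop. Here is a sketch of the same commands in loop form; three rounds and a topN of 1000 are simply the values I used above, not magic numbers:

nutch inject crawl/crawldb seed
for i in 1 2 3; do
  # each pass generates a fetchlist from the crawldb, fetches it, and folds the results back in
  nutch generate crawl/crawldb crawl/segments -topN 1000
  segment=`ls -d crawl/segments/2* | tail -1`
  nutch fetch2 $segment
  nutch updatedb crawl/crawldb $segment
done
nutch invertlinks crawl/linkdb -dir crawl/segments
nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*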
So far I’ve built a corpus of several thousand documents. How should I add to it?
To be clear, I am conflating two issues a bit. Recrawling and merging are separate operations. Recrawling goes through the existing pages and updates them; the wiki has a recrawl script, which unfortunately has not been updated for version 0.9 (whether it still works with 0.9 isn’t clear). Merging, on the other hand, combines two (usually? mostly?) disjoint sets of documents and their attendant indexes; for that the wiki details a MergeCrawl script, again only through version 0.8. Neither of these scripts is in the distribution; why is that?
Neither “recrawl” nor “merge” is mentioned in the nutch tutorial.
I did a search for “merge” on nutch-user; I also did a search for “recrawl”. Then I followed a few of the threads:
“Nutch Crawl Vs. Merge Time Complexity” (Mar 2006) asks:
I’m using Nutch v0.7 and I’ve been running nutch on our company unix system and it was setup to
crawl our intranet sites for updates daily, I’ve tried using the Merge, dedup, updatedb, and etc…I’d
notice the time complexity and efficiency was less productive than doing a fresh new crawl. For
example if I have two separate crawls from two different domains such as hotmail and yahoo, what
would the time complexity for nutch to crawl this two domains and then do a merge compare to just
doing a single full crawl of both domains? My guess would be that it will take nutch the same
amount of times to do either one, if that is so is there a reason to use the Merge at all?

“Incremental indexing” (Jun 2007) asks:


As the size of my data keeps growing, and the indexing time grows even
faster, I’m trying to switch from a “reindex all at every crawl” model to an
incremental indexing one. I intend to keep the segments separate, but I
want to index only the segment fetched during the last cycle, and then merge
indexes and perhaps linkdb. I have a few questions:

1. In an incremental scenario, how do I remove from the indexes references to segments that have expired??

2. Looking at http://wiki.apache.org/nutch/MergeCrawl , it would appear that I can call “bin/nutch merge” with only two parameters: the original index directory as destination, and the directory to be merged in the former:

$nutch_dir/nutch merge $index_dir $new_indexes

But when I do that, the merged data are left in a subdirectory called $index_dir/merge_output .
Shouldn’t I instead create a new empty destination directory, do the merge, and then replace the
original with the newly merged directory:

merged_indexes=$crawl_dir/merged_indexes
rm -rf $merged_indexes # just in case it's already there
$nutch_dir/nutch merge $merged_indexes $index_dir $new_indexes
rm -rf $index_dir.old # just in case it's already there
mv $index_dir $index_dir.old
mv $merged_indexes $index_dir
rm -rf $index_dir.old

3. Regarding linkdb, does running “$nutch_dir/nutch invertlinks” on the latest segment only, and
then merging the newly obtained linkdb with the current one with “$nutch_dir/nutch mergelinkdb”,
make sense rather than recreating linkdb afresh from the whole set of segments every time? In other
words, can invertlinks work incrementally, or does it need to have a view of all segments in order to
work correctly?

“Recrawl URLS” (Aug 2006) has a discussion between two people:


Q: I was searching for the method to add new url to the crawling url list
and how to recrawl all urls…

A: You could use the command bin/nutch inject $nutch-dir/db -urlfile urlfile.txt. To recrawl your WebDB you can use this script: http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

[That's the same as the recrawl script from the wiki. --Kai]

Take a look to the adddays argument and to the configuration property db.default.fetch.interval. They influence to the result.

Q: I have another question, I done what you give me… But it inject the
new urls and “recrawl” it, but against the first crawl It doesn’t
download the web pages and really crawl them… perhaps I’m mistaking
somewhere…Any idea ?

A: In the nutch conf/nutch-default.xml configuration file exist a property call db.default.fetch.interval. When you crawl a site, nutch schedules the next
fetch to “today + db.default.fetch.interval” days. If you execute the recrawl
command and the pages that you fetch don’t reach this date, they won’t be
re-fetched. When you add new urls to the webdb, they will be ready to be
fetch. So at this moment only this pages will be fetched by the recrawl
script.

Q: But the websites just added hasn’t been yet crawled… And they’re not
crawled during recrawl…
Does “bin/nutch purge” will restart all ?

A: This command “bin/nutch purge” doesn’t exist. Well I can’t say you what is
happening. Give me the output when you run the recrawl.

I found that a bit inconclusive. Points of interest:


$ nutch inject
Usage: Injector <crawldb> <url_dir>

The above is the usage printed for “nutch inject” on the command line. And now from nutch-default.xml:
<property>
<name>db.default.fetch.interval</name>
<value>30</value>
<description>(DEPRECATED) The default number of days between re-fetches of a page.
</description>
</property>

Ok, great, that’s deprecated. I really need some current documentation!
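In the meantime, the practical knob seems to be the -adddays argument to generate that the answer above mentions: it pretends the clock has moved forward, so pages whose fetch interval hasn’t elapsed yet become due. A minimal sketch, assuming the default 30-day interval is still in effect:

# pretend we are 31 days in the future, so everything fetched under the
# default 30-day interval is considered due for re-fetching
nutch generate crawl/crawldb crawl/segments -adddays 31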


“Recrawling… Methodology?” (Jul 2006) asks:
I need some help clarifying if recrawling is doing exactly what I think it is. Here’s the current
scenario of how I think a recrawl should work:

I crawl my intranet with a depth of 2. Later, I recrawl using the script found below:
http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03
[the standard script –Kai]

In my recrawl, I also specify a depth of 2. It reindexes each of the pages before, and if they have
changed update the pages content. If they have changed and new links exist, the links are followed
to a maximum depth of 2.

This is how I think a typical recrawl should work. However, when I recrawl using the script linked
to above, tons of new pages are indexed, whether they have changed or not. It seems as if I crawl
the content with a depth of 2, and then come back and recrawl with a depth of 2, it really adds a
couple of crawl depth levels and the outcome is that I have done a crawl with a depth of 4 (instead
of crawl with a depth of 2 and then just a recrawl to catch any new pages).

The current steps of the recrawl are as follows:


for (how many depth levels specified)
$nutch_dir/nutch generate $webdb_dir $segments_dir -adddays $adddays
segment=`ls -d $segments_dir/* | tail -1`
$nutch_dir/nutch fetch $segment
$nutch_dir/nutch updatedb $webdb_dir $segment

• invertlinks
• index
• dedup
• merge

Basically what made me wonder is that it took me 2 minutes to do the crawl. It’s taken me over 3
hours and still going to do the recrawl (same depth levels specified). After I recrawl once, I believe
it then speeds up.

I don’t know if that guy ever fixed his problem. He was doing the same thing as I was, except that he initially started with an “intranet crawl” and built on it (I deleted my initial “intranet crawl” and recrawled incrementally).
I’m not sure if repetition will help me, but here’s another description of how crawl works – “Re: How to recrawl
urls” (Dec 2005):
The scheme of intranet crawling is like this: Firstly, you create a webdb using WebDBAdminTool.
After that, you fetch a seed URL using WebDBInjector. The seed URL is inserted into your webdb,
marked by current date and time. Then, you create a fetch list using FetchListTool. The
FetchListTool read all URLs in the webdb which are due to crawl, and put them to the fetchlist.
Next, the Fetcher crawls all URLs in the fetchlist. Finally, once crawling is finished,
UpdateDatabaseTool extracts all outlinks and put them to webdb. Newly extracted outlinks are set
date and time to current date and time, while all just-crawled URLs date and time are set to next 30
days (these things happen actually in FetchListTool). So all extracted links will be crawled for the
next time, but not the just-crawled URLs. So on and soforth.

Therefore, once the crawler is still alive after 30 days (or the threshold that you set), all “just-
crawled” urls will be taken out to recrawl. That’s why we need to maintain a live crawler at that
time. This could be done using cron job, I think.
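“Maintain a live crawler” presumably just means running your recrawl script on a schedule. A crontab sketch; the path, script name, and schedule are all made up for illustration:

# run a recrawl every night at 3am; recrawl.sh stands in for whatever recrawl script you use
0 3 * * * cd /usr/local/nutch && ./bin/recrawl.sh >> /var/log/nutch-recrawl.log 2>&1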

Slightly further into the above thread, Stefan Groschupf suggests: “do the steps manually as described here:
SimpleMapReduceTutorial“; that tutorial, written by Earl Cahill in Oct 2005, has these steps (plus explanation):
cd nutch/branches/mapred
mkdir urls
echo "http://lucene.apache.org/nutch/" > urls/urls
perl -pi -e 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' conf/crawl-urlfilter.txt
./bin/nutch crawl urls
CRAWLDB=`find crawl-2* -name crawldb`
SEGMENTS_DIR=`find crawl-2* -maxdepth 1 -name segments`
./bin/nutch generate $CRAWLDB $SEGMENTS_DIR
SEGMENT=`find crawl-2*/segments/2* -maxdepth 0 | tail -1`
./bin/nutch fetch $SEGMENT
./bin/nutch updatedb $CRAWLDB $SEGMENT
LINKDB=`find crawl-2* -name linkdb -maxdepth 1`
SEGMENTS=`find crawl-2* -name segments -maxdepth 1`
./bin/nutch invertlinks $LINKDB $SEGMENTS
mkdir myindex
ls -alR myindex
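The excerpt stops just before an index is actually built; in current (0.9) terms the missing step would presumably be the usual index call, something like the following (my guess, reusing the variables the tutorial sets up; I write to a fresh “indexes” directory since I’m not sure the Indexer is happy with the pre-created myindex directory):

./bin/nutch index indexes $CRAWLDB $LINKDB $SEGMENT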

Here’s a somewhat basic discussion on merging: “Problem with merge-output” (Jun 2007)
Q: After recrawl several times, I have problem with the directory: merge-output. I have digged into
mail archive and found some clue: you should use a new dir name for the new merge, e.g., merge-
output_new, then mv merge-output_new to merge-output.

A: This is something I usually do:-

$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
rm -rf crawl/segments/*
mv crawl/MERGEDsegments/* crawl/segments

You might want to replace the second statement with a ‘mv’ statement to backup the
segments.
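Spelled out with the suggested backup, that becomes something like this (same directory names as above; only the rm is replaced):

$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
mv crawl/segments crawl/segments.bak   # keep the originals instead of rm -rf
mkdir crawl/segments
mv crawl/MERGEDsegments/* crawl/segments
rmdir crawl/MERGEDsegments
# delete crawl/segments.bak once the merged segment checks out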

Here’s another: “Simple question about the merge tool” (Jul 2005):
Q: I have a simple question about how to use the merge tool. I’ve done three small crawls resulting
in three small segment directories. How can I merge these into one directory with one index? I
notice the merge command options:

Usage: IndexMerger (-local | -ndfs <nameserver:port>) [-workingdir <workingdir>] outputIndex segments...

I don’t really understand what it’s doing with the outputIndex and the segments. Will this
automatically delete segments after merging them into the output?

A: Use the bin/nutch mergesegs to merge many segments into one.

I’m curious about the usage of all these merge commands. Here’s a console session detailing the usage each one reports when run with no arguments:
$ nutch | grep merg
mergedb merge crawldb-s, with optional filtering
mergesegs merge several segments, with optional filtering and slicing
mergelinkdb merge linkdb-s, with optional filtering
merge merge several segment indexes
$ nutch mergedb
Usage: CrawlDbMerger <output_crawldb> <crawldb1> [<crawldb2> <crawldb3> ...] [-normalize] [-filter]
output_crawldb output CrawlDb
crawldb1 ... input CrawlDb-s (single input CrawlDb is ok)
-normalize use URLNormalizer on urls in the crawldb(s) (usually not needed)
-filter use URLFilters on urls in the crawldb(s)
$ nutch mergesegs
SegmentMerger output_dir (-dir segments | seg1 seg2 ...) [-filter] [-slice NNNN]
output_dir name of the parent dir for output segment slice(s)
-dir segments parent dir containing several segments
seg1 seg2 ... list of segment dirs
-filter filter out URL-s prohibited by current URLFilters
-slice NNNN create many output segments, each containing NNNN URLs
$ nutch mergelinkdb
Usage: LinkDbMerger <output_linkdb> <linkdb1> [<linkdb2> <linkdb3> ...] [-normalize] [-filter]
output_linkdb output LinkDb
linkdb1 ... input LinkDb-s (single input LinkDb is ok)
-normalize use URLNormalizer on both fromUrls and toUrls in linkdb(s) (usually not needed)
-filter use URLFilters on both fromUrls and toUrls in linkdb(s)
$ nutch merge
Usage: IndexMerger [-workingdir <workingdir>] outputIndex indexesDir...

Ah: the nutch javadoc has some comments on each of the above classes:
CrawlDbMerger - “nutch mergedb” – see also mergedb wiki
org.apache.nutch.crawl
Class CrawlDbMerger

java.lang.Object
  org.apache.hadoop.util.ToolBase
    org.apache.nutch.crawl.CrawlDbMerger

All Implemented Interfaces:

Configurable, Tool

public class CrawlDbMerger extends ToolBase

This tool merges several CrawlDb-s into one, optionally filtering URLs through the current
URLFilters, to skip prohibited pages.

It’s possible to use this tool just for filtering – in that case only one CrawlDb should be specified in
arguments.

If more than one CrawlDb contains information about the same URL, only the most recent version
is retained, as determined by the value of CrawlDatum.getFetchTime(). However, all
metadata information from all versions is accumulated, with newer values taking precedence over
older values.

Author:

Andrzej Bialecki
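That filter-only use is easy to miss; presumably it is just a single input crawldb plus -filter, along these lines (the output name is made up):

# rewrite the crawldb through the current URLFilters, dropping prohibited URLs
nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter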

SegmentMerger - “nutch mergesegs” – see also mergesegs wiki


org.apache.nutch.segment
Class SegmentMerger

java.lang.Object
  org.apache.hadoop.conf.Configured
    org.apache.nutch.segment.SegmentMerger

All Implemented Interfaces:

Configurable, Closeable, JobConfigurable, Mapper, Reducer

public class SegmentMerger extends Configured implements Mapper, Reducer

This tool takes several segments and merges their data together. Only the latest versions of data is
retained.

Optionally, you can apply current URLFilters to remove prohibited URL-s.

Also, it’s possible to slice the resulting segment into chunks of fixed size.

Important Notes

Which parts are merged?

It doesn’t make sense to merge data from segments, which are at different stages of processing (e.g.
one unfetched segment, one fetched but not parsed, and one fetched and parsed). Therefore, prior to
merging, the tool will determine the lowest common set of input data, and only this data will be
merged. This may have some unintended consequences: e.g. if majority of input segments are
fetched and parsed, but one of them is unfetched, the tool will fall back to just merging fetchlists,
and it will skip all other data from all segments.

Merging fetchlists

Merging segments, which contain just fetchlists (i.e. prior to fetching) is not recommended, because this tool (unlike the Generator) doesn’t ensure that fetchlist parts for each map task are disjoint.

Duplicate content

Merging segments removes older content whenever possible (see below). However, this is NOT the
same as de-duplication, which in addition removes identical content found at different URL-s. In
other words, running DeleteDuplicates is still necessary.

For some types of data (especially ParseText) it’s not possible to determine which version is really
older. Therefore the tool always uses segment names as timestamps, for all types of input data.
Segment names are compared in forward lexicographic order (0-9a-zA-Z), and data from segments
with “higher” names will prevail. It follows then that it is extremely important that segments be
named in an increasing lexicographic order as their creation time increases.

Merging and indexes

Merged segment gets a different name. Since Indexer embeds segment names in indexes, any
indexes originally created for the input segments will NOT work with the merged segment. Newly
created merged segment(s) need to be indexed afresh. This tool doesn’t use existing indexes in any
way, so if you plan to merge segments you don’t have to index them prior to merging.

Author:

Andrzej Bialecki
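That last note is the one that bites in practice: after a mergesegs the old per-segment indexes are useless and the merged segment has to be indexed afresh. A sketch, using new output directories since I haven’t checked whether the tools will overwrite existing ones:

nutch mergesegs crawl/MERGEDsegments -dir crawl/segments
nutch invertlinks crawl/linkdb_new -dir crawl/MERGEDsegments
nutch index crawl/indexes_new crawl/crawldb crawl/linkdb_new crawl/MERGEDsegments/*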

LinkDbMerger - “nutch mergelinkdb” – see also mergelinkdb wiki


org.apache.nutch.crawl
Class LinkDbMerger

java.lang.Object
  org.apache.hadoop.util.ToolBase
    org.apache.nutch.crawl.LinkDbMerger

All Implemented Interfaces:

Configurable, Closeable, JobConfigurable, Reducer, Tool

public class LinkDbMerger extends ToolBase implements Reducer

This tool merges several LinkDb-s into one, optionally filtering URLs through the current
URLFilters, to skip prohibited URLs and links.

It’s possible to use this tool just for filtering – in that case only one LinkDb should be specified in
arguments.

If more than one LinkDb contains information about the same URL, all inlinks are accumulated, but
only at most db.max.inlinks inlinks will ever be added.

If activated, URLFilters will be applied to both the target URLs and to any incoming link URL. If a
target URL is prohibited, all inlinks to that target will be removed, including the target URL. If
some of incoming links are prohibited, only they will be removed, and they won’t count when
checking the above-mentioned maximum limit.

Author:

Andrzej Bialecki

IndexMerger – “nutch merge” – see also merge wiki


org.apache.nutch.indexer
Class IndexMerger

java.lang.Object
  org.apache.hadoop.util.ToolBase
    org.apache.nutch.indexer.IndexMerger

All Implemented Interfaces:

Configurable, Tool

public class IndexMerger extends ToolBase

IndexMerger creates an index for the output corresponding to a single fetcher run.

Author:

Doug Cutting, Mike Cafarella

I wrote a post asking for clarification about the above four merge commands: “four nutch merge commands:
mergedb, mergesegs, mergelinkdb, merge” (Jul 2007).
Q: Naively: why are there four merge commands? Are some subsets of the others? Are they used in
conjunction? What are the usage scenarios of each?

A: Each is used in a different scenario

mergedb: as its name does not imply, it is used to merge crawldb. So consider this
mergecrawldb

mergesegs: merges segments. It merges <segment>/{content, crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text} information from different segments.

merge: Merges lucene indexes. After a index job, you end up with a indexes directory
with a bunch of part-<num> directories inside. Command merge takes such a directory
and produces a single index. A single index has a better performance (I think). You can
say that merge is poorly named, it should have been called mergeindexes or something.

mergelinkdb: Should be obvious, merges linkdb-s.

So none of them is a subset of another. They all have different purposes. It is kind of
confusing to have a “merge” command that only merges indexes, so perhaps we can add
a mergeindexes command, keep merge for some time (noting that it has been
deprecated) then remove it.

Q: It seems most of the nutch-user discussions I’ve seen so far relate to the simple merge command.
Are the first three “advanced commands”?

A: They serve different purpose – let’s assume that somehow you’ve got two crawldb-s,
e.g. you ran two crawls with different seed lists and different filters. Now you want to
take these collections of urls and create a one big crawl. Then you would use mergedb to
merge crawldb-s, mergelinkdb to merge linkdb-s, and mergesegs to merge segments

And a simple “merge” merges indexes of multiple segments, which is a performance-related step in the regular Nutch work-cycle.
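Pulling that answer together: to fold two complete crawls (call them crawl1 and crawl2; the names and layout are my assumptions) into one searchable collection, I gather the sequence would be roughly:

nutch mergedb merged/crawldb crawl1/crawldb crawl2/crawldb
nutch mergelinkdb merged/linkdb crawl1/linkdb crawl2/linkdb
nutch mergesegs merged/segments crawl1/segments/* crawl2/segments/*
# the merged segments get new names, so they must be indexed afresh
nutch index merged/indexes merged/crawldb merged/linkdb merged/segments/*
nutch dedup merged/indexes
nutch merge merged/index merged/indexes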

“Incremental indexing” (June 2007), quoted in full above, discusses the complex aspects of recrawling/merging rather clearly. It’s too bad nobody on nutch-user replied to it.

Here’s a very current and rather complex question, with replies, titled “incremental growing index” (Jul 2007):
Q: Our crawler generates and fetches segments continuously. We’d like to index and merge each
new segment immediately (or with a small delay) such that our index grows incrementally. This is
unlike the normal situation where one would create a linkdb and an index of all segments at once,
after the crawl has finished. The problem we have is that Nutch currently needs the complete linkdb
and crawldb each time we want to index a single segment.

A: The reason for wanting the linkdb is the anchor information. If you don’t need any
anchor information, you can provide an empty linkdb.

The reason why crawldb is needed is to get the current page status information (which
may have changed in the meantime due to subsequent crawldb updates from newer
segments). If you don’t need this information, you can modify Indexer.reduce() (~line
212) method to allow for this, and then remove the line in Indexer.index() that adds
crawldb to the list of input paths.

Q: The Indexer map task processes all keys (urls) from the input files (linkdb, crawldb and
segment). This includes all data from the linkdb and crawldb that we actually don’t need since we
are only interested in the data that corresponds to the keys (urls) in our segment (this is filtered out
in the Indexer reduce task). Obviously, as the linkdb and crawldb grow, this becomes more and more
of a problem.

A: Is this really a problem for you now? Unless your segments are tiny, the indexing
process will be dominated by I/O from the processing of parseText / parseData and
Lucene operations.

Q: Any ideas on how to tackle this issue? Is it feasible to lookup the corresponding linkdb and
crawldb data for each key (url) in the segment before or during indexing?

A: It would be probably too slow, unless you made a copy of linkdb/crawldb on the
local FS-es of each node. But at this point the benefit of this change would be doubtful,
because of all the I/O you would need to do to prepare each task’s environment …

Q: Thanks Andrzej. Perhaps these numbers make our issue more clear:

- after a week of (internet) crawling, the crawldb contains about 22M documents.
- 6M documents are fetched, in 257 segments (topN = 25,000)
- size of the crawldb = 4,399 MB (22M docs, 0.2 kB/doc)
- size of the linkdb = 75,955 MB (22M docs, 3.5 kB/doc)
- size of a segment = somewhere between 100 and 500 MB (25K docs, 20 kB/doc
(max))

As you can see: for a segment of 500 MB, more than 99% of the IO during indexing is due to the
linkdb and crawldb. We could increase the size of our segments, but in the end this only delays the
problem. We are now indexing without the linkdb. This reduces the time needed by a factor 10. But
we would really like to have the link texts back in again in the future.

Here’s a thread I started a couple weeks back: “Interrupting a nutch crawl — or use topN?” (Jun 2007):
I am running a nutch crawl of 19 sites. I wish to let this crawl go for about two days then gracefully
stop it (I don’t expect it to complete by then). Is there a way to do this? I want it to stop crawling
then build the lucene
index. Note that I used a simple nutch crawl command, rather than the “whole web” crawling
methodology:

nutch crawl urls.txt -dir /usr/tmp/19sites -depth 10

Or is it better to use the -topN option?

Some documentation for topN:

“Re: How to terminate the crawl?”

“You can limit the number of pages by using the -topN parameter. This limits the
number of pages fetched in each round. Pages are prioritized by how well-linked they
are. The maximum number of pages that can be
fetched is topN*depth.”

Or from the tutorial:

-topN N determines the maximum number of pages that will be retrieved at each
level up to the depth.

For example, a typical call might be:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50


Typically one starts testing one’s configuration by crawling at shallow depths, sharply
limiting the number of pages fetched at each level (-topN), and watching the output to
check that desired pages are fetched and undesirable pages are not. Once one is
confident of the configuration, then an appropriate depth for a full crawl is around 10.
The number of pages per level (-topN) for a full crawl can be from tens of thousands to
millions, depending on your resources.
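Applying that arithmetic to my own command: with -depth 10 and no -topN there is no per-level cap, but adding one would bound the whole crawl, e.g. (5000 is an arbitrary figure of mine):

# at most topN * depth = 5000 * 10 = 50,000 pages can be fetched in total
nutch crawl urls.txt -dir /usr/tmp/19sites -depth 10 -topN 5000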

Here was one response to my question:


I use a iterative approach using a script similar to what Sami blogs about here:

Online indexing – integrating Nutch with Solr

excerpt:

There might be times when you would like to integrate Apache Nutch crawling with a
single Apache Solr index server – for example when your collection size is limited to
amount of documents that can be served by single Solr instance, or you like to do your
updates on “live” index. By using Solr as your indexing server might even ease up your
maintenance burden quite a bit – you would get rid of manual index life cycle
management in Nutch and let Solr handle your index.

I then issue a crawl of 10,000 URLs at a time, and just repeat the process for as long as the window
available. because I use solr to store the crawl results. It makes the index available during the crawl
window. But I’m a relative newbie as well, so look forward what the experts say.

I looked at Sami Siren’s script; it’s pretty much the same as what I did at the top of this blog, except his script “will execute one iteration of fetching and indexing.” The script’s only real difference is that it uses ’SolrIndexer’ (which you write yourself) rather than the normal Indexer class, org.apache.nutch.indexer.Indexer (here’s the Indexer javadoc). I think I’m right in guessing that Indexer is what runs when you do “nutch index” from the command line. Just to beat a dead horse a bit more, here’s an excerpt from Sami’s script:
bin/nutch inject $BASEDIR/crawldb urls
checkStatus
bin/nutch generate $BASEDIR/crawldb $BASEDIR/segments -topN $NUMDOCS
checkStatus
SEGMENT=`bin/hadoop dfs -ls $BASEDIR/segments|grep $BASEDIR|cut -f1|sort|tail -1`
echo processing segment $SEGMENT
bin/nutch fetch $SEGMENT -threads 20
checkStatus
bin/nutch updatedb $BASEDIR/crawldb $SEGMENT -filter
checkStatus
bin/nutch invertlinks $BASEDIR/linkdb $SEGMENT
checkStatus
bin/nutch org.apache.nutch.indexer.SolrIndexer $BASEDIR/crawldb $BASEDIR/linkdb $SEGMENT
checkStatus

checkStatus is just a short function in the script that looks to see if any errors were generated by
whatever command ran last. I also note that Sami is using a hadoop command that I don’t
understand; the NutchHadoopTutorial mentions ‘hadoop dfs’ … but I think I may be drifting off
topic.
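Back on topic: I haven’t pasted Sami’s checkStatus function, but from the description it must be roughly this shape (my reconstruction, not his actual code):

checkStatus() {
  status=$?                      # exit status of whatever command ran just before the call
  if [ $status -ne 0 ]; then
    echo "previous command failed with exit status $status, aborting"
    exit $status
  fi
}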
Here is the other response I got to my post:
In the past Andrzej put some stuff related to your issue in the Jira. Try to
look it up there.

Found it http://issues.apache.org/jira/browse/NUTCH-368

NUTCH-368: Message queueing system (Sep 2006)

This is an implementation of a filesystem-based message queueing system. The motivation for this functionality is explained in HADOOP-490

HADOOP-490: Add ability to send “signals” to jobs and tasks

In some cases it would be useful to be able to “signal” a job and its tasks
about some external condition, or to broadcast a specific message to all tasks
in a job. Currently we can only send a single pseudo-signal, that is to kill a
job.

This patch uses the message queueing framework to implement the following
functionality in Fetcher:

* ability to gracefully stop fetching the current segment. This is different from simply
killing the job in that the partial results (partially fetched segment) are available and can
be further processed. This is especially useful for fetching large segments with long
“tails”, i.e. pages which are fetched very slowly, either because of politeness settings or
the target site’s bandwidth limitations.

* ability to dynamicaly adjust the number of fetcher threads. For a long-running fetch
job it makes sense to decrease the number of fetcher threads during the day, and increase
it during the night. This can be done now with a cron script, using the MsgQueueTool
command-line.

It’s worthwhile to note that the patch itself is trivial, and most of the work is done by the
MQ framework.

After you apply this patch you can start a long-running fetcher job, check its <jobId>,
and control the fetcher this way:

bin/nutch org.apache.nutch.util.msg.MsgQueueTool -createMsg <job_id> ctrl THREADS 50

This adjusts the number of threads to 50 (starting more threads or stopping some threads
as necessary).

Then run:

bin/nutch org.apache.nutch.util.msg.MsgQueueTool -createMsg <job_id> ctrl HALT

This will gracefully shut down all threads after they finish fetching their current url, and
finish the job, keeping the partial segment data intact.
Susam Pal has posted (Aug 2007) a new script to crawl with nutch 0.9:
#!/bin/sh

# Runs the Nutch bot to crawl or re-crawl


# Usage: bin/runbot [safe]
# If executed in 'safe' mode, it doesn't delete the temporary
# directories generated during crawl. This might be helpful for
# analysis and recovery in case a crawl fails.
#
# Author: Susam Pal

depth=2
threads=50
adddays=5
topN=2 # Comment this statement if you don't want to set topN value

# Parse arguments
if [ "$1" == "safe" ]
then
safe=yes
fi

if [ -z "$NUTCH_HOME" ]
then
NUTCH_HOME=.
echo runbot: $0 could not find environment variable NUTCH_HOME
echo runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script
else
echo runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME
fi

if [ -z "$CATALINA_HOME" ]
then
CATALINA_HOME=/opt/apache-tomcat-6.0.10
echo runbot: $0 could not find environment variable CATALINA_HOME
echo runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script
else
echo runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME
fi

if [ -n "$topN" ]
then
topN="--topN $rank"
else
topN=""
fi

steps=8
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"


for((i=0; i < $depth; i++))
do
echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
$NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments $topN -adddays $adddays
if [ $? -ne 0 ]
then
echo "runbot: Stopping at depth $depth. No more URLs to fetch."
break
fi
segment=`ls -d crawl/segments/* | tail -1`

$NUTCH_HOME/bin/nutch fetch $segment -threads $threads


if [ $? -ne 0 ]
then
echo "runbot: fetch $segment at depth $depth failed. Deleting it."
rm -rf $segment
continue
fi

$NUTCH_HOME/bin/nutch updatedb crawl/crawldb $segment


done

echo "----- Merge Segments (Step 3 of $steps) -----"


$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*
if [ "$safe" != "yes" ]
then
rm -rf crawl/segments/*
else
mkdir crawl/FETCHEDsegments
mv --verbose crawl/segments/* crawl/FETCHEDsegments
fi

mv --verbose crawl/MERGEDsegments/* crawl/segments


rmdir crawl/MERGEDsegments

echo "----- Invert Links (Step 4 of $steps) -----"


$NUTCH_HOME/bin/nutch invertlinks crawl/linkdb crawl/segments/*

echo "----- Index (Step 5 of $steps) -----"


$NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*

echo "----- Dedup (Step 6 of $steps) -----"


$NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

echo "----- Merge Indexes (Step 7 of $steps) -----"


$NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes

if [ "$safe" != "yes" ]
then
rm -rf crawl/NEWindexes
fi

echo "----- Reloading index on the search site (Step 8 of $steps) -----"
if [ "$safe" != "yes" ]
then
touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
echo Done!
else
echo runbot: Can not reload index in safe mode.
echo runbot: Please reload it manually using the following command:
echo runbot: touch ${CATALINA_HOME}/webapps/ROOT/WEB-INF/web.xml
fi

echo "runbot: FINISHED: Crawl completed!"

Susam comments as follows:


I have written this script to crawl with Nutch 0.9. Though, I have tried to take care that this should
work for re-crawls as well, but I have never done any real world testing for re-crawls. I use this to
crawl. You may try this out. We can make some changes if this is not found to be appropriate for re-
crawls.


Introductory comments to this blog


July 13, 2007 — nutch
From wikipedia:
Nutch is an effort to build an open source search engine based on Lucene Java for the search
and index component.

I am writing this blog in order to publicly document my exploration of the nutch crawler and get feedback about
what other folks have tried or discovered. I’ve already been using nutch for a few weeks so this blog doesn’t
start completely at the beginning for me, but I’ll try to be explanatory in how I write here. Like many open
source projects, nutch is poorly documented. This means that in order to find answers one has to make extensive use of google and comb the nutch forums: nutch-user and nutch-dev. (Those links are hosted at www.mail-archive.com; they’re also hosted by www.nabble.com in a different format here: nutch-user and nutch-dev.) I’ve
found that people are pretty responsive on nutch-user. The nutch to-do list, bugs, and enhancements are
listed using JIRA software at issues.apache.org/jira/browse/Nutch.
Backdrop: I had latitude in making a choice of crawler/indexer, so in the beginning I read some general
literature such as “Crawling the Web” by Gautam Pant, Padmini Srinivasan, and Filippo Menczer. On
approaches to search the entertaining “Psychosomatic addict insane” (2007) discusses latent semantic indexing
and contextual network graphs. And let’s not forget spreading activation networks. Writing a crawler is not
easy so I looked at some java-based open source crawlers and started examining Heritrix. In a conversation
with Gordon Mohr of the internet archive I decided to go with nutch as he said Heritrix was more focused on
storing precise renditions of web pages and on storing multiple versions of the same page as it changes over
time. On the other hand, nutch just stores text, and it directly creates and accesses Lucene indexes whereas the
internet archive also has to use NutchWax to interact with Lucene.
The current version of nutch is 0.9; but rather than the main release I’m using one of the nightly builds that fixes
a bug I ran into (see the NUTCH-505 JIRA). The nightly build also has a more advanced RSS feed handler. But
I’m getting ahead of myself.
The best overall introductory article to nutch I’ve found so far is the following two-parter written by Tom White
in January of 2006. It has a brief overall description of nutch’s architecture, then delves into the specifics of
crawling a small example site; it tells how to set up nutch as well as tomcat, and what kind of sanity checks to
do on the results you get back.
• Introduction to Nutch, Part 1: Crawling
• Introduction to Nutch, Part 2: Searching.
On the architecture:
Nutch divides naturally into two pieces: the crawler and the searcher. The crawler fetches pages and
turns them into an inverted index, which the searcher uses to answer users’ search queries. The
interface between the two pieces is the index, so apart from an agreement about the fields in the
index, the two are highly decoupled. (Actually, it is a little more complicated than this, since the
page content is not stored in the index, so the searcher needs access to the segments [a collection of
pages fetched and indexed by the crawler in a single run] below in order to produce page summaries
and to provide access to cached pages.)

The nutch site itself has a few items of note:


• The Version 0.8 Tutorial – Like the Tom White article, this has a lot of nuts-and-bolts advice. I believe it
is current for version 0.9, though I can’t guarantee that.
• FAQ – As of this writing the FAQ has about 40 questions, divided into sections. Some of the sections I
found worthwhile were:
• Injecting
• Fetching
• Indexing
• Segment Handling
• Searching
• API doc – (sparse)
There is an article written by nutch author Doug Cutting as well as Rohit Khare, Kragen Sitaker, and Adam Rifkin. It has a clean description of nutch’s architecture and is entitled “Nutch: A Flexible and Scalable Open-Source Web Search Engine”.
Excerpt:
4.1 Crawling: An intranet or niche search engine might only take a single machine a few hours to
crawl, while a whole-web crawl might take many machines several weeks or longer. A single
crawling cycle consists of generating a fetchlist from the webdb, fetching those pages, parsing those
for links, then updating the webdb. In the terminology of [4], Nutch’s crawler supports both a crawl-
and-stop and crawl-and-stop-with-threshold (which requires feedback from scoring and specifying a
floor). It also uses a uniform refresh policy; all pages are refetched at the same interval (30 days, by
default) regardless of how frequently they change (there is no feedback loop yet, though the design of Page.java can set individual recrawl-deadlines on every page). The fetching process must also
respect bandwidth and other limitations of the target website. However, any polite solution requires
coordination before fetching; Nutch uses the most straightforward localization of references
possible: namely, making all fetches from a particular host run on one machine.

Another slide show (PDF) by Doug Cutting, ”Nutch, Open-Source Web Search“ shows the architecture:

Searcher: Given a query, it must quickly find a small relevant subset of a corpus of documents, then
present them. Finding a large relevant subset is normally done with an inverted index of the corpus;
ranking within that set to produce the most relevant documents, which then must be summarized for
display.
Indexer: Creates the inverted index from which the searcher extracts results. It uses Lucene for storing the indexes.

Web DB: Stores the document contents for indexing and later summarization by the searcher, along
with information such as the link structure of the document space and the time each document was
last fetched.

Fetcher: Requests web pages, parses them, and extracts links from them. Nutch’s robot has been
written entirely from scratch.

There is a lengthy video presentation (71 minutes) with Doug Cutting, sponsored by IIIS in Helsinki, 2006. It
has an associated PDF slide show entitled “Open Source Platforms for Search“. The introduction has a
philosophical discourse on open source software then gets down to a meaty technical discussion after about
eight minutes. For instance, Doug discusses that with a single person as administrator, nutch scales well up to
about 100 million documents. Beyond that, billions of pages are “operationally onerous”.
One of the more widely linked articles by Doug Cutting and Mike Cafarella is “Building Nutch: Open Source Search” (printer friendly version). On page 3 they outline nutch’s operational costs; note that these dollar estimates were done in early 2004:
A typical back-end machine is a single-processor box with 1 gigabyte of RAM, a RAID controller,
and eight hard drives. The filesystem is mirrored (RAID level 1) and provides 1 terabyte of reliable
storage. Such a machine can be assembled for a cost of about $3,000…. A typical front-end machine
is a single-processor box with 4 gigabytes of RAM and a single hard drive. Such a machine can be
assembled for about $1,000…. Note that as traffic increases, front-end hardware quickly becomes
the dominant hardware cost.

A 2007 paper from IBM Research entitled “Scalability of the Nutch Search Engine” explores some blade server
configurations and uses mathematical models to conclude that nutch can scale well past the base cases they
actually run. Note that the paper is about the index/search aspect of nutch rather than the crawling.
Search workloads behave well in a scale-out environment. The highly parallel nature of this
workload, combined with a fairly predictable behavior in terms of processor, network and storage
scalability, makes search a perfect candidate for scale-out. Scalability to thousands of nodes is well
within reach, based on our evaluation that combines measurement data and modeling.

Lucene is the searching/indexing component of nutch; one of the things that attracted me to nutch was that I
would be able to have an end-to-end, customizable package to implement search. And either lucene or nutch
can be used for the query processing; nutch just has a simpler query syntax: it is optimized for the most
common web queries so it doesn’t support OR queries, for instance. There are other crawlers, such as Heritrix
which is very robust and is used by the internet archive, and other indexers like Xapian, which is very
performant. ‘Archiving “Katrina” Lessons Learned‘ was a project that chose to use Heritrix and NutchWax. For
now I’m happy with nutch+lucene. The one book I found that has much to say about Lucene (and even it has
only minimal coverage of nutch) is Lucene in Action by Erik Hatcher and Otis Gospodnetic. I should also
mention that the book has thorough coverage of Luke, a tool that is useful for playing with lucene indexes. The
apache lucene mailing lists in searchable form are java-user and java-dev. The lucene FAQ is frequently
updated.
