Saturday, May 12, 2012

Nutch readseg

A sample readseg command:
bin/nutch readseg -dump crawl-test/segments/20110201114/ dump -nogenerate -noparse -noparsedata -noparsetext
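
In Nutch 1.x, readseg -dump writes six sections by default (content, fetch, generate, parse, parsedata, parsetext), and each one can be switched off with a matching -no<section> flag. As a rough sketch, the helper below builds the command from a list of sections to keep. The function name readseg_dump_cmd is made up for illustration and is not part of Nutch; it only prints the command, it does not run it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): print a readseg -dump command
# that keeps only the sections listed in the third argument.
readseg_dump_cmd() {
  segment=$1   # path to one segment directory
  outdir=$2    # directory the dump is written to
  keep=$3      # space-separated list of sections to keep
  cmd="bin/nutch readseg -dump $segment $outdir"
  for section in content fetch generate parse parsedata parsetext; do
    case " $keep " in
      *" $section "*) ;;                # listed: leave it in the dump
      *) cmd="$cmd -no$section" ;;      # not listed: suppress it
    esac
  done
  echo "$cmd"
}

# Keep only the raw content and the fetch status.
readseg_dump_cmd crawl-test/segments/20110201114/ dump "content fetch"
```

Keeping "content fetch" yields the flags -nogenerate -noparse -noparsedata -noparsetext, so the dump holds the raw page content and fetch status but none of the parsed output.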

Nutch and Solr:
The Nutch crawler is well suited to crawling unstructured content such as PDFs, Word documents and HTML pages. Solr does not crawl; it is the better choice for indexing structured data such as XML and database records, and it scales better for enterprise-level search.
To sum up: use Nutch to crawl and index unstructured data, use Solr for databases and structured data, then combine the two indexes and let Solr serve the search results.
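
A minimal sketch of that hand-off, assuming a Nutch 1.x install with the solrindex command and a Solr instance at http://localhost:8983/solr (both the URL and the crawl-test layout are assumptions here). The solrindex argument order has changed between Nutch 1.x releases, so check the usage message printed by bin/nutch solrindex before relying on it. The script only prints the two commands instead of running them.

```shell
#!/bin/sh
# Assumed values for illustration; adjust to your own setup.
SOLR_URL=http://localhost:8983/solr
CRAWL_DIR=crawl-test

# Step 1: push the crawled segments from Nutch into the Solr index.
index_cmd="bin/nutch solrindex $SOLR_URL $CRAWL_DIR/crawldb -linkdb $CRAWL_DIR/linkdb $CRAWL_DIR/segments/*"

# Step 2: serve search results from Solr (here, a simple query for "nutch").
query_cmd="curl '$SOLR_URL/select?q=nutch&wt=json'"

echo "$index_cmd"
echo "$query_cmd"
```

Structured sources such as databases would be loaded into the same Solr index separately, so a single Solr query covers both kinds of data.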
