Nutch 1.3


  • 7 June 2011 - Apache Nutch 1.3 Released



get nutch

  • extract to /opt/nutch-1.3
cd /opt/nutch-1.3


bin/nutch and nutch-1.3.job can be placed into Hadoop to integrate Nutch with it


cd /opt/nutch-1.3/runtime/local
  • bin/nutch (inject)
export JAVA_HOME="/usr/lib/jvm/java-6-sun"
  • conf/nutch-site.xml (inject)
  • conf/regex-urlfilter.txt (replace) (was conf/crawl-urlfilter.txt in 1.2)

# skip image and other suffixes we can't yet parse

# skip URLs containing certain characters as probable queries, etc.

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops

# accept anything else
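
These notes do not record what was added to conf/nutch-site.xml. One setting the Nutch fetcher requires is http.agent.name, so a minimal sketch of that step could look like the following. The function name write_nutch_site and the agent name "MyNutchSpider" are placeholders for illustration, not values from the original setup, and the function overwrites any existing nutch-site.xml in the target directory.

```shell
#!/bin/sh
# Sketch: write a minimal nutch-site.xml that sets only http.agent.name.
# write_nutch_site is a hypothetical helper; it overwrites the target file.
write_nutch_site() {
    conf_dir=$1      # e.g. /opt/nutch-1.3/runtime/local/conf
    agent_name=$2    # value for http.agent.name (placeholder, choose your own)
    cat > "$conf_dir/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$agent_name</value>
  </property>
</configuration>
EOF
}
```

Usage, assuming the layout above: write_nutch_site /opt/nutch-1.3/runtime/local/conf "MyNutchSpider"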

[setup solr]

  • extract to /opt/solr-3.3.0/
cd /opt/solr-3.3.0/
cp /opt/nutch-1.3/conf/schema.xml /opt/solr-3.3.0/example/solr/conf/
cd /opt/solr-3.3.0/example/
java -jar start.jar


mkdir urls ; echo "" >urls/url.txt
bin/nutch crawl urls -dir crawl2 -depth 2 -topN 50
  • The crawl produces only three directories:
    crawldb  linkdb  segments
  • Finally, send the Nutch crawl results to Solr:
bin/nutch solrindex http://localhost:8983/solr/ crawl2/crawldb crawl2/linkdb crawl2/segments/*
  • Use the Solr web admin to check the result.
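
Before running solrindex, it can help to confirm that the crawl directory really contains the three subdirectories listed above. The following is a small sketch of such a check; check_crawl_dir is a hypothetical helper name, not part of Nutch.

```shell
#!/bin/sh
# Sketch: verify that a crawl output directory has the expected layout
# (crawldb, linkdb, segments). check_crawl_dir is a hypothetical helper.
check_crawl_dir() {
    crawl_dir=$1
    for sub in crawldb linkdb segments; do
        if [ ! -d "$crawl_dir/$sub" ]; then
            # Report the first missing subdirectory and fail.
            echo "missing: $crawl_dir/$sub" >&2
            return 1
        fi
    done
    echo "ok: $crawl_dir"
}
```

Usage, from /opt/nutch-1.3/runtime/local: check_crawl_dir crawl2 && echo "ready for solrindex"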



bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
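
The one-step command above crawls and indexes into Solr in a single run. In the crawl command, -depth is the number of link levels to follow from the seed URLs and -topN is the maximum number of pages fetched per level. As a sketch, the command can be assembled from variables so the values are easy to change; crawl_cmd is a hypothetical helper that only prints the command and does not run it.

```shell
#!/bin/sh
# Sketch: build the one-step crawl command from parameters.
# crawl_cmd is a hypothetical helper; it prints the command for review.
crawl_cmd() {
    url_dir=$1    # directory holding the seed URL file(s)
    solr_url=$2   # Solr base URL to index into
    depth=$3      # link levels to follow from the seeds
    top_n=$4      # maximum pages fetched per level
    echo "bin/nutch crawl $url_dir -solr $solr_url -depth $depth -topN $top_n"
}
```

Usage: crawl_cmd urls http://localhost:8983/solr/ 3 5 prints the command shown above; run the printed line from /opt/nutch-1.3/runtime/local once it looks right.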


  • Q1 : Where is the "war" file, or how do I build it?
  • A1 :
    The simple answer here is no.
    Both the web app and the Lucene index which previously shipped with Nutch have
    been deprecated.
    Please have a look at the new tutorial [1] and the site for more
    information on the new functionality and features which ship with Nutch 1.3.
