Version 14 (modified by waue, 13 years ago)
Nutch 1.3
[intro]
- 7 June 2011 - Apache Nutch 1.3 Released
[setup]
get
- download the Apache Nutch 1.3 release
- extract to /opt/nutch-1.3
- build with ant:
cd /opt/nutch-1.3
ant
deploy
- bin/nutch and nutch-1.3.job can be copied into your Hadoop installation to integrate Nutch with Hadoop
local
cd /opt/nutch-1.3/runtime/local
- bin/nutch (inject the line below)
export JAVA_HOME="/usr/lib/jvm/java-6-sun"
- conf/nutch-site.xml (inject the properties below)
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>waue_test</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>http.robots.agents</name>
    <value>nutch</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>waue_test</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>waue_test</value>
  </property>
  <property>
    <name>http.agent.version</name>
    <value>waue_test</value>
  </property>
</configuration>
- conf/regex-urlfilter.txt (replace its contents; in Nutch 1.2 this file was conf/crawl-urlfilter.txt)
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[*!]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.
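The effect of these rules can be sanity-checked outside Nutch. The sketch below is only an illustration (the sample URLs and the shortened extension list are assumptions, not part of the Nutch distribution); it emulates the three active skip rules with grep -Ev:

```shell
# Emulate the skip rules above: each grep -Ev drops URLs matching one rule.
# Sample URLs and the shortened extension list are illustrative assumptions.
urls='http://lucene.apache.org/nutch/
ftp://example.com/file.tar
http://example.com/logo.png
http://example.com/search?q=*'

kept=$(printf '%s\n' "$urls" \
  | grep -Ev '^(file|ftp|mailto):' \
  | grep -Ev '\.(gif|jpg|png|ico|css|zip|exe)$' \
  | grep -Ev '[*!]')
echo "$kept"
```

Only the first URL survives: the ftp:// link is dropped by the scheme rule, the .png by the suffix rule, and the query URL by the `[*!]` rule.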
[setup solr]
- extract to /opt/solr-3.3.0/
cd /opt/solr-3.3.0/
cp /opt/nutch-1.3/conf/schema.xml /opt/solr-3.3.0/example/solr/conf/
cd /opt/solr-3.3.0/example/
java -jar start.jar
[execute]
mkdir urls
echo "http://lucene.apache.org/nutch/" > urls/url.txt
bin/nutch crawl urls -dir crawl -depth 2 -topN 50
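For reference, -depth is the number of fetch rounds and -topN caps how many top-scoring URLs are fetched in each round, so the values above fetch at most depth × topN pages. A trivial sketch of that bound (the per-round cap is my reading of the options, stated here as an assumption):

```shell
# Upper bound on pages fetched by the crawl command above
# (assumption: -topN caps each of the -depth rounds).
depth=2
topN=50
max_fetch=$((depth * topN))
echo "$max_fetch"
```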
- you will get only 3 directories:
crawldb  linkdb  segments
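After the crawl finishes, the layout can be verified with a quick loop. This sketch creates the three directories itself so it runs standalone; against a real crawl you would drop the mkdir line (crawl/ matches the output directory used by the solrindex step below):

```shell
# Sketch: check that the crawl produced the expected layout.
# The mkdir simulates a finished crawl so the loop is runnable standalone.
mkdir -p crawl/crawldb crawl/linkdb crawl/segments
ok=0
for d in crawldb linkdb segments; do
  [ -d "crawl/$d" ] && ok=$((ok + 1))
done
echo "$ok directories present"
```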
- finally, index the Nutch crawl results into Solr
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
- use the Solr web admin interface to check the results
http://localhost:8983/solr/admin/
FAQ
- Q1 : where is the "war" file, or how do I build it ?
- A1 :
The simple answer is that there is none. Both the web app and the Lucene index which previously shipped with Nutch have been deprecated. Please have a look at the new tutorial [1] and the site for more information on the new functionality and features which ship with Nutch 1.3. [1] http://wiki.apache.org/nutch/RunningNutchAndSolr