Version 1 (modified by shunfa, 12 years ago)
Nutch 1.5 + Solr 3.6.1
Download
Steps
0. Prerequisite environment setup
Install Java and confirm the environment variables:
$ vim ~/.bashrc
Add the following line (or the path of another Java version):
export JAVA_HOME=/usr/lib/jvm/java-6-sun/
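A minimal sketch of this step, assuming the example Java path above. To keep the sketch safe to try, it appends to a scratch profile file instead of your real `~/.bashrc`, then sources it the same way:

```shell
# Append JAVA_HOME to a scratch profile (stand-in for ~/.bashrc) and
# reload it; the path is the example from this guide -- adjust it to
# your installed JDK.
profile=/tmp/bashrc.example
echo 'export JAVA_HOME=/usr/lib/jvm/java-6-sun/' >> "$profile"
. "$profile"                      # same effect as: source ~/.bashrc
echo "JAVA_HOME=$JAVA_HOME"
```

With the real `~/.bashrc`, the variable also takes effect automatically in every new shell.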
1. Nutch setup
Extract the Nutch package:
$ tar zxvf apache-nutch-1.5-bin.tar.gz
- The extracted directory is referred to below as _[$NUTCH_HOME]_
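One way to script this step is to keep the extracted path in a shell variable so later commands can write "$NUTCH_HOME" instead of a literal path; this is an assumed convention, not something the release requires. The first two lines fabricate a tiny stand-in archive so the sketch runs end to end; with the real release tarball, skip them.

```shell
# Stand-in archive (skip these two lines when you have the real tarball).
cd /tmp && mkdir -p apache-nutch-1.5/bin && touch apache-nutch-1.5/bin/nutch
tar zcf apache-nutch-1.5-bin.tar.gz apache-nutch-1.5 && rm -r apache-nutch-1.5

# Extract the release and record its location -- [$NUTCH_HOME] in this guide.
tar zxvf apache-nutch-1.5-bin.tar.gz
export NUTCH_HOME="$PWD/apache-nutch-1.5"
echo "$NUTCH_HOME"
```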
Confirm that Nutch can be executed:
- Run the following command:
$ [$NUTCH_HOME]/bin/nutch
- Result:
Usage: nutch COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  solrindex         run the solr indexer on parsed segments and linkdb
  solrdedup         remove duplicates from solr
  solrclean         remove HTTP 301 and 404 documents from solr
  parsechecker      check the parser for a given url
  indexchecker      check the indexing filters for a given url
  domainstats       calculate domain statistics from crawldb
  webgraph          generate a web graph from existing segments
  linkrank          run a link analysis program on the generated web graph
  scoreupdater      updates the crawldb with linkrank scores
  nodedumper        dumps the web graph's node scores
  plugin            load a plugin and run one of its classes main()
  junit             runs the given JUnit test
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
- If the listing above appears, the environment is OK!
Set the crawler's agent name:
$ vim [$NUTCH_HOME]/conf/nutch-site.xml
- Add the following property:
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
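The same edit can be done non-interactively, which is handy for scripted installs. This is a sketch with an assumed placeholder path for [$NUTCH_HOME]; a freshly extracted nutch-site.xml is essentially an empty <configuration> element, which the heredoc recreates around the new property.

```shell
# Write the agent-name property into nutch-site.xml without an editor.
# NUTCH_HOME is a placeholder here -- point it at your real directory.
NUTCH_HOME=/tmp/nutch-example
mkdir -p "$NUTCH_HOME/conf"
cat > "$NUTCH_HOME/conf/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
EOF
grep http.agent.name "$NUTCH_HOME/conf/nutch-site.xml"
```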
Set the URLs to crawl:
- Create the seed list (this example crawls http://www.nchc.org.tw/tw/):
$ mkdir -p [$NUTCH_HOME]/urls
$ echo "http://www.nchc.org.tw/tw/" >> [$NUTCH_HOME]/urls/seed.txt
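seed.txt takes one URL per line, so further start pages can be appended the same way. A runnable sketch of the step, with an assumed placeholder path for [$NUTCH_HOME]:

```shell
# Create the seed list; one URL per line. NUTCH_HOME is a placeholder.
NUTCH_HOME=/tmp/nutch-example
mkdir -p "$NUTCH_HOME/urls"
echo "http://www.nchc.org.tw/tw/" > "$NUTCH_HOME/urls/seed.txt"
cat "$NUTCH_HOME/urls/seed.txt"
```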
Set the URL filter:
$ vim [$NUTCH_HOME]/conf/regex-urlfilter.txt
- Replace the original rules with the following:
# accept anything else
+.
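This replacement can also be scripted. Note that "+." accepts every URL the crawler encounters, whereas the stock regex-urlfilter.txt is more restrictive, so this widens the crawl considerably. A sketch with an assumed placeholder path:

```shell
# Replace the default filter rules non-interactively.
# NUTCH_HOME is a placeholder -- point it at your real directory.
NUTCH_HOME=/tmp/nutch-example
mkdir -p "$NUTCH_HOME/conf"
cat > "$NUTCH_HOME/conf/regex-urlfilter.txt" <<'EOF'
# accept anything else
+.
EOF
cat "$NUTCH_HOME/conf/regex-urlfilter.txt"
```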
Run the crawl from the command line:
- Depth of 3 levels, fetching at most five documents per level:
$ [$NUTCH_HOME]/bin/nutch crawl urls -dir crawl -depth 3 -topN 5
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2012-09-11 16:25:29
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
...(omitted)
- Output like the above indicates that fetching has completed.
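For orientation, a finished crawl leaves its databases and per-round segments under the directory given to -dir (here, crawl/). The mkdir line below only simulates that layout so the listing can be tried; a real run creates it automatically.

```shell
# Simulated layout of the crawl output directory (a real run creates
# these automatically under the -dir argument).
cd /tmp && mkdir -p crawl/crawldb crawl/linkdb crawl/segments
ls crawl
# The readdb command from the usage listing can then report statistics:
#   $ [$NUTCH_HOME]/bin/nutch readdb crawl/crawldb -stats
```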
Attachments (1)
- Alalysis_Page.png (97.0 KB) - added by shunfa 12 years ago.