Nutch Research Notes
Preface
- The NutchEz build under development is already working, but only the basic features are in place, and several problems have been identified
- After reading through the official Nutch website in full, I hope to come up with better ideas and ways to improve it
More Commands
readdb
- read / dump crawl db
- Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
- -stats [-sort] print overall statistics to System.out
$ nutch readdb /tmp/search/crawldb -stats
09/06/09 12:18:13 INFO mapred.MapTask: data buffer = 79691776/99614720
09/06/09 12:18:13 INFO mapred.MapTask: record buffer = 262144/327680
09/06/09 12:18:14 INFO crawl.CrawlDbReader: TOTAL urls: 1072
09/06/09 12:18:14 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 1002
09/06/09 12:18:14 INFO crawl.CrawlDbReader: status 2 (db_fetched): 68
- -dump <out_dir> [-format normal|csv ] dump the whole db to a text file in <out_dir>
$ nutch readdb /tmp/search/crawldb/ -dump ./dump
$ vim ./dump/part-00000
- -url <url> print information on <url> to System.out
$ nutch readdb /tmp/search/crawldb/ -url http://www.nchc.org.tw/tw/
URL: http://www.nchc.org.tw/tw/
Version: 7
Status: 6 (db_notmodified)
Fetch time: Thu Jul 09 14:34:48 CST 2009
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 3.1152809
Signature: ce0202bbd593b09b86ce8a9aa991b321
Metadata: _pst_: success(1), lastModified=0
$ nutch readdb /tmp/search/crawldb/ -url http://www.nchc.org.tw
URL: http://www.nchc.org.tw not found
- -topN <nnnn> <out_dir> [<min>] dump top <nnnn> urls sorted by score to <out_dir>
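A minimal sketch of the -topN usage, following the crawldb path used in the examples above (the output directory name ./topurls is illustrative):

```shell
$ nutch readdb /tmp/search/crawldb/ -topN 10 ./topurls
$ vim ./topurls/part-00000
```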
inject
- inject new urls into the database
- Usage: Injector <crawldb> <url_dir>
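A sketch of injecting new seed URLs, assuming a plain-text seed file under a urls/ directory (the directory and file names here are illustrative, not prescribed by Nutch):

```shell
$ mkdir urls
$ echo "http://www.nchc.org.tw/tw/" > urls/seed.txt
$ nutch inject /tmp/search/crawldb/ urls/
```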
readlinkdb
- read / dump link db
- Usage: LinkDbReader <linkdb> (-dump <out_dir> | -url <url>)
$ nutch readlinkdb /tmp/search/linkdb/ -dump ./dump
$ vim ./dump/part-00000
readseg
- read / dump segment data
- Usage: SegmentReader (-dump ... | -list ... | -get ...) [general options]
- SegmentReader -dump <segment_dir> <output> [general options]
$ nutch readseg -dump /tmp/search/segments/20090609143444/ ./dump/
$ vim ./dump/dump
- SegmentReader -list (<segment_dir1> ... | -dir <segments>) [general options]
$ nutch readseg -list /tmp/search/segments/20090609143444/
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20090609143444  1          2009-06-09T14:34:48  2009-06-09T14:34:48  1        1
- SegmentReader -get <segment_dir> <keyValue> [general options]
$ nutch readseg -get /tmp/search/segments/20090609143444/ http://bioinfo.nchc.org.tw/
updatedb
- update crawl db from segments after fetching
- Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]
$ nutch updatedb /tmp/search/crawldb/ -dir /tmp/search/segments/
dedup
- remove duplicates from a set of segment indexes
- Usage: DeleteDuplicates <indexes> ...
$ nutch dedup /tmp/search/indexes/
Notes
Last modified on Jun 9, 2009, 5:02:13 PM