
Version 5 (modified by waue, 15 years ago)


Nutch Research and Notes

Introduction

  • NutchEz is now up and running, but it only covers the basic features, and some problems have surfaced
  • The hope is that, after reading through the official Nutch site in full, better ideas and ways to improve it will emerge

More Commands

readdb

  • read / dump crawl db
  • Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
  • -stats [-sort] print overall statistics to System.out
    $ nutch readdb /tmp/search/crawldb -stats
    
    09/06/09 12:18:13 INFO mapred.MapTask: data buffer = 79691776/99614720
    09/06/09 12:18:13 INFO mapred.MapTask: record buffer = 262144/327680
    09/06/09 12:18:14 INFO crawl.CrawlDbReader: TOTAL urls:	1072
    09/06/09 12:18:14 INFO crawl.CrawlDbReader: status 1 (db_unfetched):	1002
    09/06/09 12:18:14 INFO crawl.CrawlDbReader: status 2 (db_fetched):	68
    
  • -dump <out_dir> [-format normal|csv ] dump the whole db to a text file in <out_dir>
  • -url <url> print information on <url> to System.out
  • -topN <nnnn> <out_dir> [<min>] dump top <nnnn> urls sorted by score to <out_dir>
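  • For example, to dump the ten highest-scoring URLs (the crawldb path follows the /tmp/search crawl used above; the output directory ./topurls is illustrative):
    $ nutch readdb /tmp/search/crawldb -topN 10 ./topurls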

inject

  • inject new urls into the database
  • Usage: Injector <crawldb> <url_dir>
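  • For example, assuming a directory of seed-URL text files at /tmp/search/urls (the path is illustrative):
    $ nutch inject /tmp/search/crawldb /tmp/search/urls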

readlinkdb

  • read / dump link db
  • Usage: LinkDbReader <linkdb> (-dump <out_dir> | -url <url>)
    $ nutch readlinkdb /tmp/search/linkdb/ -dump ./dump
    $ vim ./dump/part-00000
    

readseg

  • read / dump segment data
  • Usage: SegmentReader (-dump ... | -list ... | -get ...) [general options]
  • SegmentReader -dump <segment_dir> <output> [general options]
  • SegmentReader -list (<segment_dir1> ... | -dir <segments>) [general options]
  • SegmentReader -get <segment_dir> <keyValue> [general options]
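  • For example, first list the segments under the crawl directory used above, then dump one of them (Nutch names segments by timestamp; the 20090609121213 name here is illustrative):
    $ nutch readseg -list -dir /tmp/search/segments/
    $ nutch readseg -dump /tmp/search/segments/20090609121213 ./segdump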

updatedb

  • update crawl db from segments after fetching
  • Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]
    $ nutch updatedb /tmp/search/crawldb/ -dir /tmp/search/segments/
    

dedup

  • remove duplicates from a set of segment indexes
  • Usage: DeleteDuplicates <indexes> ...
    $ nutch dedup /tmp/search/indexes/
    

Notes