- Read or dump the crawl db
- Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
- -stats [-sort] print overall statistics to System.out
$ nutch readdb /tmp/search/crawldb -stats
09/06/09 12:18:13 INFO mapred.MapTask: data buffer = 79691776/99614720
09/06/09 12:18:13 INFO mapred.MapTask: record buffer = 262144/327680
09/06/09 12:18:14 INFO crawl.CrawlDbReader: TOTAL urls: 1072
09/06/09 12:18:14 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 1002
09/06/09 12:18:14 INFO crawl.CrawlDbReader: status 2 (db_fetched): 68
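The optional `-sort` flag shown in the usage line breaks the status counts down further (by host). A minimal sketch, reusing the crawldb path from the example above:

```shell
# Print the overall statistics again, with the optional -sort flag
# to get the per-host breakdown (path taken from the example above)
nutch readdb /tmp/search/crawldb -stats -sort
```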
- -dump <out_dir> [-format normal|csv] dump the whole db to a text file in <out_dir>
$ nutch readdb /tmp/search/crawldb/ -dump ./dump
$ vim ./dump/part-00000
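Assuming each dumped record carries the same per-URL fields as the -url output (in particular a `Status: N (db_...)` line), the statuses in a dump can be tallied with standard tools. A sketch; the heredoc stands in for a real `./dump/part-00000`:

```shell
# Count how many URLs are in each crawl status.
# The two sample records below imitate the assumed dump layout;
# in practice you would read ./dump/part-00000 instead.
grep -o 'db_[a-z]*' <<'EOF' | sort | uniq -c
http://example.com/	Version: 7
Status: 2 (db_fetched)

http://example.org/	Version: 7
Status: 1 (db_unfetched)
EOF
```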
- -url <url> print information on <url> to System.out
$ nutch readdb /tmp/search/crawldb/ -url http://www.nchc.org.tw/tw/
URL: http://www.nchc.org.tw/tw/
Version: 7
Status: 6 (db_notmodified)
Fetch time: Thu Jul 09 14:34:48 CST 2009
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 3.1152809
Signature: ce0202bbd593b09b86ce8a9aa991b321
Metadata: _pst_: success(1), lastModified=0
- -topN <nnnn> <out_dir> [<min>] dump the top <nnnn> URLs, sorted by score, to <out_dir>
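Unlike the options above, -topN has no example yet. A sketch using the same crawldb path; the values 100 and ./topurls are illustrative, and the optional <min> argument (when given) drops entries scoring below that threshold:

```shell
# Dump the 100 highest-scoring URLs to ./topurls
# (100 and ./topurls are illustrative values)
nutch readdb /tmp/search/crawldb -topN 100 ./topurls
```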