Changes between Version 4 and Version 5 of waue/2009/0609
- Timestamp: Jun 9, 2009, 4:33:59 PM
waue/2009/0609
=== readdb ===
 - read / dump crawl db
 - Usage: !CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
 - -stats [-sort]  print overall statistics to System.out
{{{
$ nutch readdb /tmp/search/crawldb -stats

09/06/09 12:18:13 INFO mapred.MapTask: data buffer = 79691776/99614720
09/06/09 12:18:13 INFO mapred.MapTask: record buffer = 262144/327680
09/06/09 12:18:14 INFO crawl.CrawlDbReader: TOTAL urls: 1072
09/06/09 12:18:14 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 1002
09/06/09 12:18:14 INFO crawl.CrawlDbReader: status 2 (db_fetched): 68
}}}
 - -dump <out_dir> [-format normal|csv]  dump the whole db to a text file in <out_dir>
 - -url <url>  print information on <url> to System.out
 - -topN <nnnn> <out_dir> [<min>]  dump top <nnnn> urls sorted by score to <out_dir>

=== inject ===
 - inject new urls into the database
 - Usage: !Injector <crawldb> <url_dir>

=== readlinkdb ===
 - read / dump link db
 - Usage: !LinkDbReader <linkdb> (-dump <out_dir> | -url <url>)
{{{
$ nutch readlinkdb /tmp/search/linkdb/ -dump ./dump
$ vim ./dump/part-00000
}}}

=== readseg ===
 - read / dump segment data
 - Usage: !SegmentReader (-dump ... | -list ... | -get ...) [general options]
 - !SegmentReader -dump <segment_dir> <output> [general options]
 - !SegmentReader -list (<segment_dir1> ... | -dir <segments>) [general options]
 - !SegmentReader -get <segment_dir> <keyValue> [general options]
{{{
...
}}}

=== updatedb ===
 - update crawl db from segments after fetching
 - Usage: !CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]
{{{
$ nutch updatedb /tmp/search/crawldb/ -dir /tmp/search/segments/
...
}}}

=== dedup ===
 - remove duplicates from a set of segment indexes
 - Usage: !DeleteDuplicates <indexes> ...
{{{
$ nutch dedup /tmp/search/indexes/
}}}

== Notes ==
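Not part of the original page: a minimal shell sketch of how the `readdb -stats` log lines shown above could be turned into variables, e.g. for a monitoring or cron script. The sample lines are copied from the dump on this page; in real use you would capture `nutch readdb /tmp/search/crawldb -stats 2>&1` instead of the hard-coded `stats` string.

```shell
#!/bin/sh
# Sketch: extract URL counts from `nutch readdb -stats` output.
# The sample below is the output captured earlier on this page; a live
# run would use: stats=$(nutch readdb /tmp/search/crawldb -stats 2>&1)
stats='09/06/09 12:18:14 INFO crawl.CrawlDbReader: TOTAL urls: 1072
09/06/09 12:18:14 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 1002
09/06/09 12:18:14 INFO crawl.CrawlDbReader: status 2 (db_fetched): 68'

# Each counter is the last field of its log line, so $NF picks it out.
total=$(printf '%s\n' "$stats" | awk '/TOTAL urls/ {print $NF}')
unfetched=$(printf '%s\n' "$stats" | awk '/db_unfetched/ {print $NF}')
fetched=$(printf '%s\n' "$stats" | awk '/\(db_fetched\)/ {print $NF}')

echo "total=$total fetched=$fetched unfetched=$unfetched"
```

Matching on `(db_fetched)` with the parentheses escaped avoids any overlap with the `db_unfetched` line; the same pattern extends to the other `status` codes that `readdb -stats` reports.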