Changes between Version 4 and Version 5 of waue/2009/0609


Timestamp: Jun 9, 2009, 4:33:59 PM
Author: waue

=== readdb ===
 - read / dump crawl db
 - Usage: !CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
 - -stats [-sort]       print overall statistics to System.out
{{{
$ nutch readdb /tmp/search/crawldb -stats

09/06/09 12:18:13 INFO mapred.MapTask: data buffer = 79691776/99614720
09/06/09 12:18:13 INFO mapred.MapTask: record buffer = 262144/327680
09/06/09 12:18:14 INFO crawl.CrawlDbReader: TOTAL urls: 1072
09/06/09 12:18:14 INFO crawl.CrawlDbReader: status 1 (db_unfetched):    1002
09/06/09 12:18:14 INFO crawl.CrawlDbReader: status 2 (db_fetched):      68
}}}
 - -dump <out_dir> [-format normal|csv ]        dump the whole db to a text file in <out_dir>
 - -url <url>   print information on <url> to System.out
 - -topN <nnnn> <out_dir> [<min>]       dump top <nnnn> urls sorted by score to <out_dir>
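The other modes follow the same pattern as -stats; a sketch of typical usage (the output directories are illustrative, and <url> stands for a URL already in the crawl db):

{{{
$ nutch readdb /tmp/search/crawldb -dump ./crawldb_dump
$ nutch readdb /tmp/search/crawldb -topN 10 ./topN_dump
$ nutch readdb /tmp/search/crawldb -url <url>
}}}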
=== inject ===
 - inject new urls into the database
 - Usage: Injector <crawldb> <url_dir>

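For example, assuming the seed URL list lives in /tmp/search/urls (a placeholder; adjust to your layout):

{{{
$ nutch inject /tmp/search/crawldb /tmp/search/urls
}}}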
=== readlinkdb ===
 - read / dump link db
 - Usage: !LinkDbReader <linkdb> (-dump <out_dir> | -url <url>)
{{{
$ nutch readlinkdb /tmp/search/linkdb/ -dump ./dump
$ vim ./dump/part-00000
}}}
=== readseg ===
 - read / dump segment data
 - Usage: !SegmentReader (-dump ... | -list ... | -get ...) [general options]
 - !SegmentReader -dump <segment_dir> <output> [general options]
 - !SegmentReader -list (<segment_dir1> ... | -dir <segments>) [general options]
 - !SegmentReader -get <segment_dir> <keyValue> [general options]
{{{

}}}
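A sketch of typical usage following the Usage lines above (<segment_dir> is a placeholder; real segment directories are named by timestamp):

{{{
$ nutch readseg -list -dir /tmp/search/segments/
$ nutch readseg -dump /tmp/search/segments/<segment_dir> ./seg_dump
}}}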
=== updatedb ===
 - update crawl db from segments after fetching
 - Usage: !CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter] [-noAdditions]
{{{
$ nutch updatedb /tmp/search/crawldb/ -dir /tmp/search/segments/
}}}
=== dedup ===
 - remove duplicates from a set of segment indexes
 - Usage: !DeleteDuplicates <indexes> ...
{{{
$ nutch dedup /tmp/search/indexes/
}}}
== Notes ==