Changes between Version 5 and Version 6 of waue/2010/1029


Timestamp:
Oct 29, 2010, 5:03:57 PM (14 years ago)
Author:
waue
Comment:

--

  • waue/2010/1029

    v5 v6  
    99[[PageOutline]]
    1010
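    11 == Preface ==
     11= Preface =
    1212With the nutch 1.0 used by crawlzilla 0.2.2, some site crawls finish the "crawldb + generate + fetch" loop and then stop: the remaining steps never run, hadoop has no job, and go.sh sits idle, showing the crawling status forever and never reaching finish.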
    1313
     
    2727}}}
    2828
    29 == Manual repair steps ==
     29= Manual repair steps =
    3030
    3131
     
    3434}}}
    3535
    36  === index ===
     36 == index ==
    3737 * linkdb tw_yahoo_com_6/linkdb
    3838
     
    4646}}}
    4747
    48  === index-lucene ===
     48 == index-lucene ==
    4949 
    5050 * index-lucene tw_yahoo_com_6/indexes
     
    6060
    6161
    62  === dedup ===
     62 == dedup ==
    6363 
    6464 * dedup 1: urls by time        100.00%
     
    7474/opt/crawlzilla/nutch/bin/nutch dedup /user/crawler/cw_yahoo_5/index
    7575}}}
     76
     77
     78 == download and import ==
     79
     80 {{{
     81 /opt/crawlzilla/nutch/bin/hadoop dfs -get cw_yahoo_5 ~/crawlzilla/archieve/cw_yahoo_5
     82 cd ~/crawlzilla/archieve/
     83 echo "0h:0m:0s" >> ./cw_yahoo_5/cw_yahoo_5PassTime
     84 echo "5" >> ./cw_yahoo_5/.crawl_depth
     85 cd ~/crawlzilla/archieve/cw_yahoo_5/index
     86 mv part-00000/* ./
     87 rmdir part-00000/
     88 }}}
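
The local part of the download-and-import steps above (everything after the `hadoop dfs -get`) could be wrapped in a small helper so it can be repeated for other crawls. This is only a sketch: `finish_import` is a hypothetical function name, not part of crawlzilla, and it assumes the crawl directory has already been copied out of HDFS and that the index contains a single `part-00000` directory, as in the steps above.

```shell
#!/bin/sh
# Sketch: post-process a crawl directory that was already fetched from HDFS.
# finish_import is a hypothetical helper, not a crawlzilla command.
finish_import() {
    archive_dir=$1   # local archive directory, e.g. ~/crawlzilla/archieve
    name=$2          # crawl name, e.g. cw_yahoo_5
    depth=$3         # crawl depth to record, e.g. 5

    crawl_dir="$archive_dir/$name"
    if [ ! -d "$crawl_dir/index/part-00000" ]; then
        echo "missing $crawl_dir/index/part-00000" >&2
        return 1
    fi

    # Record the same metadata files the manual steps create.
    echo "0h:0m:0s" >> "$crawl_dir/${name}PassTime"
    echo "$depth" >> "$crawl_dir/.crawl_depth"

    # Flatten the index: move the part files up one level,
    # then remove the emptied part directory.
    mv "$crawl_dir/index/part-00000"/* "$crawl_dir/index/" &&
        rmdir "$crawl_dir/index/part-00000"
}
```

For the example in this page, the call would be `finish_import ~/crawlzilla/archieve cw_yahoo_5 5`, run after the `hadoop dfs -get` command has completed.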