{{{ #!html
Crawlzilla v0.2.2 abnormal error handling steps
nutch 1.0 + hadoop 0.19 + solr 1.3.0
}}}

[[PageOutline]]

= Introduction =

The nutch 1.0 bundled with crawlzilla 0.2.2 sometimes stalls while crawling a site: after the "crawldb + generate + fetch" loop completes, the remaining steps are never run. hadoop shows no running job, and go.sh stays idle, reporting "crawling" forever without ever reaching "finish".

Possible causes:
 * the data volume is too large: more than 100,000 text records in total
 * the run takes too long: the whole procedure exceeds 3 hours

The steps that never get executed are:
{{{
#!text
linkdb _JOB_DIR_/linkdb
index-lucene _JOB_DIR_/indexes 100.00%
dedup 1: urls by time 100.00%
dedup 2: content by hash 100.00%
dedup 3: delete from index(es)
}}}

= Manual repair steps =

{{{
cd /opt/crawlzilla/nutch
}}}

== linkdb ==
 * linkdb tw_yahoo_com_6/linkdb
{{{
#!java
Usage: LinkDb <linkdb> (-dir <segmentsDir> | <seg1> <seg2> ...)
}}}
{{{
$ /opt/crawlzilla/nutch/bin/nutch invertlinks /user/crawler/cw_yahoo_5/linkdb -dir /user/crawler/cw_yahoo_5/segments/
}}}

== index-lucene ==
 * index-lucene tw_yahoo_com_6/indexes
{{{
#!java
Usage: Indexer <index> <crawldb> <linkdb> <segment> ...
}}}
{{{
$ /opt/crawlzilla/nutch/bin/nutch index /user/crawler/cw_yahoo_5/index /user/crawler/cw_yahoo_5/crawldb /user/crawler/cw_yahoo_5/linkdb /user/crawler/cw_yahoo_5/segments/20101027234843 /user/crawler/cw_yahoo_5/segments/20101027234956 /user/crawler/cw_yahoo_5/segments/20101027235315 /user/crawler/cw_yahoo_5/segments/20101028000804 /user/crawler/cw_yahoo_5/segments/20101028002826
}}}

== dedup ==
 * dedup 1: urls by time 100.00%
 * dedup 2: content by hash 100.00%
 * dedup 3: delete from index(es)
{{{
#!java
Usage: DeleteDuplicates <indexes> ...
}}}
{{{
/opt/crawlzilla/nutch/bin/nutch dedup /user/crawler/cw_yahoo_5/index
}}}

== download and import ==

{{{
/opt/crawlzilla/nutch/bin/hadoop dfs -get cw_yahoo_5 ~/crawlzilla/archieve/cw_yahoo_5
cd ~/crawlzilla/archieve/
echo "0h:0m:0s" >> ./cw_yahoo_5/cw_yahoo_5PassTime
echo "5" >> ./cw_yahoo_5/.crawl_depth
cd ~/crawlzilla/archieve/cw_yahoo_5/index
mv part-00000/* ./
rmdir part-00000/
}}}

= Automation script =
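The manual repair steps above can be chained into a single shell script. The sketch below is only a rough, untested example under the assumptions used on this page: the crawl job is named cw_yahoo_5, the crawl depth is 5, and the paths match the commands above. The JOB and DEPTH variables and the segment-listing logic are placeholders introduced here for illustration, not part of Crawlzilla itself.

{{{
#!sh
#!/bin/bash
# Rough sketch: finish an interrupted Crawlzilla 0.2.2 crawl by hand.
# JOB, DEPTH and the segment listing are assumptions -- adjust them per crawl.
JOB=cw_yahoo_5
DEPTH=5
NUTCH=/opt/crawlzilla/nutch/bin/nutch
HADOOP=/opt/crawlzilla/nutch/bin/hadoop
HDFS_JOB=/user/crawler/$JOB

# 1. linkdb: invert links from all fetched segments
$NUTCH invertlinks $HDFS_JOB/linkdb -dir $HDFS_JOB/segments/

# 2. index-lucene: build the Lucene index from crawldb, linkdb and every segment
#    (assumes "hadoop dfs -ls" prints each segment path as the last column)
SEGMENTS=$($HADOOP dfs -ls $HDFS_JOB/segments/ | awk '{print $NF}' | grep "$HDFS_JOB/segments/")
$NUTCH index $HDFS_JOB/index $HDFS_JOB/crawldb $HDFS_JOB/linkdb $SEGMENTS

# 3. dedup: delete duplicate documents from the new index
$NUTCH dedup $HDFS_JOB/index

# 4. download the finished crawl and register it for the Crawlzilla UI
#    ("dfs -get $JOB" is relative to the crawler user's HDFS home, as above)
$HADOOP dfs -get $JOB ~/crawlzilla/archieve/$JOB
cd ~/crawlzilla/archieve/
echo "0h:0m:0s" >> ./$JOB/${JOB}PassTime
echo "$DEPTH" >> ./$JOB/.crawl_depth
cd ~/crawlzilla/archieve/$JOB/index
mv part-00000/* ./
rmdir part-00000/
}}}

Run it as the crawler user, and first check which steps the stalled crawl already finished; any step that already completed (for example an existing linkdb) should be removed from or skipped in the script.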