{{{
#!html
<div style="text-align: center; color:#151B8D"><big style="font-weight: bold;"><big><big>
Crawlzilla v0.2.2: Error Recovery Steps
</big></big></big></div> <div style="text-align: center; color:#7E2217"><big style="font-weight: bold;"><big>
nutch 1.0 + hadoop 0.19 + solr 1.3.0
</big></big></div>
}}}
[[PageOutline]]

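== Introduction ==
With the nutch 1.0 bundled in crawlzilla 0.2.2, a crawl sometimes stalls after the "crawldb + generate + fetch" cycle completes: the remaining steps never run, hadoop shows no jobs, and go.sh sits idle, displaying "crawling" indefinitely and never reaching "finish".

Possible causes:
 * Too much data: the total text exceeds 100,000 entries
 * Running too long: the whole process runs for more than 3 hours

The steps that did not run are: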
      	
{{{
#!text
linkdb       	_JOB_DIR_/linkdb
index-lucene    _JOB_DIR_/indexes	100.00%
dedup 1:       	urls by time	100.00%
dedup 2:       	content by hash	100.00%
dedup 3:       	delete from index(es)
}}}
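
One way to confirm which steps were skipped is to look at what the job directory contains on HDFS. The following is only a minimal sketch and not part of Crawlzilla: the helper name `missing_outputs` is hypothetical, it assumes `hadoop` is on the PATH, and the entry names to look for (here `linkdb` and `index`, taken from the manual commands on this page) are passed in by the caller.

{{{
#!sh
#!/bin/sh
# Hypothetical helper: print each expected entry name that does not appear
# as the last path component in a `hadoop fs -ls` listing read from stdin.
missing_outputs() {
    listing=$(cat)
    for name in "$@"; do
        # Match lines whose path ends in "/<name>".
        if ! printf '%s\n' "$listing" | grep -q "/$name\$"; then
            printf '%s\n' "$name"
        fi
    done
}

# Example usage (assumes the job directory used in the commands on this page):
#   hadoop fs -ls /user/crawler/cw_yahoo_5 | missing_outputs linkdb index
}}}

Any name printed by the helper is an output that still has to be produced by hand.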

== Manual Repair Steps ==


{{{
cd /opt/crawlzilla/nutch
}}}

 === linkdb ===
 * linkdb tw_yahoo_com_6/linkdb

{{{
#!java
Usage: LinkDb <linkdb> (-dir <segmentsDir> | <seg1> <seg2> ...) 
}}}

{{{
$ /opt/crawlzilla/nutch/bin/nutch invertlinks /user/crawler/cw_yahoo_5/linkdb -dir /user/crawler/cw_yahoo_5/segments/
}}}

 === index-lucene ===
 
 * index-lucene tw_yahoo_com_6/indexes
 
{{{
#!java
Usage: Indexer <index> <crawldb> <linkdb> <segment> ...
}}}

{{{
$ /opt/crawlzilla/nutch/bin/nutch index /user/crawler/cw_yahoo_5/index /user/crawler/cw_yahoo_5/crawldb /user/crawler/cw_yahoo_5/linkdb /user/crawler/cw_yahoo_5/segments/20101027234843 /user/crawler/cw_yahoo_5/segments/20101027234956 /user/crawler/cw_yahoo_5/segments/20101027235315 /user/crawler/cw_yahoo_5/segments/20101028000804 /user/crawler/cw_yahoo_5/segments/20101028002826
}}}
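
Because the Indexer takes each segment as a separate argument, typing the list by hand is error-prone. As a rough sketch (the helper name `list_segments` is hypothetical, and it assumes `hadoop` is on the PATH and that segment directories are named with digits only, as in the example above), the list could be derived from an HDFS listing:

{{{
#!sh
#!/bin/sh
# Hypothetical helper: read `hadoop fs -ls <job>/segments` output on stdin
# and print the segment paths (entries whose last component is all digits)
# on one line, separated by spaces.
list_segments() {
    awk '$NF ~ /\/[0-9]+$/ { printf "%s%s", sep, $NF; sep = " " }
         END { if (sep) print "" }'
}

# Example usage (JOB is the job directory used in the commands on this page):
#   JOB=/user/crawler/cw_yahoo_5
#   SEGMENTS=$(hadoop fs -ls "$JOB/segments" | list_segments)
#   /opt/crawlzilla/nutch/bin/nutch index "$JOB/index" "$JOB/crawldb" "$JOB/linkdb" $SEGMENTS
}}}

`$SEGMENTS` is left unquoted in the usage example on purpose, so that the shell splits it into one argument per segment.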


 === dedup ===
 
 * dedup 1: urls by time	100.00%
 * dedup 2: content by hash	100.00%
 * dedup 3: delete from index(es)

{{{
#!java
Usage: DeleteDuplicates <indexes> ...
}}}

{{{
$ /opt/crawlzilla/nutch/bin/nutch dedup /user/crawler/cw_yahoo_5/index
}}}
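
The three commands above could be chained so that a failure in one step stops the rest. This is a sketch under stated assumptions, not a script shipped with Crawlzilla: `repair_job`, `run`, `NUTCH`, and `DRY_RUN` are hypothetical names, and the default dry-run mode only prints the commands instead of executing them.

{{{
#!sh
#!/bin/sh
# Hypothetical wrapper around the three manual repair commands.
# NUTCH   : path to the nutch launcher (default taken from this page).
# DRY_RUN : 1 (default) prints each command; any other value executes it.
NUTCH=${NUTCH:-/opt/crawlzilla/nutch/bin/nutch}
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# repair_job <job_dir> <segment> [<segment> ...]
# Runs linkdb, index and dedup in order, stopping at the first failure.
repair_job() {
    job=$1
    shift
    run "$NUTCH" invertlinks "$job/linkdb" -dir "$job/segments/" &&
    run "$NUTCH" index "$job/index" "$job/crawldb" "$job/linkdb" "$@" &&
    run "$NUTCH" dedup "$job/index"
}

# Example usage:
#   repair_job /user/crawler/cw_yahoo_5 /user/crawler/cw_yahoo_5/segments/20101027234843
}}}

Reviewing the printed commands first, then rerunning with `DRY_RUN=0`, keeps a mistyped job directory from launching jobs against the wrong path.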