
Crawlzilla v0.2.2 Abnormal Error Handling Steps
nutch 1.0 + hadoop 0.19 + solr 1.3.0

Preface

With the nutch 1.0 used by crawlzilla 0.2.2, a crawl sometimes stalls after it finishes the "crawldb + generate + fetch" loop: none of the remaining steps are executed, hadoop shows no running job, and go.sh stays idle, forever reporting the crawling step without ever reaching finish.
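
A quick way to confirm the hang is to list hadoop's running jobs: if go.sh still reports "crawling" but no MapReduce job is running, the crawl has stalled as described above. This assumes the hadoop 0.19 client bundled with crawlzilla is reachable; the path below is only a guess, so adjust it to your installation.

$ /opt/crawlzilla/nutch/bin/hadoop job -list    # an empty job list while go.sh says "crawling" confirms the stall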

Possible causes include:

  • Too much data: more than 100,000 text records in total
  • Running too long: the whole process takes more than 3 hours

The steps that never get run are:

linkdb        _JOB_DIR_/linkdb
index-lucene    _JOB_DIR_/indexes  100.00%
dedup 1:        urls by time  100.00%
dedup 2:        content by hash 100.00%
dedup 3:        delete from index(es)

Manual recovery steps

cd /opt/crawlzilla/nutch
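
Before rebuilding anything, it is worth checking that the crawldb left behind by the interrupted crawl is still intact. readdb -stats is a standard nutch 1.0 command; the job directory cw_yahoo_5 is the same example used in the commands below.

$ /opt/crawlzilla/nutch/bin/nutch readdb /user/crawler/cw_yahoo_5/crawldb -stats    # prints the total URL count and fetch-status breakdown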

linkdb

  • linkdb tw_yahoo_com_6/linkdb
Usage: LinkDb <linkdb> (-dir <segmentsDir> | <seg1> <seg2> ...) 
$ /opt/crawlzilla/nutch/bin/nutch invertlinks /user/crawler/cw_yahoo_5/linkdb -dir /user/crawler/cw_yahoo_5/segments/
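
To verify the rebuilt linkdb, nutch 1.0 also provides readlinkdb; dumping it to a scratch directory (the linkdb_dump path here is arbitrary) shows the collected inlink records.

$ /opt/crawlzilla/nutch/bin/nutch readlinkdb /user/crawler/cw_yahoo_5/linkdb -dump /user/crawler/cw_yahoo_5/linkdb_dump    # linkdb_dump is an arbitrary output directory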

index-lucene

  • index-lucene tw_yahoo_com_6/indexes

Usage: Indexer <index> <crawldb> <linkdb> <segment> ...
$ /opt/crawlzilla/nutch/bin/nutch index /user/crawler/cw_yahoo_5/index /user/crawler/cw_yahoo_5/crawldb /user/crawler/cw_yahoo_5/linkdb /user/crawler/cw_yahoo_5/segments/20101027234843 /user/crawler/cw_yahoo_5/segments/20101027234956 /user/crawler/cw_yahoo_5/segments/20101027235315 /user/crawler/cw_yahoo_5/segments/20101028000804 /user/crawler/cw_yahoo_5/segments/20101028002826

dedup

  • dedup 1: urls by time 100.00%
  • dedup 2: content by hash 100.00%
  • dedup 3: delete from index(es)
$ /opt/crawlzilla/nutch/bin/nutch dedup /user/crawler/cw_yahoo_5/index
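
The three recovery commands on this page can also be strung together into one small script. This is only a sketch: the HADOOP path used to list the segments is an assumption about the crawlzilla install, and everything else just parameterizes the commands shown above.

#!/bin/bash
# Sketch: finish a stalled crawlzilla 0.2.2 crawl by hand (linkdb -> index -> dedup).
JOB=/user/crawler/cw_yahoo_5                      # the crawl job directory on HDFS
NUTCH=/opt/crawlzilla/nutch/bin/nutch
HADOOP=/opt/crawlzilla/nutch/bin/hadoop           # assumed location of the bundled hadoop script

# collect every timestamped segment directory instead of typing them out
SEGS=$($HADOOP dfs -ls $JOB/segments/ | grep segments/ | awk '{print $NF}')

$NUTCH invertlinks $JOB/linkdb -dir $JOB/segments/        # rebuild the linkdb
$NUTCH index $JOB/index $JOB/crawldb $JOB/linkdb $SEGS    # build the lucene index
$NUTCH dedup $JOB/index                                   # drop duplicate documents from the index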