[[PageOutline]]

= Crawlzilla 2.0 =

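 * Top ten user complaints:
   * Crawls fail because they take too long
   * After a recrawl, Tomcat fails to automatically reload the new IDB
   * Cannot switch the interface to Chinese
   * Sites that require a username and password for authentication
 * Recently discovered bugs / shortcomings
   * The install program does not support wireless network cards
   * Upgrade / uninstall -> how should old data be preserved or migrated? (Stateless)
   * During a recrawl the original CrawlDB must be kept, and overwritten only after the recrawl completes.
   * Does the Fix Job workflow forget to delete the crawldb on HDFS?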
{{{
crawler@CrawlzillaServ:~$ /opt/crawlzilla/nutch/bin/hadoop fs -lsr jazz
drwxr-xr-x   - crawler supergroup          0 2012-09-14 21:59 /user/crawler/jazz
drwxr-xr-x   - crawler supergroup          0 2012-09-14 21:59 /user/crawler/jazz/wang
drwxr-xr-x   - crawler supergroup          0 2012-09-14 21:59 /user/crawler/jazz/wang/crawldb
}}}
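The recrawl requirement above (keep the original CrawlDB until the new crawl finishes, then overwrite it) could be sketched as a build-then-swap on a local filesystem. This is only an illustration under stated assumptions: `swap_in_new_crawl`, the directory names, and the `.old` suffix are hypothetical and not taken from the Crawlzilla source, and for data on HDFS the renames would have to go through `hadoop fs -mv` instead of `os.rename`.

```python
import os
import shutil
import tempfile


def swap_in_new_crawl(live_dir, new_dir):
    """Replace live_dir with new_dir, keeping the old data until the swap succeeds.

    Both paths are hypothetical local directories on the same filesystem.
    """
    backup_dir = live_dir + ".old"
    if os.path.exists(backup_dir):
        shutil.rmtree(backup_dir)            # clear leftovers from an earlier run
    if os.path.exists(live_dir):
        os.rename(live_dir, backup_dir)      # the old data stays recoverable
    try:
        os.rename(new_dir, live_dir)         # promote the finished recrawl
    except OSError:
        if os.path.exists(backup_dir):
            os.rename(backup_dir, live_dir)  # roll back if the promotion failed
        raise
    shutil.rmtree(backup_dir, ignore_errors=True)


if __name__ == "__main__":
    # Small demo in a throwaway directory.
    root = tempfile.mkdtemp()
    live = os.path.join(root, "crawldb")
    new = os.path.join(root, "crawldb.new")
    os.makedirs(live)
    os.makedirs(new)
    with open(os.path.join(new, "part"), "w") as handle:
        handle.write("new")
    swap_in_new_crawl(live, new)
    print(os.listdir(live))
    shutil.rmtree(root)
```

The old directory is deleted only after the new one is in place, so a failed recrawl never leaves the search front end without data.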
 * Core technology:
   * A public lexicon is required for better search precision and word segmentation
     * http://3du.tw/
     * http://www.audreyt.org/newdict/moedict-webkit/
     * https://raw.github.com/tony1223/3du.tw-phrase/master/src/words.csv
     * http://kcwu.csie.org/~kcwu/moedict/
     * [Reference] [http://trac.nchc.org.tw/cloud/wiki/waue/2011/0801 waue: how to add a lexicon]
     * [Reference] [http://trac.nchc.org.tw/cloud/wiki/waue/2011/0930 waue: how to build a lexicon - from gcin]
   * A semantically classified corpus is required for better classification
   * [http://140.111.34.54/mandr/content.aspx?site_content_sn=11701 Ministry of Education: coverage and description of the entries in its five online dictionaries]
 * Ideas:
   * How to reload Tomcat after a re-crawl
     * [Reference] http://www.mulesoft.com/tomcat-reload
     * Via http://[hostname]:[port]/manager/reload?path=[/path/to/your/webapp]
     * [Reference] If Solr is adopted in the future - http://wiki.apache.org/solr/CoreAdmin#RELOAD
   * Support opensearch.xml - ([wiki:jazz/12-03-29 2012-03-29])
   * Packaging (separate the Nutch, Lucene, and Hadoop parts) - default to standalone mode
   * Integrate with Solr -> or with ElasticSearch?
   * Integrate with Carrot2? Search result clustering
     * [Reference] http://wiki.apache.org/nutch/ClusteringPlugin
   * Support web page screenshots? (2012-12-04)
     * http://blog.jangmt.com/2009/10/cutycapt.html
     * http://blog.saymoon.com/2009/11/take-snapshot-in-linux-command-line/
     * https://github.com/istvan-antal/CutyCapt - http://cutycapt.sourceforge.net/
     * http://iecapt.sourceforge.net/ - [http://www.zubrag.com/scripts/website-thumbnail-generator.php Reference 1] - [http://www.zubrag.com/articles/create-website-snapshot-thumbnail.php Reference 2]
     * http://code.google.com/p/minemine/wiki/WebPageGrabber
     * https://github.com/coderholic/PyWebShot - uses Python to drive a Mozilla browser to take web page screenshots - relies on [http://www-archive.mozilla.org/unix/gtk-embedding.html GtkMozEmbed]
     * http://www.boutell.com/webthumb/
     * http://code.google.com/p/browsershots/ - the code behind the http://browsershots.org/ service
   * Draw a URL relationship graph
     * http://d3js.org/ - Ex. http://www.jasondavies.com/collatz-graph/
     * https://github.com/jacomyal/sigma.js - http://sigmajs.org/
     * http://blog.ivank.net/force-based-graph-drawing-in-javascript.html - http://g.ivank.net
     * http://dhotson.github.com/springy/
     * http://arborjs.org/
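For the Tomcat reload idea listed above, a minimal sketch of calling the manager URL after a re-crawl might look like the following. The URL shape is the one quoted in the notes; the host, port, context path, and credentials are hypothetical placeholders, and the helper names `build_reload_url` and `reload_webapp` are made up for illustration. The manager application requires an authenticated user with a manager role, and the exact manager path differs between Tomcat versions, so check the manager documentation for the installed version.

```python
import base64
import urllib.parse
import urllib.request


def build_reload_url(host, port, context_path):
    """Build the manager reload URL in the form quoted in the notes."""
    query = urllib.parse.urlencode({"path": context_path}, safe="/")
    return "http://%s:%d/manager/reload?%s" % (host, port, query)


def reload_webapp(host, port, context_path, user, password, timeout=10):
    """Ask the Tomcat manager to reload one webapp and return its text reply.

    Uses HTTP basic authentication; all arguments are placeholders to be
    replaced with the values of the actual deployment.
    """
    request = urllib.request.Request(build_reload_url(host, port, context_path))
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only prints the URL; no request is sent.
    print(build_reload_url("localhost", 8080, "/crawlzilla"))
```

Calling `reload_webapp` at the end of the recrawl job, once the new index is in place, would avoid restarting the whole Tomcat instance.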
 * Package dependencies:
   * bc
{{{
480:      large16=$(echo "$JAVA_version >= 1.6" | bc)
}}}
   * dialog
   * expect
   * lsb_release
{{{
./install: line 593: expect: command not found
./install: line 968: lsb_release: command not found
}}}
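The "command not found" errors above suggest checking for the required commands before the installer runs. A minimal sketch of such a pre-flight check, using only the four commands named in this list; the function name `missing_commands` is hypothetical, and the real installer is a shell script, so this only illustrates the idea:

```python
import shutil

# Commands the notes above list as dependencies of the install script.
REQUIRED_COMMANDS = ["bc", "dialog", "expect", "lsb_release"]


def missing_commands(commands, which=shutil.which):
    """Return the commands that cannot be found on PATH."""
    return [name for name in commands if which(name) is None]


if __name__ == "__main__":
    missing = missing_commands(REQUIRED_COMMANDS)
    if missing:
        print("Missing commands: " + ", ".join(missing))
    else:
        print("All required commands are available.")
```

Reporting every missing command at once is friendlier than failing at line 593 and again at line 968 of the installer.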

 * A Nutch Debian package was previously filed with WNPP
   * http://lists.debian.org/debian-wnpp/2006/02/msg00225.html

 * Someone asked me what advantages Crawlzilla has over the commercial product Tornado (龍捲風)
   * http://www.tornado.com.tw/gov/ts

 * References:
  * Personalized bookmark search cloud service - http://historio.us/

 * Observations:
  * https://github.com/linkedin/indextank-engine - !LinkedIn acquired indexTank and released the source code

== opensearch.xml ==

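 * While using Firefox 11.0 recently, I noticed that the search box has a feature that reads opensearch.xml. Crawlzilla should therefore be able to make it easier for users to add its search URL to the browser search box.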
   * http://www.opensearch.org/Community/OpenSearch_search_clients
 * Trac has supported OpenSearch since quite early on
   * http://trac.edgewall.org/changeset/4331 (2006-11-22)
{{{
~$ dpkg -L trac | grep opensearch
/usr/share/pyshared/trac/search/templates/opensearch.xml
}}}

 * [Reference] [https://developer.mozilla.org/en/Creating_OpenSearch_plugins_for_Firefox Creating OpenSearch plugins for Firefox]
 * [Reference] A search plugin can also support search suggestions - [https://developer.mozilla.org/en/Supporting_search_suggestions_in_search_plugins Supporting search suggestions in search plugins]
 * [Promotion] The plugin can be submitted to http://www.searchplugins.net/pluginlist.aspx
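To make the opensearch.xml idea concrete, here is a minimal sketch that generates an OpenSearch 1.1 description document. The element names and the namespace come from the OpenSearch specification; the short name, the description text, and the search URL template are hypothetical placeholders, since the actual Crawlzilla search URL is not given in these notes.

```python
import xml.etree.ElementTree as ET

OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"


def build_opensearch_description(short_name, description, search_template):
    """Return an OpenSearch description document as a string.

    search_template must contain the {searchTerms} placeholder that the
    browser replaces with the user's query.
    """
    ET.register_namespace("", OPENSEARCH_NS)  # serialize with a default xmlns
    root = ET.Element("{%s}OpenSearchDescription" % OPENSEARCH_NS)
    ET.SubElement(root, "{%s}ShortName" % OPENSEARCH_NS).text = short_name
    ET.SubElement(root, "{%s}Description" % OPENSEARCH_NS).text = description
    ET.SubElement(root, "{%s}InputEncoding" % OPENSEARCH_NS).text = "UTF-8"
    url = ET.SubElement(root, "{%s}Url" % OPENSEARCH_NS)
    url.set("type", "text/html")
    url.set("template", search_template)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder values; replace with the real search page of the deployment.
    print(build_opensearch_description(
        "Crawlzilla",
        "Search this Crawlzilla index",
        "http://example.org/search.jsp?query={searchTerms}",
    ))
```

For the browser to discover the file, the search page would also need a `<link rel="search" type="application/opensearchdescription+xml" href="..." title="...">` element in its HTML head, as described in the Mozilla article referenced above.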