[[PageOutline]]

= Crawlzilla 2.0 =

 * Recently discovered bugs / shortcomings
   * The install script does not support wireless network interfaces
   * Upgrade / uninstall -> how should existing data be preserved or migrated?! (Stateless)
   * While a recrawl is running, the original CrawlDB must be kept and only overwritten after the recrawl completes.
   * Does the Fix Job flow forget to delete the crawldb on HDFS?
{{{
crawler@CrawlzillaServ:~$ /opt/crawlzilla/nutch/bin/hadoop fs -lsr jazz
drwxr-xr-x   - crawler supergroup          0 2012-09-14 21:59 /user/crawler/jazz
drwxr-xr-x   - crawler supergroup          0 2012-09-14 21:59 /user/crawler/jazz/wang
drwxr-xr-x   - crawler supergroup          0 2012-09-14 21:59 /user/crawler/jazz/wang/crawldb
}}}
 * Ideas:
   * How to reload Tomcat after a re-crawl
     * [Reference] http://www.mulesoft.com/tomcat-reload
       * Via http://[hostname]:[port]/manager/reload?path=[/path/to/your/webapp]
     * [Reference] If Solr is adopted in the future - http://wiki.apache.org/solr/CoreAdmin#RELOAD
   * Support opensearch.xml - ([wiki:jazz/12-03-29 2012-03-29])
   * Packaging (split out the Nutch, Lucene, and Hadoop parts) - default to the single-node version
   * Integrate with Solr -> or integrate with ElasticSearch?!
   * Integrate with Carrot2? Search result clustering (Search Clustering)
     * [Reference] http://wiki.apache.org/nutch/ClusteringPlugin
   * Support web page screenshots?! (2012-12-04)
     * http://blog.jangmt.com/2009/10/cutycapt.html
     * http://blog.saymoon.com/2009/11/take-snapshot-in-linux-command-line/
     * https://github.com/istvan-antal/CutyCapt - http://cutycapt.sourceforge.net/
     * http://iecapt.sourceforge.net/ - [http://www.zubrag.com/scripts/website-thumbnail-generator.php Reference 1] - [http://www.zubrag.com/articles/create-website-snapshot-thumbnail.php Reference 2]
     * http://code.google.com/p/minemine/wiki/WebPageGrabber
     * https://github.com/coderholic/PyWebShot - uses Python to drive a Mozilla browser to take web page screenshots - relies on [http://www-archive.mozilla.org/unix/gtk-embedding.html GtkMozEmbed]
     * http://www.boutell.com/webthumb/
     * http://code.google.com/p/browsershots/ - the code behind the http://browsershots.org/ service
   * Draw a URL relationship graph
     * http://d3js.org/ - Ex. http://www.jasondavies.com/collatz-graph/
     * http://dhotson.github.com/springy/
     * http://arborjs.org/
 * Package dependencies:
   * bc
{{{
480: large16=$(echo "$JAVA_version >= 1.6" | bc)
}}}
   * dialog
   * expect
   * lsb_release
{{{
./install: line 593: expect: command not found
./install: line 968: lsb_release: command not found
}}}
 * A Nutch Debian package was previously filed in WNPP
   * http://lists.debian.org/debian-wnpp/2006/02/msg00225.html
 * Someone asked me what advantages Crawlzilla has over the commercial product Tornado (龍捲風)?!
   * http://www.tornado.com.tw/gov/ts
 * References:
   * A personalized bookmark search cloud service - http://historio.us/

== opensearch.xml ==

 * While using Firefox 11.0 recently, I noticed that the search box has a new feature: it reads opensearch.xml. So Crawlzilla should be able to make it easier for users to add its search URL to the browser's search box.
   * http://www.opensearch.org/Community/OpenSearch_search_clients
 * Trac has supported open search since quite early on
   * http://trac.edgewall.org/changeset/4331 (2006-11-22)
{{{
~$ dpkg -L trac | grep opensearch
/usr/share/pyshared/trac/search/templates/opensearch.xml
}}}
 * [Reference] [https://developer.mozilla.org/en/Creating_OpenSearch_plugins_for_Firefox Creating OpenSearch plugins for Firefox]
 * [Reference] A search plugin can also support search suggestions - [https://developer.mozilla.org/en/Supporting_search_suggestions_in_search_plugins Supporting search suggestions in search plugins]
 * [Promotion] The plugin could be submitted to http://www.searchplugins.net/pluginlist.aspx
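
To make the opensearch.xml idea above more concrete, here is a minimal sketch of how such a description file could be generated. It is not Crawlzilla's actual implementation. The function name `build_opensearch_description`, the host `crawlzilla.example.org`, the path `/default/search.jsp`, and the `query` parameter are all assumptions made for illustration. Only the element names and the namespace come from the OpenSearch 1.1 description format.

```python
import xml.etree.ElementTree as ET

# Namespace of the OpenSearch 1.1 description document.
OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"


def build_opensearch_description(short_name, description, search_url_template):
    """Return an OpenSearch description document as a string.

    search_url_template must contain the {searchTerms} placeholder,
    which the browser replaces with the user's query.
    """
    if "{searchTerms}" not in search_url_template:
        raise ValueError("template must contain {searchTerms}")

    # Emit the OpenSearch namespace as the default namespace (no prefix).
    ET.register_namespace("", OPENSEARCH_NS)

    def tag(name):
        return "{%s}%s" % (OPENSEARCH_NS, name)

    root = ET.Element(tag("OpenSearchDescription"))
    ET.SubElement(root, tag("ShortName")).text = short_name
    ET.SubElement(root, tag("Description")).text = description
    ET.SubElement(root, tag("InputEncoding")).text = "UTF-8"
    ET.SubElement(
        root,
        tag("Url"),
        {"type": "text/html", "template": search_url_template},
    )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical host, path, and parameter name; adjust to the real search page.
    print(
        build_opensearch_description(
            "Crawlzilla",
            "Search the Crawlzilla index",
            "http://crawlzilla.example.org:8080/default/search.jsp?query={searchTerms}",
        )
    )
```

For the browser to offer the engine in its search box, the search page would presumably also need an autodiscovery link in its HTML head, i.e. a `link` element with `rel="search"` and `type="application/opensearchdescription+xml"` pointing at the generated file.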