Changes between Version 6 and Version 7 of jazz/10-10-17

Timestamp: Oct 17, 2010, 8:41:53 PM
Author: jazz
Comment: --

jazz/10-10-17

-                      v6
+                      v7
  * The [http://www.slideshare.net/sematext/projecthub sematext ProjectHub slides] describe the architecture behind sites such as http://search-hadoop.com/ and http://search-lucene.com/.
    * [[Image(sematext_architecture.png,width=800)]]
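+ * When Droids is a good fit: the data volume is relatively small, the crawl scope is very narrow, and there are no scalability requirements. Reference: [http://www.listware.net/201006/lucene-solr-user/60153-solr-and-nutchdroids-to-use-or-not-to-use.html Solr and Nutch/Droids - to use or not to use?]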
+{{{
+From what I know, Droids is just the crawler with an in-memory queue + link extractor. We did use it for crawling Lucene project sites (for the index on http://search-lucene.com/ ), but that is because the data volume is low, the crawl very narrow, scaling requirements low, etc.
+}}}
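The quote above describes the crawler as an in-memory queue plus a link extractor. As a rough illustration of that shape (not Droids code, which is Java), here is a minimal Python sketch. The names `LinkExtractor`, `crawl`, `fetch`, and `PAGES` are all hypothetical, and the "site" is a small dictionary so the example runs without network access. Restricting the crawl to the seed's host stands in for the "very narrow" crawl mentioned in the quote.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=50):
    """Breadth-first crawl restricted to the seed's host.

    `fetch` is a hypothetical callable: url -> HTML string, or None on failure.
    Returns the URLs that were fetched successfully, in visit order.
    """
    host = urlparse(seed).netloc
    queue = deque([seed])  # the in-memory queue: nothing is persisted
    seen = {seed}          # URLs already queued, to avoid revisiting
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            link, _fragment = urldefrag(urljoin(url, href))
            # Narrow crawl: stay on the seed's host only.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "site" used in place of real HTTP requests.
PAGES = {
    "http://example.org/": '<a href="/a">A</a> <a href="http://other.example/x">X</a>',
    "http://example.org/a": '<a href="/">home</a> <a href="b#top">B</a>',
    "http://example.org/b": "<p>no links</p>",
}

if __name__ == "__main__":
    print(crawl("http://example.org/", PAGES.get))
```

Because both the queue and the `seen` set live in memory, this design only suits small, narrow crawls, which matches the conditions listed above; a larger crawl would need a persistent, distributed frontier.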
+ * This article also explains the AJAX crawling problem that had puzzled me for a long time:
+   * [http://www.ajaxprojects.com/ajax/newsdetails.php?itemid=178 Crawling AJAX]
+{{{
+Shreeraj Shah's paper, Crawling Ajax-driven Web 2.0 Applications, does a nice job of
+describing the "event-driven" approach to web crawling.
+It has following three key components
+. Javascript analysis and interpretation with linking to Ajax
+. DOM event handling and dispatching
+. Dynamic DOM content extraction
+The easiest way to implement an AJAX-enabled, event-driven crawler is to use Watir and
+Crowbar, that will allow you to control Firefox or IE from code, allowing you to extract
+page data after it has processed any Javascript.
+}}}
+   * Usable tools include [http://watir.com/ Watir], a Ruby-based tool that can control IE, and [http://simile.mit.edu/wiki/Crowbar Crowbar], which lets you control Firefox through GET/PUT requests. Both are BSD-licensed.
 == Presentation Skills ==