Changes between Version 6 and Version 7 of jazz/10-10-17


Timestamp:
Oct 17, 2010, 8:41:53 PM (14 years ago)
Author:
jazz
Comment:

--

  • jazz/10-10-17

    v6 v7  
    99 * The [http://www.slideshare.net/sematext/projecthub sematext ProjectHub slides] describe the architecture behind sites such as http://search-hadoop.com/ and http://search-lucene.com/.
    1010   * [[Image(sematext_architecture.png,width=800)]]
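     11 * When Droids is a good fit: the data volume is relatively small, the crawl scope is very narrow, and there are no scalability requirements. - [Reference] [http://www.listware.net/201006/lucene-solr-user/60153-solr-and-nutchdroids-to-use-or-not-to-use.html Solr and Nutch/Droids - to use or not to use?]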
     12{{{
     13From what I know, Droids is just the crawler with an in-memory queue + link extractor. We did use it for crawling Lucene project sites (for the index on http://search-lucene.com/ ), but that is because the data volume is low, the crawl very narrow, scaling requirements low, etc.
     14}}}
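The "crawler with an in-memory queue + link extractor" design quoted above can be illustrated with a minimal sketch. This is not Droids code (Droids is a Java framework); it is a Python stand-in written only to show the idea. The names `LinkExtractor`, `fetch_http`, `crawl`, and the `max_pages` limit are hypothetical, and restricting the crawl to the seed's host is an assumption used here to model a "very narrow" crawl.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_http(url):
    """Default fetcher (hypothetical helper): download a page as text."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(seed, fetch=fetch_http, max_pages=50):
    """Breadth-first crawl that stays on the seed's host.

    Everything lives in memory, so this only suits small, narrow crawls.
    """
    host = urlparse(seed).netloc
    queue = deque([seed])   # the in-memory queue of URLs still to visit
    seen = {seed}           # URLs already queued, to avoid revisiting
    pages = {}              # url -> fetched HTML
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue        # skip pages that fail to download
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop the fragment part.
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing a custom `fetch` function (for example one that reads from a dictionary of canned pages) lets the queue and link-extraction logic be tried without network access. Because the queue and the `seen` set are plain in-memory structures, nothing survives a restart and nothing is shared between machines, which is consistent with the quote's point that this approach fits low-volume crawls only.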
     15 * In addition, this article explains the AJAX crawling problem that had long puzzled me
     16   * [http://www.ajaxprojects.com/ajax/newsdetails.php?itemid=178 Crawling AJAX]
     17{{{
     18Shreeraj Shah's paper, Crawling Ajax-driven Web 2.0 Applications, does a nice job of
     19describing the "event-driven" approach to web crawling.
     20
     21It has following three key components
     22
     231. Javascript analysis and interpretation with linking to Ajax
     242. DOM event handling and dispatching
     253. Dynamic DOM content extraction
     26
     27The easiest way to implement an AJAX-enabled, event-driven crawler is to use Watir and
     28Crowbar, that will allow you to control Firefox or IE from code, allowing you to extract
     29page data after it has processed any Javascript.
     30}}}
     31   * Available tools include [http://watir.com/ Watir], a Ruby-based tool that can control IE, and [http://simile.mit.edu/wiki/Crowbar Crowbar], which can control Firefox through GET/PUT requests. Both are BSD-licensed.
    1132
    1233== Presentation Skills ==