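| 11 | * When Droids is a good fit: the data volume is relatively small, the crawl scope is very narrow, and there are no scalability requirements. - [Reference] [http://www.listware.net/201006/lucene-solr-user/60153-solr-and-nutchdroids-to-use-or-not-to-use.html Solr and Nutch/Droids - to use or not to use?] |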
| 12 | {{{ |
| 13 | From what I know, Droids is just the crawler with an in-memory queue + link extractor. We did use it for crawling Lucene project sites (for the index on http://search-lucene.com/ ), but that is because the data volume is low, the crawl very narrow, scaling requirements low, etc. |
| 14 | }}} |
| 15 | * In addition, this article answers a question I had long been unsure about: how to crawl AJAX-driven pages |
| 16 | * [http://www.ajaxprojects.com/ajax/newsdetails.php?itemid=178 Crawling AJAX] |
| 17 | {{{ |
| 18 | Shreeraj Shah's paper, Crawling Ajax-driven Web 2.0 Applications, does a nice job of |
| 19 | describing the "event-driven" approach to web crawling. |
| 20 | |
| 21 | It has following three key components |
| 22 | |
| 23 | 1. Javascript analysis and interpretation with linking to Ajax |
| 24 | 2. DOM event handling and dispatching |
| 25 | 3. Dynamic DOM content extraction |
| 26 | |
| 27 | The easiest way to implement an AJAX-enabled, event-driven crawler is to use Watir and |
| 28 | Crowbar, that will allow you to control Firefox or IE from code, allowing you to extract |
| 29 | page data after it has processed any Javascript. |
| 30 | }}} |
| 31 | * Usable tools include [http://watir.com/ Watir], a Ruby-based library that can control IE, and [http://simile.mit.edu/wiki/Crowbar Crowbar], which can control Firefox through HTTP GET/POST requests; both are BSD-licensed. |
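
The mailing-list quote above describes Droids as "just the crawler with an in-memory queue + link extractor". To make that shape concrete, here is a minimal sketch in Python using only the standard library. It is not Droids code and does not use the Droids API; the names `LinkExtractor`, `extract_links`, `crawl`, and the `fetch` parameter are all hypothetical. The `fetch` callable is left pluggable on purpose: for plain pages it could be an ordinary HTTP download, while for AJAX-driven pages it could, in principle, be backed by a browser-driving tool such as Watir or Crowbar so that the HTML is captured after JavaScript has run (that integration is an assumption and is not shown here).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs for the links found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seed, fetch, max_pages=50):
    """Breadth-first crawl that stays on the seed's host.

    fetch(url) is a hypothetical callable returning the page HTML as a
    string, or None if the page could not be retrieved.
    """
    host = urlparse(seed).netloc
    queue = deque([seed])   # the in-memory queue of URLs still to visit
    seen = {seed}           # URLs already queued, to avoid revisiting
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(url, html):
            # Keep the crawl narrow: same host only.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Because both the queue and the `seen` set live in memory, a crawler of this shape only suits small, narrow crawls, which matches the conditions listed above for when Droids is a reasonable choice.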