
2012-01-04

AJAX Crawler / Crawling AJAX

  • 2010-10-17
  • <Reference> Crawling AJAX
    Shreeraj Shah's paper, Crawling Ajax-driven Web 2.0 Applications, does a nice job of 
    describing the "event-driven" approach to web crawling.
    
    It has the following three key components:
    
    1. JavaScript analysis and interpretation with linking to Ajax
    2. DOM event handling and dispatching
    3. Dynamic DOM content extraction
    
    The easiest way to implement an AJAX-enabled, event-driven crawler is to use Watir or 
    Crowbar, which let you control Firefox or IE from code and extract page data after 
    the browser has processed any JavaScript.
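
As a rough illustration of the DOM event handling step listed above, the sketch below finds the inline event handlers that an event-driven crawler would later need to fire. It uses only the Python standard library and does not execute any JavaScript; in practice a browser driven by a tool such as Watir would do that part. The names EventHandlerFinder and find_event_handlers are hypothetical, and treating every attribute that starts with "on" as an event handler is a simplifying assumption. Handlers attached from script (for example via addEventListener) are not visible to this approach.

```python
from html.parser import HTMLParser


class EventHandlerFinder(HTMLParser):
    """Collect (tag, attribute, handler code) for inline DOM event handlers."""

    def __init__(self):
        super().__init__()
        self.handlers = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # Heuristic (assumption): inline handlers are attributes such as
            # onclick or onchange. HTMLParser lowercases attribute names.
            if name.startswith("on") and value:
                self.handlers.append((tag, name, value))


def find_event_handlers(html):
    """Return the inline event handlers found in an HTML string, in document order."""
    finder = EventHandlerFinder()
    finder.feed(html)
    finder.close()
    return finder.handlers


if __name__ == "__main__":
    page = '<a href="#" onclick="loadNews()">News</a><p>static text</p>'
    for tag, event, code in find_event_handlers(page):
        print(tag, event, code)
```

Each entry in the returned list is a candidate event to dispatch; the crawler would fire it in a real browser and then extract the updated DOM.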
    
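  • Usable tools include Watir, a Ruby-based tool that can control IE, and Crowbar, which can control Firefox through GET/PUT requests. Both are BSD-licensed.
  • Making AJAX Applications Crawlable - Google proposed a workaround specification that lets AJAX applications and pages be found by search engines.
  • crawljax - an AJAX crawler written in Java, with many published papers behind it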
  • http://watij.com/ - Watij – Web Application Testing in Java
  • http://htmlunit.sourceforge.net/ - HtmlUnit is a "GUI-Less browser for Java programs"
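
The "Making AJAX Applications Crawlable" scheme listed above works, as far as these notes understand it, by having the crawler rewrite a "#!" URL into one that carries the page state in an _escaped_fragment_ query parameter, which the server answers with an HTML snapshot. The sketch below shows only that URL mapping from the crawler's side. The function name to_escaped_fragment_url is hypothetical, and the set of escaped characters (space and control characters, #, %, &, + and non-ASCII) is my reading of the specification, so check it against the spec before relying on it.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Printable ASCII that is left unescaped; '#', '%', '&' and '+' are escaped
# (assumption based on the character list in the specification).
_SAFE = "".join(chr(c) for c in range(0x21, 0x7F) if chr(c) not in "#%&+")


def to_escaped_fragment_url(pretty_url):
    """Map a '#!' URL to the '_escaped_fragment_' URL a crawler would request."""
    parts = urlsplit(pretty_url)
    if not parts.fragment.startswith("!"):
        # Not an AJAX-crawlable URL under this scheme; leave it unchanged.
        return pretty_url
    state = quote(parts.fragment[1:], safe=_SAFE)
    param = "_escaped_fragment_=" + state
    # Append to an existing query string if there is one.
    query = parts.query + "&" + param if parts.query else param
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


if __name__ == "__main__":
    print(to_escaped_fragment_url("http://www.example.com/ajax.html#!key=value"))
```

URLs without a "#!" fragment are returned unchanged, so the function can be applied to every link a crawler discovers.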
Last modified on Jan 24, 2012, 9:20:42 AM