Version 9 (modified by jazz, 14 years ago)
--

2011-10-28

Semantic Web & Crawlzilla

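Continuing from 2010-11-14
- I noticed that Google Reader shows a feed's historical entries when you subscribe to it, so I have been wondering whether historical RSS could serve as a source of data to crawl (e.g. as input for 抓抓龍). I looked for existing approaches, and a Google article explains a workable one: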
  - Reconstruct a Feed's History Using Google Reader
```
http://www.google.com/reader/atom/feed/FEED_URL?r=n&n=NUMBER_OF_ITEMS
```
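
As a minimal sketch of filling in the template above, the following Python builds the history URL for a given feed. The function name `reader_history_url` and the example feed address are hypothetical, and percent-encoding FEED_URL is an assumption on my part rather than something the article is quoted as requiring here. The snippet only builds the string; it makes no request and does not check whether the endpoint still responds.

```python
from urllib.parse import quote

# Base of the URL template shown above.
READER_FEED_BASE = "http://www.google.com/reader/atom/feed/"


def reader_history_url(feed_url: str, number_of_items: int) -> str:
    """Build the history URL for one feed, following the template above."""
    if number_of_items <= 0:
        raise ValueError("number_of_items must be positive")
    # Assumption: FEED_URL is percent-encoded so that its own '?' and '&'
    # do not collide with the r= and n= query parameters.
    encoded = quote(feed_url, safe="")
    return f"{READER_FEED_BASE}{encoded}?r=n&n={number_of_items}"


# Hypothetical feed address, used only for illustration.
print(reader_history_url("http://example.com/rss", 100))
```
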
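Continuing from 2010-11-15
- ReadItLater API: use it to crawl the URLs I routinely bookmark
- Evernote API: anyone who takes notes in Evernote could presumably also use it to compute statistics over their note content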

Crawlzilla

<Application> Bookmark analysis!!
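On the readitlater website, I used the export HTML feature to export my unread bookmarks as an HTML file, then uploaded it to http://cloud.nchc.org.tw/~jazz/ril_export.html

I then used demo.crawlzilla.info and configured a crawl of depth two: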
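
To see which URLs such an export contains before handing it to a crawler, the bookmarks can be pulled out with Python's standard `html.parser`. This is only a sketch: it assumes the export is ordinary HTML whose bookmarks are `<a href="...">` anchors, which I have not verified against the real file, and the `LinkCollector` class, `extract_links` function and sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect absolute http(s) links from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def extract_links(html_text):
    """Return bookmark URLs in the order they appear in the document."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hypothetical stand-in for the exported file's contents.
sample = """
<ul>
  <li><a href="http://highscalability.com/some-post">A post</a></li>
  <li><a href="http://lwn.net/Articles/1/">Another</a></li>
  <li><a href="#top">Back to top</a></li>
</ul>
"""
print(extract_links(sample))
```

Relative links such as `#top` are skipped on purpose, since only external pages are useful as crawl targets.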

Index name: ril
Search engine index location: /home/crawler/crawlzilla/user/jazz/IDB/ril/index
Search engine status: OK
Crawl depth: 2
Created at: 20111028-16:57:36
Run time: 0:19:4
Seed URL: http://cloud.nchc.org.tw/~jazz/ril_export.html
Total word count: 241996
Number of documents: 3677
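
What a crawl depth of 2 means in practice can be sketched with a toy link graph. This assumes depth counts fetch rounds, so that depth 1 fetches only the seed page and depth 2 also fetches the pages it links to; that is my reading, not a documented statement of how Crawlzilla counts. The `crawl` function and the page names are hypothetical, and nothing is downloaded.

```python
def crawl(seed, links, depth):
    """Return the pages fetched within `depth` rounds, starting from `seed`.

    `links` maps each page to the pages it links to (a stand-in for
    real fetching and link extraction).
    """
    fetched = set()
    frontier = [seed]
    for _ in range(depth):
        next_frontier = []
        for page in frontier:
            if page in fetched:
                continue  # never fetch the same page twice
            fetched.add(page)
            next_frontier.extend(links.get(page, []))
        frontier = next_frontier
    return fetched


# Toy graph: the export page links to two bookmarks, one of which links on.
toy_links = {
    "ril_export.html": ["a.html", "b.html"],
    "a.html": ["c.html"],
    "b.html": [],
}
print(sorted(crawl("ril_export.html", toy_links, 2)))
```

Under that assumption, `c.html` is discovered in round two but never fetched, because a third round would be needed.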

Resulting index - http://demo.crawlzilla.info/jazz_ril/zh/
The statistics show the top fifty sources I follow:

0	http://www.digitimes.com.tw	204
1	http://www.bnext.com.tw	112
2	http://groups.google.com	74
3	http://www.theregister.co.uk	56
4	http://highscalability.com	52
5	http://www.ithome.com.tw	49
6	http://www.cloudera.com	48
7	http://gigaom.com	44
8	http://www.networkworld.com	38
9	http://en.wikipedia.org	38
10	http://www.zdnet.com.tw	36
11	http://www.howtoforge.com	33
12	http://wiki.apache.org	32
13	http://www.ibm.com	28
14	http://nosql.mypopescu.com	28
15	http://www.freegroup.org	28
16	http://ajaxian.com	27
17	http://www.linuxfordevices.com	25
18	http://news.networkmagazine.com.tw	24
19	http://ieeexplore.ieee.org	24
20	http://insidehpc.com	23
21	http://www.readwriteweb.com	23
22	http://www.linux-mag.com	23
23	http://www.nosqldatabases.com	21
24	http://only-perception.blogspot.com	21
25	http://www.inside.com.tw	19
26	http://www.linkedin.com	19
27	http://www.openfoundry.org	18
28	http://www.sys-con.com	17
29	http://www.hortonworks.com	16
30	http://news.cnet.com	16
31	http://people.debian.org.tw	16
32	http://www.h-online.com	16
33	http://www.slideshare.net	15
34	http://blog.sematext.com	15
35	http://packages.debian.org	14
36	http://lwn.net	14
37	http://sourceforge.net	14
38	http://virtualization.info	13
39	http://www.infoq.com	13
40	http://radar.oreilly.com	13
41	http://blog.gslin.org	13
42	http://gevaperry.typepad.com	13
43	http://www.cyberciti.biz	12
44	http://blog.roodo.com	12
45	http://www.libthomas.org	12
46	http://www.runpc.com.tw	11
47	http://blog.opennebula.org	11
48	http://cloudsecurity.trendmicro.com	11
49	http://developer.yahoo.com	11
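
A ranking like the one above can be reproduced from any list of crawled URLs by grouping on scheme and host. The sketch below is not how Crawlzilla computes its statistics; `top_sources` and the sample URLs are hypothetical, and ranks start at 0 only to match the listing above.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_sources(urls, limit=50):
    """Count URLs per site (scheme://host) and return the most frequent."""
    counts = Counter()
    for url in urls:
        parts = urlsplit(url)
        # Skip anything that is not an absolute URL.
        if parts.scheme and parts.netloc:
            counts[f"{parts.scheme}://{parts.netloc}"] += 1
    return counts.most_common(limit)


# Made-up URLs for illustration only.
sample_urls = [
    "http://lwn.net/Articles/1/",
    "http://lwn.net/Articles/2/",
    "http://gigaom.com/cloud/x",
]
for rank, (site, count) in enumerate(top_sources(sample_urls)):
    print(rank, site, count, sep="\t")
```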
