Timestamp: Aug 19, 2010, 4:28:32 PM
Author: shunfa
= Crawlzilla Web Interface =

The management interface is available by default at http://localhost:8080 or http://ServerIP:8080. After logging in, the home page looks like this:

[[image(1.png)]]

== Setting the Web Administrator Password ==

When entering the web interface for the first time, you must reset the administrator password (the default password is: crawler). After setting the password, click Submit and log in again to start using the system.

[[image(2.png)]]

== Building the First Search Engine ==

=== Step1. Start All Computing Services ===

Because crawling runs on Hadoop, confirm that the following services are running before starting a crawl; if any are stopped, start them in order:

 * Namenode and Jobtracker
 * Datanode and Tasktracker (at least one compute node must be running)

If you are unfamiliar with the startup steps, see the [wiki:crawlzilla/sysmanagement_zh system management guide].
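
A quick way to confirm the daemons above is `jps`, which lists running Java processes by name. The helper below is a minimal sketch, not part of Crawlzilla: it takes `jps`-style output and reports which required daemons are missing, using the Hadoop 0.20-era process names.

```shell
# Required Hadoop daemons, per the list above (Hadoop 0.20-era names).
required="NameNode JobTracker DataNode TaskTracker"

check_daemons() {
  # $1: `jps`-style output, one "pid ProcessName" per line
  missing=""
  for d in $required; do
    printf '%s\n' "$1" | grep -qw "$d" || missing="$missing $d"
  done
  if [ -z "$missing" ]; then
    echo "all required daemons running"
  else
    echo "missing:$missing"
  fi
}

# On a live node you would run: check_daemons "$(jps)"
# Demo with canned output from a node where only two daemons are up:
check_daemons "1234 NameNode
2345 JobTracker
3456 Jps"
```

If any daemon is reported missing, start the services as described in the system management guide before submitting a crawl.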

=== Step2. Configure Crawl Settings on the Crawl Page ===

Fill in, in order: the index name, the URL(s) to crawl (multiple lines are allowed, as shown), and the crawl depth, then submit.

[[image(3.png)]]

After submitting, the page appears as shown; the waiting time depends on each machine's computing speed.

[[image(4.png)]]
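
The form fields above can be sanity-checked before submission. This is a hypothetical local helper, not a Crawlzilla API: it validates the same three inputs the Crawl page asks for (index name, seed URLs, depth), so malformed entries are caught before the job is queued.

```shell
validate_crawl_job() {
  # $1: index name  $2: whitespace-separated seed URLs  $3: crawl depth
  [ -n "$1" ] || { echo "index name required"; return 1; }
  for u in $2; do
    # Seed URLs must be web URLs; anything else is rejected.
    case "$u" in
      http://*|https://*) : ;;
      *) echo "bad seed URL: $u"; return 1 ;;
    esac
  done
  # Depth must be a number (digits only).
  case "$3" in
    ''|*[!0-9]*) echo "depth must be a number"; return 1 ;;
  esac
  echo "ok: index '$1', depth $3"
}

validate_crawl_job "myindex" "http://localhost/ http://example.com/" 3
```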

=== Step3. Monitor Crawl Progress ===

The system status page shows the crawl progress in real time.

[[image(5.png)]]

When "Finish" appears, the index has been built and the message can be deleted.

[[image(6.png)]]

=== Step4. Index Operations ===

==== Browsing an Index ====

==== Deleting an Index ====

=== Step5. Test the Search Engine ===

== Other Features ==