{{{
#!html
crawlzilla new version
v 1.0
}}}
[[PageOutline]]

= Goals =
 * Multi-user version
 * Updated web interface
 * New features such as scheduling
 * Upgrade nutch to version 1.2
 * Install/test mode straight from the svn repository
 * Slave installation can be guided through the web interface

= System analysis =
== Directory structure ==
 * /home/crawler/crawlzilla
|| Directory 1 || Directory 2 || Description ||
|| ./workspace/ || || hadoop working directory ||
|| ./meta/ || || intermediate files generated by dialog, and system configuration files ||
|| ./meta/tmp/ || || temporary files ||
|| ./user/ || || described below ||

 * Directory layout under /home/crawler/crawlzilla/user
|| Directory 1 || Directory 2 || Description ||
|| [admin,_username_]/ || || admin always exists; _username_ is any user added later ||
|| || ./webs/ || holds the search web pages (note 1) ||
|| || ./webs/_DBName_/ || search web pages named _DBName_ ||
|| || ./IDB/ || holds the user's completed indexDB directories ||
|| || ./IDB/_DBName_/ || _DBName_ is the index database name ||
|| || ./IDB/_DBName_/meta/ || meta holds the files related to each index database ||
|| || ./IDB/_DBName_/index~segments/ || index~segments are the five directories required by the lucene db ||
|| || ./tmp/ || holds the user's unfinished indexDB directories ||
|| || ./tmp/_DBName_/ || _DBName_ is the index database name ||
|| || ./tmp/_DBName_/meta/ || meta holds the files related to each index database ||
|| || ./meta/ || the user's personal information, such as pwd and email ||
|| || ./meta/_crontab.conf || the user's personal schedule information ||
|| || ./old/_DBName_/ || _DBName_ is the index database name ||

 * /opt/crawlzilla/
|| Directory 1 || Directory 2 || Description ||
|| ./tomcat/ || || tomcat ||
|| ./tomcat/ || ./webapps/_username_/_DBName_ || maps to the _DBName_ index database of _username_ (note 1) ||
|| ./nutch/ || || nutch directory ||
|| ./slave/ || || files needed for slave installation ||
|| ./main/ || || crawlzilla executables ||

Note 1: /home/crawler/crawlzilla/user/_username_/webs/_DBName_ ==linked to==> /opt/crawlzilla/tomcat/webapps/_username_/_DBName_ (the /opt path is a symbolic link pointing at the /home path) [[BR]]
e.g. ln -sf /home/crawler/crawlzilla/user/admin/webs/test_3 /opt/crawlzilla/tomcat/webapps/admin/test_3

 * /var/log/crawlzilla/
|| Directory 1 || Directory 2 || Description ||
|| ./hadoop-logs/ || || ||
|| ./hadoop-pids/ || || ||
|| ./shell-logs/ || || ||
|| ./tomcat-logs/ || || ||

== Old vs. new file/directory mapping ==
|| Old || ==> || New || Description ||
|| /home/crawler/crawlzilla/logs || ==> || || this link is removed ||
|| /home/crawler/crawlzilla/nutch || ==> || || this link is removed ||
|| /home/crawler/crawlzilla/tmp || ==> || /home/crawler/crawlzilla/meta/tmp || ||
|| /home/crawler/crawlzilla/source || ==> || /opt/crawlzilla/slave || ||
|| /home/crawler/crawlzilla/archieve/_DBName_ || ==> || /home/crawler/crawlzilla/user/admin/IDB/_DBName_ || ||
|| /home/crawler/crawlzilla/urls/urls.txt || ==> || /home/crawler/crawlzilla/user/admin/tmp/_DBName_/meta/urls/urls.txt || ||
|| /home/crawler/crawlzilla/.metadata/_DBName_ || ==> || /home/crawler/crawlzilla/user/admin/IDB/_DBName_/meta || (note 2) ||
|| /home/crawler/crawlzilla/.menu_tmp || ==> || /home/crawler/crawlzilla/meta/menu_tmp || ||
|| /home/crawler/crawlzilla/system/ || ==> || described below || ||

Note 2: Up to version 0.3, the intermediate IDB data was kept in /home/crawler/crawlzilla/.metadata/ whether or not the index database was finished. From version 1.0 on, data for an unfinished index database lives in /home/crawler/crawlzilla/user/admin/tmp/_DBName_/meta and is moved to /home/crawler/crawlzilla/user/admin/IDB/_DBName_/meta once it is finished.

 * /home/crawler/crawlzilla/system:
|| Old || ==> || New || Description ||
|| executables || ==> || /opt/crawlzilla/main/(executable) || e.g. crawlzilla, install, go.sh ... ||
|| lang/ || ==> || /opt/crawlzilla/main/lang/ || language files directory ||
|| hosts || ==> || /home/crawler/crawlzilla/meta/hosts || ||
|| hosts.old || ==> || /home/crawler/crawlzilla/meta/hosts.old || ||
|| hosts.bak || ==> || /home/crawler/crawlzilla/meta/hosts.bak || ||
|| version || ==> || /opt/crawlzilla/version || ||
|| crawl_nodes || ==> || /home/crawler/crawlzilla/meta/crawl_nodes || ||
|| crawl_nodes.bak || ==> || /home/crawler/crawlzilla/meta/crawl_nodes.bak || ||
|| crawl_nodes.old || ==> || /home/crawler/crawlzilla/meta/crawl_nodes.old || ||
|| .passwd || ==> || /home/crawler/crawlzilla/user/admin/meta/.passwd || ||

== Additions ==
 * /home/crawler/crawlzilla/user/$USERNAME/meta/crawl_schedule
   * each user's "schedule" information (intermediate data used for crontab)
 * /home/crawler/crawlzilla/user/$USERNAME/tmp/$JNAME/meta/starttime
   * total number of seconds since 1970/01/01
 * /home/crawler/crawlzilla/user/$USERNAME/tmp/$JNAME/meta/begindate
   * formatted as 20110316-17:28:22
 * /home/crawler/crawlzilla/user/$USERNAME/IDB/$JNAME/meta/passtime
   * formatted as 17:28:22

== Environment variables ==
{{{
#!sh
# env
OptMain="/opt/crawlzilla/main"
OptWebapp="/opt/crawlzilla/tomcat/webapps"
OptNutchBin="/opt/crawlzilla/nutch/bin"
HomeUserDir="/home/crawler/crawlzilla/user"
HdfsHome="/user/crawler"
# local parameters (per user; USERNAME must already be set)
HomeUserTmp="$HomeUserDir/$USERNAME/tmp"
HomeUserIDB="$HomeUserDir/$USERNAME/IDB"
HomeUserWeb="$HomeUserDir/$USERNAME/webs"
HomeUserMeta="$HomeUserDir/$USERNAME/meta"
}}}

(The following will be deprecated)
{{{
#!text
 * Crawlzilla_Install_PATH="/opt/crawlzilla"
 * Tomcat_HOME="/opt/crawlzilla/tomcat"
 * Crawlzilla_HOME="/home/crawler/crawlzilla"
 * Work_Path=$Crawlzilla_HOME/system
 * Manu_Tmp_Path="/home/crawler/crawlzilla/meta"
 * Hadoop_Daemon="/opt/crawlzilla/nutch/bin/hadoop-daemon.sh"
 * PID_Dir="/var/log/crawlzilla/hadoop-pids"
 * Crawl_Nodes=$Crawlzilla_HOME/meta/crawl_nodes
}}}

= Features =
== Queued crawl program: go.sh ==
 * /opt/crawlzilla/main/go.sh
 * go.sh uses lib_crawl_default.sh as a template, feeds the parameters into lib_crawl_tmp.sh, and finally uses at to invoke lib_crawl_tmp.sh, which in turn runs lib_crawl_go.sh
 * username: the user name, required, e.g. admin
 * jobname: the job name, required, e.g. 0316
 * depth: the crawl depth, required, e.g. 1~5
{{{
#!graphviz
digraph G {
  rankdir = "LR"
  "go.sh" -> "lib_crawl_tmp.sh"
  "go.sh" -> "lib_crawl_default.sh"
  "lib_crawl_default.sh" -> "lib_crawl_tmp.sh"
  "lib_crawl_tmp.sh" -> "lib_crawl_go.sh" -> "prepare & nutch & web pages"
}
}}}
 * Note: the shell script passed to at -f must not be too large. It has to stay within a few lines or it is not executed, which is why several shell scripts now do what a single go.sh did before.
 * Note: using at makes the whole procedure look roundabout, but it solves the problem that crontab cannot run long-running processes.
 * (Deprecated as of 20110401) go.sh !["redo"]
   * redo: whether to re-crawl, optional; yes = "redo", no = ""

== Direct crawl program: lib_crawl_go.sh ==
 * /opt/crawlzilla/main/go.sh
 * Runs directly, without being handed to at (this is the final stage of go.sh)

== Crawl preparation: prepare_go.sh ==
 * /opt/crawlzilla/main/prepare_go.sh
 * Convenient for testing from the command line; it generates the specified ... and asks whether to launch go.sh

== Crawl repair program: fix.sh ==
 * /opt/crawlzilla/main/fix.sh
 * fix.sh now also handles the follow-up steps after an index database has been imported
 * Put the index database indexpool in the tmp directory under username and run the following command; the program then completes the creation of the index database indexpool automatically
{{{
fix.sh username indexpool
}}}

= Performance tests =
[wiki:crawlzilla-1.0-performance]
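
The per-user directory layout and the link from note 1 can be sketched in shell as follows. This is only an illustration, not the crawlzilla installer: the function names make_user_tree and link_search_site are hypothetical, and only the HomeUserDir and OptWebapp values come from the environment variables listed on this page.

```shell
#!/bin/sh
# Sketch only: the two functions below are hypothetical helpers, not part of crawlzilla.
# The two paths are the ones given in the environment variable list.
HomeUserDir="/home/crawler/crawlzilla/user"
OptWebapp="/opt/crawlzilla/tomcat/webapps"

# make_user_tree USERNAME
# Create the webs/, IDB/, tmp/, meta/ and old/ directories for one user.
make_user_tree() {
    for sub in webs IDB tmp meta old; do
        mkdir -p "$HomeUserDir/$1/$sub" || return 1
    done
}

# link_search_site USERNAME DBNAME
# Expose the user's search pages to tomcat through a symbolic link (note 1).
link_search_site() {
    mkdir -p "$HomeUserDir/$1/webs/$2" "$OptWebapp/$1" || return 1
    # -n replaces an existing link instead of creating a nested one on re-runs.
    ln -sfn "$HomeUserDir/$1/webs/$2" "$OptWebapp/$1/$2"
}
```

To try it without touching the real installation, point HomeUserDir and OptWebapp at a scratch directory before calling make_user_tree admin and link_search_site admin test_3.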
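
The starttime, begindate and passtime files can be produced along the following lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, passtime is assumed to hold the elapsed crawl time (the page only gives its format), and date +%s is a widely available extension rather than strict POSIX.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers, not the code crawlzilla actually uses.

# format_passtime SECONDS
# Print a number of elapsed seconds as HH:MM:SS.
format_passtime() {
    printf '%02d:%02d:%02d\n' $(( $1 / 3600 )) $(( $1 % 3600 / 60 )) $(( $1 % 60 ))
}

# record_start TMP_META_DIR
# Write starttime (seconds since 1970/01/01) and begindate (e.g. 20110316-17:28:22).
record_start() {
    mkdir -p "$1" || return 1
    date +%s > "$1/starttime"
    date +%Y%m%d-%H:%M:%S > "$1/begindate"
}

# record_passtime TMP_META_DIR IDB_META_DIR
# Assumption: passtime is the time elapsed since starttime.
record_passtime() {
    start=$(cat "$1/starttime") || return 1
    now=$(date +%s)
    mkdir -p "$2" || return 1
    format_passtime $(( now - start )) > "$2/passtime"
}
```

For example, format_passtime 3723 prints 01:02:03.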