Version 13 (modified by waue, 17 years ago)
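May Work Plan
Planned goals
- Build the Nutch/Hadoop project in Eclipse
- Complete the Hadoop tutorial example -> word count
- Complete the Nutch application example -> index data on the hard disk
- Check the CPS for errors
Work log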
5/20
- Some tools are available for building Hadoop in Eclipse: 1. Subclipse (an SVN plugin for Eclipse) 2. IBM MapReduce toolkit (an application of Hadoop in Eclipse)
- Install Subclipse
- Help -> Software Updates -> Find and Install... -> Search for new features... -> New Remote Site -> name: subclipse, site: http://subclipse.tigris.org/update
- Because of software dependency problems, I added both the old-version and the new-version update sites and checked both for installation; with that, the installation completed
- Window -> Show View -> Other… -> SVN -> SVN Repository -> new site: http://svn.apache.org/repos/asf/hadoop/core/
- Right-click menu -> Checkout… -> click Finish to complete
- If the error "Problem: Javahl interface is not available" appears, resolve it as follows:
- sudo apt-get install libsvn-javahl libsvn-dev
- sudo ln -s /usr/lib/jni/libsvnjavahl-1.so /usr/lib/jvm/java-6-sun/jre/lib/i386/client/libsvnjavahl-1.so
5/19
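- Continued testing Nutch compilation in Eclipse. What actually runs there are JAR files, which contain class files when unpacked, so how to modify the source code still needs study
- The 5/16 steps suddenly stopped working today; the fix was as follows:
- ssh localhost must not prompt for a password
- Check the configuration files, such as hadoop-env.sh, nutch-site.xml...
- If "connect localhost:9000 failed" appears => 1. hadoop namenode -format 2. start-all.sh 3. hadoop dfs -put urls urls, then run again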
5/16
- Thanks to sunni's guidance, Nutch now builds successfully in Eclipse
- File ==> New ==> Project ==> Java Project ==> Next ==> Project name (set to nutch0.9) ==> Contents ==> Create project from existing source (select the path where Nutch is stored) ==> Finish.
- At this point 366 errors appear. Apply the fix found online: put the two JARs (jid3lib-0.5.1.jar and rtf-parser.jar) into the lib folder of nutch-0.9. In Eclipse, right-click nutch0.9 ==> Properties... ==> Java Build Path ==> Libraries ==> Add External JARs... ==> select the two downloaded JARs ==> OK
- Many errors still remain at this point. The fix: in Eclipse, right-click nutch0.9 ==> Properties... ==> Java Build Path ==> Source ==> remove every entry shown with a folder icon and add only nutch/conf
- All errors should now be cleared. Next, edit nutch-site.xml, crawl-urlfilter.txt, hadoop-site.xml, and hadoop-env.sh in nutch/conf, add urls/urls.txt under nutch/, and write the URLs to crawl into urls.txt
- Menu Run > "Run..." ==> create "New" for "Java Application"
- set in Main class = org.apache.nutch.crawl.Crawl
- on tab Arguments:
- Program Arguments = urls -dir crawl -depth 3 -topN 50
- in VM arguments: -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
- click on "Run"
5/15
- building nutch in eclipse
- Helped F. P. Lin apply for an NCHC CA certificate to join PRAGMA Grid
5/14
- Added Nutch to Eclipse for building, but errors occurred
- Configure Nutch
- File > New > Project > "Java project" > click Next
- Name the project nutch
- Select "Create project from existing source" and use the location where you downloaded Nutch
- Click on Next, and wait while Eclipse is scanning the folders
- Libraries (the third tab): Add Class Folder -> "conf"
- Eclipse should have guessed all the java files that must be added on your classpath. If it's not the case, add "src/java", "src/test" and all plugin "src/java" and "src/test" folders to your source folders. Also add all jars in "lib" and in the plugin lib folders to your libraries
- Set output dir to "tmp_build", create it if necessary
5/13
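- Checked whether the CPS on Trac matches the [APGrid-approved version] provided by Weicheng, and corrected it where needed
- Paste the Word document content into a text file, ori.txt
- Paste the content on Trac into a text file, new.txt
- Remove formatting differences (e.g. "\n\n" -> "\n" (double line break -> single line break), "._", "_*_", ":_\n")
- vimdiff new.txt ori.txt
- Result of the check: http://trac.nchc.org.tw/gocca/wiki/CPSnew?action=diff&version=27&old_version=26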
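The normalize-then-compare procedure above can also be scripted. The following is a minimal Python sketch, not the method actually used in this log (that was manual editing plus vimdiff). The helper names `normalize` and `diff_texts`, the replacement rules, and the sample strings are all hypothetical; in particular, how each marker such as "._" should be rewritten is an assumption.

```python
import difflib
import re


def normalize(text, replacements=()):
    """Apply caller-supplied (old, new) replacements, then collapse blank lines."""
    for old, new in replacements:
        text = text.replace(old, new)
    # Two or more consecutive line breaks become a single one.
    return re.sub(r"\n{2,}", "\n", text)


def diff_texts(a, b, a_name="ori.txt", b_name="new.txt"):
    """Return a unified diff of two texts as a list of lines (empty if equal)."""
    return list(difflib.unified_diff(
        a.splitlines(), b.splitlines(), a_name, b_name, lineterm=""))


if __name__ == "__main__":
    # Hypothetical rules: the right rewrite for each marker is an assumption.
    rules = [("._", "."), (":_\n", ":\n")]
    # Made-up sample text standing in for the Word export and the Trac page.
    ori = normalize("1. Scope.\n\nSample clause:\n")
    new = normalize("1. Scope._\n\n\nSample clause:_\n", rules)
    for line in diff_texts(ori, new):
        print(line)
```

Passing the rules in as data keeps the markers visible in one place, so a wrong guess about a marker can be corrected without touching the comparison code.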
5/12
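- Published NCHC CP/CPS v1.1.3 (APGrid-approved) on the CA website and Trac
The main difficulty: the version finally confirmed and approved by weicheng was a Word file, which had to be converted to HTML and wiki format. The steps were:
- CPS 1.1.3 in doc format -> text file new.txt
- CPS 1.1.0 in HTML format on Trac -> text file old.txt
- vimdiff new.txt old.txt to review the differences, then update the old CPS 1.1.0 content on Trac to the new version
- Save the new CPS 1.1.3 from Trac to local disk and edit it with KompoZer into the format of the original CA website
- Upload it and replace the old version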
5/8
- For security reasons, restricted which IP addresses may browse Nutch by editing conf/server.xml and adding:
<Context path="/path/to/secret_files" ...> <Valve className="org.apache.catalina.valves.RemoteAddrValve" allow="127.0.0.1" deny=""/> </Context>
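The allow and deny attributes are regular expressions tested against the client address. A rough Python model of that check follows, assuming one pattern per attribute, a match against the whole address, and deny being evaluated before allow. `is_allowed` is a hypothetical helper for reasoning about patterns, not Tomcat code.

```python
import re


def is_allowed(remote_addr, allow="", deny=""):
    """Approximate an allow/deny address filter (a model, not Tomcat code)."""
    if deny and re.fullmatch(deny, remote_addr):
        return False  # a deny match always rejects
    if allow:
        return re.fullmatch(allow, remote_addr) is not None
    # Only deny is set: everything that was not denied passes.
    # With neither attribute set, this model rejects (an assumption).
    return bool(deny)


if __name__ == "__main__":
    for addr in ["127.0.0.1", "192.168.1.20"]:
        print(addr, is_allowed(addr, allow=r"127\.0\.0\.1"))
```

Escaping the dots in the pattern keeps each one from matching an arbitrary character.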
5/7
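- Nutch now works on the Management Regulations section and parses PDF and Word content. The change was to add the following to nutch-site.xml:
<property> <name>plugin.includes</name> <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msword|rss|rtf|oo|msexcel|mspowerpoint)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value> <description>... </description> </property>
The names inside parse-(text|html|js|pdf|msword|rss|rtf|oo|msexcel|mspowerpoint) must correspond to the parse-XXX plugin names in the plugins directory. Each name is written without the parse- prefix, because the prefix already stands in front of the group.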
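A plugin.includes value can be checked against the plugin names before restarting a crawl. The following is a minimal Python sketch, assuming the value is a regular expression that must match a plugin id in full. The helper name `included_plugins`, the shortened value, and the plugin list are hypothetical and not read from a real installation.

```python
import re


def included_plugins(includes_value, plugin_ids):
    """Return the plugin ids that the plugin.includes value matches in full."""
    pattern = re.compile(includes_value)
    return [pid for pid in plugin_ids if pattern.fullmatch(pid)]


# Shortened, illustrative value and plugin ids (assumptions for the example).
value = "protocol-http|parse-(text|html|pdf|msword)|index-basic"
plugins = ["protocol-http", "parse-text", "parse-pdf", "parse-msword",
           "parse-mspowerpoint", "index-basic"]

if __name__ == "__main__":
    selected = included_plugins(value, plugins)
    print("included:", selected)
    print("not included:", [p for p in plugins if p not in selected])
```

A name written with its prefix inside the group, such as parse-(pdf|parse-mspowerpoint), would only match parse-parse-mspowerpoint, so the parse-mspowerpoint plugin would be left out.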
5/5
- Nutch works on the Management Regulations section, but the indexed content does not include PDF, Word, ...