= May Work Plan =
== Planned Goals ==
* [done 5/04] Check the CPS for errors
* [done 5/22] Hadoop tutorial example -> word count
* [done 5/24] Building Nutch/Hadoop project in Eclipse
* [done 5/27] "Programming map-reduce with Eclipse" document
* [new 5/28] map-reduce programming
* [new] Nutch application example -> indexing hard-disk data
== Work Milestones ==
=== 5/29 ===
* Created the tw.org.nchc. package files for use in Java imports
* Studied the Hadoop 0.16.4 source code (Hadoop 0.17.0 has been released, and some APIs have changed)
* Implemented an HBase example; the steps are as follows:
1. [http://ftp.twaren.net/Unix/Web/apache/hadoop/hbase/hbase-0.1.2/ Download the source] and unpack it
2. Configure conf/hbase-site.xml
{{{
<configuration>
  <property>
    <name>hbase.master</name>
    <value>example.org:60000</value>
    <description>The host and port that the HBase master runs at.</description>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://example.org:9000/hbase</value>
    <description>The directory shared by region servers.</description>
  </property>
</configuration>
}}}
3. Configure hbase-env.sh
{{{
JAVA_HOME=/usr/lib/jvm/java-6-sun
...
CLASSPATH=/home/waue/workspace/hadoop/conf
}}}
4. Set the hosts in the regionservers and slaves files
5. bin/start-hbase.sh (make sure HDFS is already running before executing this script)
* Although HBase is running, the sample code still reports errors (a sketch of the kind of client code involved follows below)
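For reference, here is a minimal write/read sketch against the 0.1-era HBase client API. The table name, column family, and values are invented for illustration, and method names shifted between early HBase releases, so treat this as an assumption-laden sketch rather than a verified program:
{{{
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTable;
import org.apache.hadoop.io.Text;

public class HBaseTest {
  public static void main(String[] args) throws Exception {
    // Reads conf/hbase-site.xml (hbase.master, hbase.rootdir) from the classpath
    HBaseConfiguration conf = new HBaseConfiguration();
    // Assumes a table "mytable" with column family "content:" already exists
    HTable table = new HTable(conf, new Text("mytable"));
    // Write one cell: row "row1", column "content:en"
    long lockid = table.startUpdate(new Text("row1"));
    table.put(lockid, new Text("content:en"), "hello".getBytes());
    table.commit(lockid);
    // Read the cell back and print it
    byte[] value = table.get(new Text("row1"), new Text("content:en"));
    System.out.println(new String(value));
  }
}
}}}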
=== 5/28 ===
* Studied the map-reduce source code (a word-count sketch in the old "mapred" API follows below)
* From Hadoop 0.16.2 on, HBase was split off into an independent project, so /hadoop/src/java/org/apache/hadoop no longer contains an hbase folder (import org.apache.hadoop.hbase.* will fail)
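As a companion to the notes above, this is the classic word-count job written against the pre-0.20 "mapred" API; it mirrors the shape of the stock Hadoop example rather than any project-specific code, and the input/output paths are taken from the command line:
{{{
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

  // Emits (word, 1) for every token in each input line
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Sums the counts emitted for each word
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    // args[0] = input dir on HDFS, args[1] = output dir (must not exist yet)
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
}}}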

=== 5/27 ===
* Completed the Hadoop technical document "How to Coding Hadoop with Eclipse" [http://trac.nchc.org.tw/cloud/browser/hadoop-eclipse.odt (odt format)] or [http://trac.nchc.org.tw/cloud/browser/hadoop-eclipse.pdf (pdf format)]

=== 5/26 ===
* Hadoop technical document "How to Coding Hadoop with Eclipse and svn" [http://trac.nchc.org.tw/cloud/browser/hadoop-eclipse_svn.odt (odt format)]

=== 5/23 ===
* Document in progress
* Meeting
=== 5/22 ===
* Successfully compiled Hadoop in Eclipse and solved yesterday's problems
* Fix for error 5 ==> Window > Preferences > Java > Compiler: set the compiler compliance level to 5.0 (leaves 9 warnings)
* Fix for error 7.1 ==> add a new MapReduce server location > Server name: anything, Hostname: localhost, Installation directory: /home/waue/workspace/nutch/, Username: waue
* Fix for error 7.2 ==> before doing 7.1, the Hadoop filesystem actually has to be started up first, and the sample file put into it, e.g. bin/hadoop dfs -put 132.txt test
* Fix for error 8 ==> open umd-hadoop-core > src > edu.umd.cloud9.demo > DemoWordCount.java, edit the code (e.g. String filename = "/user/waue/test/132.txt";), then right-click Run as ... > choose the Hadoop file system set up earlier > confirm on the console that the map-reduce job is running
* Installed the IBM MapReduce tool
1. Download MapReduce_Tools.zip
2. Close Eclipse -> unzip MapReduce_Tools.zip into /usr/lib/eclipse/plugins/
* Using the IBM MapReduce tool
* Restart Eclipse -> choose File > New > Project -> there is now a MapReduce category
* Tutorial: Help -> Cheat Sheets -> MapReduce -> Write a MapReduce application
=== 5/21 ===
* Followed this article to implement a map-reduce program: [http://www.umiacs.umd.edu/~jimmylin/cloud9/umd-hadoop-dist/cloud9-docs/howto/start.html cloud 9]. My steps are recorded below:
1. Eclipse > Preferences > select option Team > SVN. Change SVN interface to "SVNKit".
2. Right-click in the left panel > New > Repository Location.
* umd-hadoop-dist: https://subversion.umiacs.umd.edu/umd-hadoop/dist
* umd-hadoop-core: https://subversion.umiacs.umd.edu/umd-hadoop/core
3. Right-click on trunk > Checkout... Follow the dialog to check out the repository.
* Note: Subclipse is quite resource-hungry, so Eclipse should be started with extra heap, e.g. "eclipse -vmargs -Xmx512m", to avoid an out-of-memory error
4. Switch back to the Java perspective; there are two new projects: umd-hadoop-core and umd-hadoop-dist.
5. Select menu option: Project > Clean... (stuck: over nine hundred errors occurred)
6. To enable the MapReduce Servers window go to: Window > Show View > Other... > MapReduce Tools > MapReduce Servers
7.1 At the top right edge of the tab, you should see two little blue elephant icons. The one on the right allows you to add a new MapReduce server location. The hostname should be the IP address of the controller. You want to enable "Tunnel Connections" and put in the IP address of the gateway. (only one elephant icon was visible)
7.2 At this point, you should now have access to DFS. It should show up under a little elephant icon in the Project Explorer (on the left side of Eclipse). You can now browse the directory tree. Your home directory should be /user/your_username. A sample collection consisting of the Bible and Shakespeare's works has been preloaded on the cluster, stored at /shared/sample-input. (stuck)
8. Find edu.umd.cloud9.demo.DemoWordCount in the Project Explorer (stuck: the file could not be found)

=== 5/20 ===
* Tools that help when building Hadoop with Eclipse: 1. Subclipse (an SVN plugin for Eclipse) 2. IBM MapReduce toolkit (a Hadoop application for Eclipse)
1. Install Subclipse
* Help -> Software Updates -> Find and Install... -> Search for new features... -> New remote site -> Name: subclipse, Site: http://subclipse.tigris.org/update
* A software-dependency problem comes up, so I added both the old and the new update sites and checked both for installation; that way the install completes
* Window -> Show View -> Other... -> SVN -> SVN Repository -> new site: http://svn.apache.org/repos/asf/hadoop/core/
* Right-click menu -> Checkout... -> click Finish to complete
* If the problem "Javahl interface is not available" appears, fix it as follows:
1. sudo apt-get install libsvn-javahl libsvn-dev
2. sudo ln -s /usr/lib/jni/libsvnjavahl-1.so /usr/lib/jvm/java-6-sun/jre/lib/i386/client/libsvnjavahl-1.so

=== 5/19 ===
* Continued testing building Nutch in Eclipse; everything run there is a jar file that unpacks to class files, so how to modify the source code still needs to be studied
* The steps from 5/16 suddenly stopped working today; the fix was as follows:
* ssh localhost must work without a password
* Check the configuration files, e.g. hadoop-env.sh, nutch-site.xml ...
* If "connect localhost:9000 failed" appears => 1. hadoop namenode -format 2. start-all.sh 3. hadoop dfs -put urls urls, then execute Run again
=== 5/16 ===
* Thanks to Sunni for the guidance: Nutch now builds successfully in Eclipse
1. File ==> New ==> Project ==> Java Project ==> Next ==> Project name (set to nutch0.9) ==> Contents ==> Create project from existing source (choose the path where Nutch is stored) ==> Finish.
2. At this point 366 errors appear. Apply the debugging method found on the web: put two jars ([http://nutch.cvs.sourceforge.net/*checkout*/nutch/nutch/src/plugin/parse-mp3/lib/jid3lib-0.5.1.jar jid3lib-0.5.1.jar] and [http://nutch.cvs.sourceforge.net/*checkout*/nutch/nutch/src/plugin/parse-rtf/lib/rtf-parser.jar rtf-parser.jar]) into the nutch-0.9 lib folder. In Eclipse, right-click nutch0.9 ==> Properties ==> Java Build Path ==> Libraries ==> Add External JARs... ==> select the two jars just downloaded ==> OK
3. A pile of errors still remains at this point; the fix is: in Eclipse right-click nutch0.9 ==> Properties ==> Java Build Path ==> Source ==> remove all the folder-icon entries and add only nutch/conf
4. Now all the errors are cleared. Next, edit nutch-site.xml, crawl-urlfilter.txt, hadoop-site.xml, and hadoop-env.sh under nutch/conf, add urls/urls.txt under nutch/, and write the URLs to crawl into urls.txt
5. Menu Run > "Run..." ==> create "New" for "Java Application"
* Set Main class = org.apache.nutch.crawl.Crawl
* On the Arguments tab:
* Program Arguments = urls -dir crawl -depth 3 -topN 50
* VM arguments: -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
* Click "Run" (an equivalent plain-Java launcher is sketched below)
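The same launch can also be driven from a small Java main() that mirrors the Eclipse dialog settings above. The RunCrawl wrapper class is hypothetical, but org.apache.nutch.crawl.Crawl is Nutch's own command-line entry point:
{{{
public class RunCrawl {
  public static void main(String[] args) throws Exception {
    // Equivalent of the "VM arguments" in the Eclipse launch dialog
    System.setProperty("hadoop.log.dir", "logs");
    System.setProperty("hadoop.log.file", "hadoop.log");
    // Equivalent of the "Program Arguments"; delegates to Nutch's entry point
    org.apache.nutch.crawl.Crawl.main(
        new String[] { "urls", "-dir", "crawl", "-depth", "3", "-topN", "50" });
  }
}
}}}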

=== 5/15 ===
* Building Nutch in Eclipse
* Helped F. P. Lin apply for an NCHC CA certificate to join the PRAGMA grid
=== 5/14 ===
* Added Nutch to Eclipse for building, but there are errors
0. Configure Nutch
1. File > New > Project > "Java project" > click Next
2. Name the project nutch
3. Select "Create project from existing source" and use the location where you downloaded Nutch
4. Click on Next, and wait while Eclipse scans the folders
5. On the Libraries tab (the third tab): Add Class Folder -> "conf"
6. Eclipse should have guessed all the Java files that must be added to your classpath. If it has not, add "src/java", "src/test", and all plugin "src/java" and "src/test" folders to your source folders. Also add all jars in "lib" and in the plugin lib folders to your libraries
7. Set the output dir to "tmp_build", creating it if necessary

=== 5/13 ===
1. Check and correct the CPS on Trac against the [apgrid-approved] version that Weicheng supplied
1. Paste the Word file contents into a text file, ori.txt
2. Paste the Trac contents into a text file, new.txt
3. Remove the formatting differences (e.g. "\n\n" -> " \n " (double line break -> single), "._", "___*_", ":_\n")
4. vimdiff new.txt ori.txt
5. The result of the check: http://trac.nchc.org.tw/gocca/wiki/CPSnew?action=diff&version=27&old_version=26

=== 5/12 ===
1. Published the [http://trac.nchc.org.tw/gocca/wiki/CPSnew nchc cp/cps v1.1.3] (apgrid-approved) version on the CA website and on Trac
The main difficulty: the version finally confirmed and approved by Weicheng is a Word file, which has to be converted into HTML and wiki format. The workflow:
1. CPS 1.1.3 in doc format -> new.txt text file
2. CPS 1.1.0 in html format on Trac -> old.txt text file
3. vimdiff new.txt old.txt to inspect the differences, then update the old CPS 1.1.0 content on Trac to the new version
4. Save the new CPS 1.1.3 from Trac to local disk and edit it with KompoZer into the original CA website format
5. Upload it to replace the old version
=== 5/8 ===
1. For security, restrict the IPs allowed to browse Nutch: edit Tomcat's conf/server.xml and add
{{{
<Context path="/path/to/secret_files" ...>
  <Valve className="org.apache.catalina.valves.RemoteAddrValve"
         allow="127.0.0.1" deny=""/>
</Context>
}}}

2. Tomcat tuning guides:
[http://www.oreilly.com.tw/column_editor.php?id=e137 Chinese] and [http://www.onjava.com/lpt/a/3909 English]

=== 5/7 ===
1. Nutch now runs successfully on the management-regulations area and parses PDF and Word content into the index. The fix is to add the following to nutch-site.xml:

{{{
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msword|rss|rtf|oo|msexcel|mspowerpoint)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>...
  </description>
</property>
}}}

The names inside parse-(text|html|js|pdf|msword|rss|rtf|oo|msexcel|mspowerpoint) must each correspond to the name of a parse-XXX plugin directory under plugins (for example, the PowerPoint plugin directory is parse-mspowerpoint, so the token inside the parentheses is mspowerpoint)

=== 5/5 ===
1. Nutch runs successfully on the management-regulations area, but the content does not include PDF, Word, ...