= May Work Plan =
== Planned Goals ==
* [done 5/04] Check the CPS for errors
* [done 5/22] Hadoop example tutorial -> word count
* [done 5/24] Building the Nutch/Hadoop project in Eclipse
* [done 5/27] "Programming MapReduce with Eclipse" document
* [new 5/28] MapReduce programming
* [new] Nutch application example -> indexing data on a hard disk
== Work Log ==
=== 5/29 ===
* Created the tw.org.nchc package for Java imports
* Studied the Hadoop 0.16.4 source code (Hadoop 0.17.0 has been released; some APIs have changed)
* Implemented an HBase example; steps as follows:
1. [http://ftp.twaren.net/Unix/Web/apache/hadoop/hbase/hbase-0.1.2/ Download the source] and unpack it
2. Configure conf/hbase-site.xml:
{{{
<configuration>

  <property>
    <name>hbase.master</name>
    <value>example.org:60000</value>
    <description>The host and port that the HBase master runs at.
    </description>
  </property>

  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://example.org:9000/hbase</value>
    <description>The directory shared by region servers.
    </description>
  </property>

</configuration>
}}}
3. Configure conf/hbase-env.sh:
{{{
# path to the JDK
JAVA_HOME=/usr/lib/jvm/java-6-sun
...
# put the Hadoop conf/ directory on the classpath
CLASSPATH=/home/waue/workspace/hadoop/conf
}}}
4. List the hosts in conf/regionservers (and conf/slaves for Hadoop)
5. Run bin/start-hbase.sh (make sure HDFS is already up before running this)
* HBase itself is running, but the sample code still throws errors; a minimal sketch of the kind of client code being tried is below
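A rough reconstruction of the client code under test, written from memory against the old 0.1-era HBase API; the table name "test", the column family "data:", and the exact 0.1.2 signatures are assumptions, not verified:
{{{
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTable;
import org.apache.hadoop.io.Text;

public class HBaseSketch {
  public static void main(String[] args) throws Exception {
    // Picks up hbase-site.xml (hbase.master, hbase.rootdir) from the classpath.
    HBaseConfiguration conf = new HBaseConfiguration();
    // Assumes a table "test" with a column family "data:" already exists.
    HTable table = new HTable(conf, new Text("test"));
    long lockid = table.startUpdate(new Text("row1"));  // begin a row update
    table.put(lockid, new Text("data:greeting"), "hello".getBytes());
    table.commit(lockid);                               // commit the row atomically
  }
}
}}}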
=== 5/28 ===
* Studied MapReduce code (a minimal word-count sketch against the current API follows below)
* Versions of Hadoop after 0.16.2 split HBase out into an independent project, so /hadoop/src/java/org/apache/hadoop no longer contains an hbase folder (import org.apache.hadoop.hbase.* will fail)
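For reference, a minimal word-count sketch against the 0.16-era org.apache.hadoop.mapred API; the command-line input/output paths are assumptions, and (as far as I can tell) the path setters are among the things that move to FileInputFormat/FileOutputFormat in the 0.17 API changes noted above:
{{{
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input line.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer: sum the counts collected for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) sum += values.next().get();
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputPath(new Path(args[0]));   // moved to FileInputFormat in later APIs
    conf.setOutputPath(new Path(args[1]));  // moved to FileOutputFormat in later APIs
    JobClient.runJob(conf);
  }
}
}}}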

=== 5/27 ===
* Finished the Hadoop technical document "How to Code Hadoop with Eclipse": [http://trac.nchc.org.tw/cloud/browser/hadoop-eclipse.odt (odt format)] or [http://trac.nchc.org.tw/cloud/browser/hadoop-eclipse.pdf (pdf format)]

=== 5/26 ===
* Hadoop technical document "How to Code Hadoop with Eclipse and SVN": [http://trac.nchc.org.tw/cloud/browser/hadoop-eclipse_svn.odt (odt format)]

=== 5/23 ===
* Working on the document
* Meeting
=== 5/22 ===
* Successfully built Hadoop in Eclipse and solved yesterday's problems
* Fix for error 5 ==> Window > Preferences > Java > Compiler: set the compiler compliance level to 5.0 (down to 9 warnings)
* Fix for error 7.1 ==> add a new MapReduce server location > Server name: anything; Hostname: localhost; Installation directory: /home/waue/workspace/nutch/; Username: waue
* Fix for error 7.2 ==> before doing 7.1, the Hadoop filesystem must first be started up, and the sample file must be put into it, e.g. bin/hadoop dfs -put 132.txt test (a programmatic equivalent is sketched at the end of this entry)
* Fix for error 8 ==> open umd-hadoop-core > src > edu.umd.cloud9.demo > DemoWordCount.java, edit the code (e.g. String filename = "/user/waue/test/132.txt";), then right-click Run As... > choose the Hadoop file system set up earlier > the console shows the MapReduce job running
* Installed the IBM MapReduce tool
1. Download MapReduce_Tools.zip
2. Close Eclipse -> unzip the MapReduce Tools zip into /usr/lib/eclipse/plugins/
* Using the IBM MapReduce tool
* Restart Eclipse -> select File > New > Project -> there is a MapReduce category
* Tutorial: Help -> Cheat Sheets -> MapReduce -> Write a MapReduce application
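For the 7.2 fix above, a rough programmatic equivalent of bin/hadoop dfs -put 132.txt test using the FileSystem API; a sketch that assumes the Hadoop conf/ directory (with fs.default.name pointing at the running HDFS) is on the classpath:
{{{
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DfsPut {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // reads hadoop-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);            // connect to the configured filesystem
    fs.copyFromLocalFile(new Path("132.txt"),        // local source file
                         new Path("test/132.txt"));  // destination inside HDFS
  }
}
}}}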
=== 5/21 ===
* Implemented a MapReduce program by following this article: [http://www.umiacs.umd.edu/~jimmylin/cloud9/umd-hadoop-dist/cloud9-docs/howto/start.html cloud 9]. My steps are recorded below:
1. Eclipse > Preferences: select Team > SVN and change the SVN interface to "SVNKit".
2. Right-click in the left panel > New > Repository Location:
* umd-hadoop-dist: https://subversion.umiacs.umd.edu/umd-hadoop/dist
* umd-hadoop-core: https://subversion.umiacs.umd.edu/umd-hadoop/core
3. Right-click on trunk > Checkout... and follow the dialog to check out the repository.
* Note: Subclipse is quite resource-hungry, so start Eclipse with extra heap, e.g. "eclipse -vmargs -Xmx512m", to avoid an out-of-memory error.
4. Switch back to the Java perspective; there are two new projects: umd-hadoop-core and umd-hadoop-dist.
5. Select menu option: Project > Clean... (stuck: over nine hundred errors occurred)
6. To enable the MapReduce Servers window, go to: Window > Show View > Other... > MapReduce Tools > MapReduce Servers
7.1 At the top right edge of the tab, you should see two little blue elephant icons. The one on the right allows you to add a new MapReduce server location. The hostname should be the IP address of the controller. You want to enable "Tunnel Connections" and put in the IP address of the gateway. (I only see one elephant)
7.2 At this point, you should now have access to DFS. It should show up under a little elephant icon in the Project Explorer (on the left side of Eclipse). You can now browse the directory tree. Your home directory should be /user/your_username. A sample collection consisting of the Bible and Shakespeare's works has been preloaded on the cluster, stored at /shared/sample-input. (stuck)
8. Find edu.umd.cloud9.demo.DemoWordCount in the Project Explorer (stuck: cannot find this file)

=== 5/20 ===
* To build Hadoop in Eclipse, some tools help: 1. Subclipse (an SVN plugin for Eclipse) 2. IBM MapReduce toolkit (Hadoop development tooling for Eclipse)
1. Install Subclipse
* Help -> Software Updates -> Find and Install... -> Search for new features... -> New Remote Site -> Name: subclipse, URL: http://subclipse.tigris.org/update
* Dependency problems come up, so I added both the old and the new update sites and checked both for installation; that way the install completes
* Window -> Show View -> Other... -> SVN -> SVN Repository -> new site: http://svn.apache.org/repos/asf/hadoop/core/
* Right-click menu -> Checkout... -> click Finish to complete
* If the "Problem: Javahl interface is not available" error appears, fix it as follows:
1. sudo apt-get install libsvn-javahl libsvn-dev
2. sudo ln -s /usr/lib/jni/libsvnjavahl-1.so /usr/lib/jvm/java-6-sun/jre/lib/i386/client/libsvnjavahl-1.so

=== 5/19 ===
* Continued testing the Nutch build in Eclipse; found that what actually runs there are jar files, which contain class files when unpacked, so how to modify the source code still needs investigation
* The 5/16 steps suddenly stopped working today; the eventual fix:
* ssh localhost must work without a password
* check the config files, e.g. hadoop-env.sh, nutch-site.xml, ...
* if "connect localhost:9000 failed" appears => 1. hadoop namenode -format 2. start-all.sh 3. hadoop dfs -put urls urls, then run again
=== 5/16 ===
* Thanks to sunni for pointing the way; Nutch now builds successfully in Eclipse
1. File ==> New ==> Project ==> Java Project ==> Next ==> Project name (set to nutch0.9) ==> Contents ==> Create project from existing source (select the path where Nutch is stored) ==> Finish.
2. At this point 366 errors appear. Using the fix found online: put two jars ([http://nutch.cvs.sourceforge.net/*checkout*/nutch/nutch/src/plugin/parse-mp3/lib/jid3lib-0.5.1.jar jid3lib-0.5.1.jar] and [http://nutch.cvs.sourceforge.net/*checkout*/nutch/nutch/src/plugin/parse-rtf/lib/rtf-parser.jar rtf-parser.jar]) into the nutch-0.9 lib folder. In Eclipse, right-click nutch0.9 ==> Properties ==> Java Build Path ==> Libraries ==> Add External JARs... ==> select the two jars just downloaded ==> OK
3. A pile of errors still remains; the fix: in Eclipse, right-click nutch0.9 ==> Properties ==> Java Build Path ==> Source ==> delete all the folder entries and add only nutch/conf
4. All the errors are now gone. Next, edit nutch-site.xml, crawl-urlfilter.txt, hadoop-site.xml, and hadoop-env.sh under nutch/conf, add urls/urls.txt under nutch/, and write the URLs to crawl into urls.txt
5. Menu Run > "Run..." ==> create a "New" configuration under "Java Application"
* set Main class = org.apache.nutch.crawl.Crawl
* on the Arguments tab:
* Program Arguments = urls -dir crawl -depth 3 -topN 50
* VM arguments: -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
* click "Run" (an equivalent plain-Java launcher is sketched below)
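The same run configuration can also be expressed as a tiny launcher class; a sketch that assumes nutch-0.9 and its conf/ directory are on the classpath, mirroring the VM and program arguments above:
{{{
// Mirrors the Eclipse run configuration: same system properties, same arguments.
public class CrawlLauncher {
  public static void main(String[] args) throws Exception {
    System.setProperty("hadoop.log.dir", "logs");
    System.setProperty("hadoop.log.file", "hadoop.log");
    org.apache.nutch.crawl.Crawl.main(
        new String[] { "urls", "-dir", "crawl", "-depth", "3", "-topN", "50" });
  }
}
}}}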

=== 5/15 ===
* Building Nutch in Eclipse
* Helped F. P. Lin apply for an NCHC CA certificate to join the PRAGMA grid

=== 5/14 ===
* Added Nutch to Eclipse to build it, but there are errors
0. Set up Nutch
1. File > New > Project > "Java Project" > click Next
2. Name the project nutch
3. Select "Create project from existing source" and use the location where you downloaded Nutch
4. Click on Next, and wait while Eclipse scans the folders
5. On the Libraries tab (the third tab), Add Class Folder -> "conf"
6. Eclipse should have guessed all the Java files that must be added to your classpath. If that is not the case, add "src/java", "src/test" and all plugin "src/java" and "src/test" folders to your source folders. Also add all jars in "lib" and in the plugin lib folders to your libraries
7. Set the output dir to "tmp_build", creating it if necessary

=== 5/13 ===
1. Checked and fixed the CPS on trac against the [apgrid-approved] version from Weicheng
1. Paste the Word file content into a text file, ori.txt
2. Paste the trac content into a text file, new.txt
3. Remove formatting differences (e.g. "\n\n" -> " \n " (two line breaks -> one), "._", "___*_", ":_\n")
4. vimdiff new.txt ori.txt
5. Result of the check: http://trac.nchc.org.tw/gocca/wiki/CPSnew?action=diff&version=27&old_version=26

=== 5/12 ===
1. Published the [http://trac.nchc.org.tw/gocca/wiki/CPSnew nchc cp/cps v.1.13] (apgrid-approved) version on the CA website and on trac
The main hassle: the version Weicheng finally checked and approved is a Word file, which then has to be converted into HTML and wiki formats. The workflow:
1. CPS 1.1.3 in doc format -> new.txt text file
2. CPS 1.1.0 in html format on trac -> old.txt text file
3. vimdiff new.txt old.txt to inspect the differences, and update the old CPS 1.1.0 content on trac to the new version
4. Save the new CPS 1.1.3 from trac to local disk and edit it with KompoZer into the original CA website format
5. Upload it to replace the old version
=== 5/8 ===
1. For security, restricted the IPs that can browse the Nutch site by editing Tomcat's conf/server.xml and adding:
{{{

<Context path="/path/to/secret_files" ...>
  <Valve className="org.apache.catalina.valves.RemoteAddrValve"
         allow="127.0.0.1" deny=""/>
</Context>
}}}

2. Tomcat tuning guides:
[http://www.oreilly.com.tw/column_editor.php?id=e137 Chinese], [http://www.onjava.com/lpt/a/3909 English]

=== 5/7 ===
1. Nutch now runs successfully on the management regulations site and also parses PDF and Word content; the fix is to add the following to nutch-site.xml:

{{{
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf|msword|rss|rtf|oo|msexcel|mspowerpoint)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>...
  </description>
</property>
}}}

The names inside parse-(text|html|js|pdf|msword|rss|rtf|oo|msexcel|mspowerpoint) must match the parse-XXX plugin folder names under plugins/; note it is mspowerpoint, not parse-mspowerpoint, since the outer parse-(...) already supplies the prefix.

=== 5/5 ===
1. Nutch runs successfully on the management regulations site, but the indexed content does not include pdf, word, ...