nutch_install

Timestamp:: Apr 23, 2009, 7:16:31 PM (16 years ago)
Author:: waue
Comment:: --

waue/2009/nutch_install

-                      v5
+                      v6
 || /opt/nutch_conf || nutch configuration files ||
 || /opt/hadoop || hadoop home directory ||
 || /etc/hadoop/conf || hadoop configuration files ||
+|| /opt/hadoop/conf || hadoop configuration files ||
 …
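 You can install hadoop with the method from Exercise 1; to keep the Hadoop installation simple, however, this page uses the most minimal approach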
 {{{
 ~$ sudo su -
 ~# echo "deb http://free.nchc.org.tw/debian lenny non-free" > /etc/apt/sources.list.d/lenny-nonfree.list
 ~# echo "deb http://www.classcloud.org unstable main" > /etc/apt/sources.list.d/hadoop.list
 ~# apt-get update
 ~# apt-get install hadoop
+(omitted .. confirm the Java license prompt .. )
 ~# chown -R hadooper /opt/hadoop
+~$ cd /opt
+~$ wget http://hadoop.nchc.org.tw/~waue/hadoop_nchc.tar.gz
+~$ tar -zxvf hadoop_nchc.tar.gz
+~$ chown -R hadooper hadoop
+~$ cd /opt/hadoop
+~$ bin/hadoop namenode -format
+~$ bin/start-all.sh
 }}}
 …
  == 2.2 Deploying the hadoop and nutch directory structure ==
 {{{
-$ mv nutch/conf ./nutch_conf
-$ cp -rf conf/* nutch_conf
 $ cp -rf hadoop/* nutch
+}}}
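+ * After these steps, nutch's configuration files are placed under /opt/nutch_conf and the existing hadoop settings (/opt/conf) are carried into nutch's configuration. The hadoop executables inside nutch_home are also the same version as the hadoop that is currently running.
+ * This layout keeps nutch separate from hadoop, and programs separate from configuration files (log files are all written under /tmp). The point is that removing nutch only requires deleting its directory, without touching the original hadoop.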
+$ cd nutch
+}}}
 = step 3 Edit the configuration files =
  * All configuration files are under /opt/nutch_conf
+ * All configuration files are under /opt/nutch/conf
 == 3.1 hadoop-env.sh ==
  * Add the following lines anywhere in the existing hadoop-env.sh file
 …
 export HADOOP_PID_DIR=/tmp/hadoop/pid
 export NUTCH_HOME=/opt/nutch
 export NUTCH_CONF_DIR=/opt/nutch_conf
+export NUTCH_CONF_DIR=/opt/nutch/conf
 }}}
  * Load the environment settings
 {{{
 $ source /opt/nutch_conf/hadoop-env.sh
+$ source /opt/nutch/conf/hadoop-env.sh
 }}}
  * PS: it is strongly recommended to also add these settings to /etc/bash.bashrc, so they are loaded in every shell.
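One way to follow the /etc/bash.bashrc advice without adding the same line twice is sketched below. `add_line_once` is a hypothetical helper written for this example, not something shipped with hadoop or nutch, and the demo runs against a scratch file rather than the real /etc/bash.bashrc.

```shell
#!/bin/sh
# Sketch: append a line to a startup file only if it is not already present.
# add_line_once is a hypothetical helper, not part of hadoop or nutch.
add_line_once() {
    line=$1
    file=$2
    # -F fixed string, -x whole-line match, -q quiet
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demonstrate on a scratch file instead of the real /etc/bash.bashrc.
demo_rc=$(mktemp)
add_line_once 'source /opt/nutch/conf/hadoop-env.sh' "$demo_rc"
add_line_once 'source /opt/nutch/conf/hadoop-env.sh' "$demo_rc"   # second call is a no-op
rc_lines=$(wc -l < "$demo_rc")
```

To apply it for real, the second argument would be /etc/bash.bashrc and the script would need root privileges.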
+== 3.2 hadoop-site.xml ==
+{{{
+#!sh
+<configuration>
+  <property>
+    <name>fs.default.name</name>
+    <value>hdfs://node1:9000/</value>
+    <description> </description>
+  </property>
+  <property>
+    <name>mapred.job.tracker</name>
+    <value>node1:9001</value>
+    <description>  </description>
+  </property>
+  <property>
+    <name>hadoop.tmp.dir</name>
+    <value>/tmp/hadoop/hadoop-${user.name}</value>
+    <description> </description>
+  </property>
+</configuration>
+}}}
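Since nutch has to talk to the hadoop that is already running, it can help to read the values back out of the file and compare them with the cluster's own configuration. The following is a rough sketch only: `get_prop` is a hypothetical helper, and it assumes each `<name>` and `<value>` sits on its own line, as in the listing above. It is not a real XML parser.

```shell
#!/bin/sh
# Sketch: print the <value> that follows a given <name> in a hadoop-site.xml
# style file. get_prop is a hypothetical helper; it assumes one tag per line.
get_prop() {
    awk -v want="$2" '
        /<name>/  { n = $0; sub(/.*<name>/, "", n); sub(/<\/name>.*/, "", n) }
        /<value>/ { v = $0; sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
                    if (n == want) { print v; exit } }
    ' "$1"
}

# Scratch copy holding two of the properties from the listing above.
demo_site=$(mktemp)
cat > "$demo_site" <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://node1:9000/</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>node1:9001</value>
  </property>
</configuration>
EOF

get_prop "$demo_site" fs.default.name      # prints hdfs://node1:9000/
```

Running the same function against /opt/nutch/conf/hadoop-site.xml and the hadoop copy of the file should give identical values if the two are in sync.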
+== 3.3 nutch-site.xml ==
+== 3.3 conf/nutch-site.xml ==
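  * This is an important configuration file; the required settings have been added to it. For more information on the available parameters, see nutch-default.xml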
+{{{
+$ vim conf/nutch-site.xml
+}}}
 {{{
 #!sh
 …
 <property>
   <name>http.agent.url</name>
   <value>node1</value>
+  <value>localhost</value>
   <description>A URL to advertise in the User-Agent header. </description>
 </property>
 …
 </configuration>
 }}}
+== 3.4 slaves ==
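+ * This file needs no changes, because it follows the existing hadoop cluster environment. The contents used in our environment are listed below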
+{{{
+#!sh
+node1
+node2
+}}}
 == 3.5 crawl-urlfilter.txt ==
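  * Re-edit the crawl rules. This file matters because a bad rule set leaves the crawl results almost empty, which means your search engine will end up finding no data.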
+{{{
+$ vim conf/crawl-urlfilter.txt
+}}}
 {{{
 #!sh
 …
 }}}
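To see why a bad rule set yields an empty crawl, it may help to emulate how these filter files are read. As the comments in Nutch's stock filter files describe it, each rule is a regular expression prefixed with `+` or `-`, the first matching rule decides, and a URL that matches nothing is dropped. The sketch below is only an approximation: `url_allowed` is a hypothetical helper, the three rules are illustrative and are not the elided contents above, and `grep -E` stands in for the Java regular expressions Nutch actually uses.

```shell
#!/bin/sh
# Sketch: first-match-wins URL filtering, approximated with grep -E.
# url_allowed is a hypothetical helper, not a nutch command.
url_allowed() {
    rules=$1
    url=$2
    while IFS= read -r rule; do
        case $rule in
            ''|'#'*) continue ;;       # skip blank lines and comments
            '+'*)    verdict=0 ;;      # accept rule
            '-'*)    verdict=1 ;;      # reject rule
            *)       continue ;;
        esac
        pattern=${rule#?}              # drop the leading + or -
        if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
            return "$verdict"          # the first matching rule decides
        fi
    done < "$rules"
    return 1                           # no rule matched: URL is ignored
}

# Illustrative rules only (assumed for this example).
demo_rules=$(mktemp)
cat > "$demo_rules" <<'EOF'
# skip image suffixes
-\.(gif|jpg|png)$
# accept hosts under nchc.org.tw
+^http://([a-z0-9]*\.)*nchc\.org\.tw/
# skip everything else
-.
EOF

for u in http://www.nchc.org.tw/ http://www.nchc.org.tw/logo.gif http://example.com/; do
    if url_allowed "$demo_rules" "$u"; then echo "keep $u"; else echo "skip $u"; fi
done
```

With these rules only the first URL is kept. If the `+` line pointed at the wrong domain, every URL would fall through to `-.`, which is the empty-result situation described above.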
+== 3.6 regex-urlfilter.txt ==
+ * Although the official website rarely covers this file, crawl-urlfilter.txt sets the rules for crawling an intranet, while regex-urlfilter.txt sets the rules for crawling the internet
+{{{
+$ cd /opt/nutch_conf
+$ cp regex-urlfilter.txt regex-urlfilter.txt-bek
+$ cp crawl-urlfilter.txt regex-urlfilter.txt
+}}}
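 = step 4 Run nutch =
- * This assumes hadoop is already started and running, so nutch runs on top of that platform
- * If hadoop is not started yet, run bin/start-all.sh on the master node (node1 is the master in this article); if your environment is a clean one, run the following on the master node
-   * from either /opt/nutch or /opt/hadoop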
-{{{
-$ cd /opt/nutch
-$ bin/hadoop namenode -format
-$ bin/start-all.sh
-}}}
 == 4.1 Edit the url list ==
 {{{
 $ mkdir urls
+$ vim urls.txt
+}}}
+{{{
+#!sh
+http://www.nchc.org.tw
+$ echo "http://www.nchc.org.tw" >> ./urls/urls.txt
 }}}
 == 4.2 Upload the list to HDFS ==
 {{{
 $ bin/hadoop -put urls urls
+$ bin/hadoop dfs -put urls urls
 }}}
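Steps 4.1 and 4.2 can be combined into one small script, sketched below under a few assumptions: `work_dir` is a scratch directory chosen for this example, the sanity check on the seed lines is an addition of this sketch, and the upload only runs when the script is started from a directory that contains bin/hadoop (for example /opt/nutch).

```shell
#!/bin/sh
# Sketch: build the seed list locally, then upload it if hadoop is available.
# work_dir is a scratch location chosen for this example only.
work_dir=$(mktemp -d)
mkdir -p "$work_dir/urls"
printf '%s\n' "http://www.nchc.org.tw" > "$work_dir/urls/urls.txt"

# Count non-empty lines that do not look like http(s) URLs; 0 is expected.
bad_lines=$(grep -v '^$' "$work_dir/urls/urls.txt" | grep -cvE '^https?://' || true)

# Upload only when bin/hadoop exists in the current directory.
if [ "$bad_lines" -eq 0 ] && [ -x bin/hadoop ]; then
    bin/hadoop dfs -put "$work_dir/urls" urls
fi
```

The `dfs -put` form is the one from the corrected line above; the variant without `dfs` is not a valid hadoop subcommand.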
 == 4.3 Run nutch crawl ==