wiki:waue/2009/0406

Version 5 (modified by waue, 17 years ago)

Nutch Installation Test

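Preface

I previously installed nutch (version 0.9) and ran it successfully on four hosts. Since I may need it for an upcoming class, I went through the installation again.
- Site: nutch
This test differs from the previous one in two ways:
1. The version is newer (nutch 1.0).
2. Last time I installed nutch directly in an empty environment, with no existing hadoop setup, so the directory layout followed the instructions on the nutch website. This time the goal was to run nutch on top of an existing hadoop installation. That attempt failed, with error messages along the lines of dfs not being found.
So I went back to the original approach: setting up nutch in an empty environment, with the simplest possible settings throughout. The steps are as follows: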

step 1 Passwordless login

This is the most basic step, so I won't go into how to do it.
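
For readers who have not set this up before, here is a minimal sketch of one common way to do it for localhost. It assumes OpenSSH is installed and that an RSA key without a passphrase is acceptable on a test machine; neither assumption comes from the original notes.

```shell
# Sketch only: passwordless SSH login to the local machine.
# Assumptions: OpenSSH is installed; an empty passphrase is acceptable here.
SSH_DIR="$HOME/.ssh"
KEY="$SSH_DIR/id_rsa"

mkdir -p "$SSH_DIR"
chmod 700 "$SSH_DIR"

# Generate a key pair only if none exists yet.
if [ ! -f "$KEY" ]; then
    ssh-keygen -q -t rsa -N "" -f "$KEY"
fi

# Authorize the public key once; skip it if it is already listed.
touch "$SSH_DIR/authorized_keys"
if ! grep -qxF "$(cat "$KEY.pub")" "$SSH_DIR/authorized_keys"; then
    cat "$KEY.pub" >> "$SSH_DIR/authorized_keys"
fi
chmod 600 "$SSH_DIR/authorized_keys"
```

Afterwards, ssh localhost should log in without asking for a password.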

step 2 Download and install

Download and install Java 1.6
```
$ sudo apt-get install sun-java6-bin
```

Download nutch 1.0 (2009/03/28)

$ wget http://ftp.twaren.net/Unix/Web/apache/lucene/nutch/nutch-1.0.tar.gz
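
The download leaves the tarball in the current directory, while the settings in step 3 point at /opt/nutch. A minimal sketch of unpacking it there follows. It assumes the archive has a single top-level directory (such as nutch-1.0/) and that your tar supports --strip-components; install_nutch is a hypothetical helper name, not part of nutch.

```shell
# Hypothetical helper: unpack a Nutch tarball into <prefix>/nutch.
# Assumes the archive has exactly one top-level directory, which is stripped.
install_nutch() {
    tarball="$1"   # path to the downloaded .tar.gz
    prefix="$2"    # parent directory, e.g. /opt
    mkdir -p "$prefix/nutch" &&
    tar -xzf "$tarball" -C "$prefix/nutch" --strip-components=1
}
```

For example, install_nutch nutch-1.0.tar.gz /opt, run as a user who can write to /opt.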

step 3 Edit the configuration files

All configuration files are under $NUTCH_HOME/conf.

3.1 hadoop-env.sh

Insert the following lines anywhere in the existing hadoop-env.sh:

export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/opt/nutch
export HADOOP_LOG_DIR=/tmp/nutch/logs
export HADOOP_SLAVES=/opt/nutch/conf/slaves
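
If you would rather not edit the file by hand, the same four lines can be appended from the shell. This is only a sketch: append_hadoop_env is a hypothetical helper name, and it skips any line that is already present so it can be run more than once.

```shell
# Hypothetical helper: append the four exports to the given hadoop-env.sh,
# skipping any line that is already in the file.
append_hadoop_env() {
    env_file="$1"
    while IFS= read -r line; do
        grep -qxF "$line" "$env_file" 2>/dev/null ||
            printf '%s\n' "$line" >> "$env_file"
    done <<'EOF'
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/opt/nutch
export HADOOP_LOG_DIR=/tmp/nutch/logs
export HADOOP_SLAVES=/opt/nutch/conf/slaves
EOF
}
```

With the layout used here, the call would be append_hadoop_env /opt/nutch/conf/hadoop-env.sh.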

3.2 hadoop-site.xml

<configuration>
<property>
    <name>fs.default.name</name>
    <value>gm1.nchc.org.tw:9000</value>
    <description> The name of the default file system. Either the literal string "local" or a host:port for NDFS. </description>
</property>
<property>
    <name>mapred.job.tracker</name>
    <value>gm1.nchc.org.tw:9001</value>
    <description> The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task. </description>
</property>
</configuration>

3.3 nutch-site.xml

<configuration>
<property>
  <name>http.agent.name</name>
  <value>waue</value>
  <description>HTTP 'User-Agent' request header. </description>
</property>
<property>
  <name>http.agent.description</name>
  <value>MyTest</value>
  <description>Further description</description>
</property>
<property>
  <name>http.agent.url</name>
  <value>gm1.nchc.org.tw</value>
  <description>A URL to advertise in the User-Agent header. </description>
</property>
<property>
  <name>http.agent.email</name>
  <value>waue@nchc.org.tw</value>
  <description>An email address</description>
</property>
</configuration>
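
A crawl will typically fail early if http.agent.name is left empty, so it can be worth checking what the file actually contains before running anything. The sketch below is a rough check, not a real XML parser: conf_value is a hypothetical helper that assumes each <name> and <value> element sits on a single line, as in the listing above.

```shell
# Hypothetical helper: print the <value> of a named property in a
# Hadoop/Nutch style XML config file. Prints nothing if it is not found.
# Assumes each <name>...</name> and <value>...</value> is on one line.
conf_value() {
    # $1 = config file, $2 = property name
    awk -v want="$2" '
        /<name>/  { n = $0; sub(/.*<name>/, "", n); sub(/<\/name>.*/, "", n) }
        /<value>/ { v = $0; sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
                    if (n == want) { print v; exit } }
    ' "$1"
}
```

For example, conf_value conf/nutch-site.xml http.agent.name should print waue with the settings shown here.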

3.4 slaves

No change is actually needed, because the file already contains localhost:

localhost

3.5 crawl-urlfilter.txt

Change the following two rules in this file to read:

# skip URLs containing certain characters as probable queries, etc.
-[*!@]

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*.*/
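
The filter file is read top to bottom, and the first rule whose pattern is found in the URL decides: a leading - rejects the URL and a leading + accepts it. To get a feel for what the two edited rules let through, here is a small sketch that imitates just those two rules with grep -E. url_allowed is a hypothetical helper; it ignores the other rules in the file and does not apply the URL normalization nutch performs, so treat it as an approximation.

```shell
# Hypothetical helper: imitate the two edited rules, first match wins.
# Returns 0 (accepted) or 1 (rejected). Other rules in the file are ignored.
url_allowed() {
    url="$1"
    # Rule: -[*!@]  (reject URLs containing *, ! or @)
    if printf '%s\n' "$url" | grep -Eq '[*!@]'; then
        return 1
    fi
    # Rule: +^http://([a-z0-9]*\.)*.*/  (accept any http host)
    if printf '%s\n' "$url" | grep -Eq '^http://([a-z0-9]*\.)*.*/'; then
        return 0
    fi
    # No rule matched: the URL is dropped.
    return 1
}

for u in 'http://lucene.apache.org/' 'http://example.com/a@b' 'ftp://example.com/file'; do
    if url_allowed "$u"; then
        echo "accept $u"
    else
        echo "reject $u"
    fi
done
```

Since ? and = are not in the skip list above, URLs with query strings are accepted as well.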

step 4 Run

4.1 Edit the URL list

$ mkdir urls
$ vim urls/urls.txt

http://lucene.apache.org
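
The same seed list can also be created without an editor. This sketch assumes the list file is meant to live inside the urls directory, since that directory is what gets uploaded later.

```shell
# Sketch: create the seed list non-interactively.
# Assumption: the list belongs inside urls/, the directory uploaded in 4.3.
mkdir -p urls
printf '%s\n' 'http://lucene.apache.org' > urls/urls.txt
```

Add one URL per line if you want more than one starting point.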

4.2 Start HDFS

$ bin/hadoop namenode -format
$ bin/start-all.sh

4.3 Upload the list to HDFS

$ bin/hadoop dfs -put urls urls

4.4 Run nutch crawl

$ bin/nutch crawl urls -dir crawl01 -depth 3
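
In this command, urls is the seed directory in HDFS, -dir names the output directory, and -depth is the number of link levels to follow from the seeds. Once the job finishes, a quick look at the output is a reasonable sanity check. The sketch below wraps two such commands; verify_crawl is a hypothetical helper, and the HADOOP and NUTCH variables are assumptions that default to the relative paths used in these notes.

```shell
# Hypothetical helper: list the crawl output in HDFS and print crawldb stats.
# HADOOP and NUTCH default to the paths used in this page (run from $NUTCH_HOME).
HADOOP="${HADOOP:-bin/hadoop}"
NUTCH="${NUTCH:-bin/nutch}"

verify_crawl() {
    crawl_dir="$1"
    "$HADOOP" dfs -ls "$crawl_dir" &&
    "$NUTCH" readdb "$crawl_dir/crawldb" -stats
}
```

For the crawl above, the call would be verify_crawl crawl01. The listing should include a crawldb directory, and the stats give the number of URLs the crawl knows about.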

