[[PageOutline]]
{{{
#!html
The Complete Nutch Guide
}}}
= Introduction =
* Although this setup has been tested before and many people have shared successful experiences online, this guide focuses on:
  * Installing Nutch from start to finish, including fixing the garbled Chinese character (mojibake) problem
  * Setting up Nutch from a Hadoop point of view (running Nutch on top of Hadoop)
  * Building a search engine that indexes not only page content but also files linked from pages (e.g. PDF, MS Word)
= Environment =
* Directories
|| /opt/nutch || Nutch home directory ||
|| /opt/nutch/conf || Nutch configuration files ||
|| /opt/hadoop || Hadoop home directory ||
|| /opt/hadoop/conf || Hadoop configuration files ||
= Step 1: Install Hadoop =
Hadoop can be installed following the procedure from Lab 1.
* Run
{{{
~$ cd /opt
/opt$ sudo wget http://ftp.twaren.net/Unix/Web/apache/hadoop/core/hadoop-0.18.3/hadoop-0.18.3.tar.gz
/opt$ sudo tar zxvf hadoop-0.18.3.tar.gz
/opt$ sudo mv hadoop-0.18.3/ hadoop
/opt$ sudo chown -R hadooper:hadooper hadoop
/opt$ cd hadoop/
/opt/hadoop$ gedit conf/hadoop-env.sh
}}}
Add the following lines anywhere in the file:
{{{
#!sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=/opt/hadoop/conf
}}}
* Run
{{{
/opt/hadoop$ gedit conf/hadoop-site.xml
}}}
Replace the entire file with the following content:
{{{
#!xml
<configuration>
<property>
 <name>fs.default.name</name>
 <value>hdfs://localhost:9000</value>
 <description>default file system for NDFS</description>
</property>
<property>
 <name>mapred.job.tracker</name>
 <value>localhost:9001</value>
 <description>The host:port that job tracker runs at.</description>
</property>
</configuration>
}}}
* Next, run
{{{
/opt/hadoop$ bin/hadoop namenode -format
/opt/hadoop$ bin/start-all.sh
}}}
* After the daemons have started, you can visit the following URLs to check that the services are healthy: [http://localhost:50030/ Hadoop administration interface] [http://localhost:50060/ Hadoop TaskTracker status] [http://localhost:50070/ Hadoop DFS status]
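As an extra sanity check, you can also confirm that all of the Hadoop daemons are running with `jps` (a quick optional check, not required by the steps above):
{{{
$ jps
# in pseudo-distributed mode you should see NameNode, DataNode,
# SecondaryNameNode, JobTracker and TaskTracker listed
}}}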
= Step 2: Download and Install Nutch =
== 2.1 Download and extract Nutch ==
* Nutch 1.0 (released 2009/03/28)
{{{
$ cd /opt
$ wget http://ftp.twaren.net/Unix/Web/apache/lucene/nutch/nutch-1.0.tar.gz
$ tar -zxvf nutch-1.0.tar.gz
$ mv nutch-1.0 nutch
}}}
== 2.2 Deploy the Hadoop files into the Nutch directory structure ==
{{{
$ cp -rf /opt/hadoop/* /opt/nutch
}}}
== 2.3 Copy the library (jar) files ==
{{{
$ cd nutch
$ cp -rf *.jar lib/
}}}
= Step 3: Edit the Configuration Files =
* All of the configuration files live under /opt/nutch/conf
== 3.1 hadoop-env.sh ==
* Add the following lines anywhere in the existing hadoop-env.sh file:
{{{
#!sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/opt/nutch
export HADOOP_CONF_DIR=/opt/nutch/conf
export HADOOP_SLAVES=$HADOOP_CONF_DIR/slaves
export HADOOP_LOG_DIR=/tmp/hadoop/logs
export HADOOP_PID_DIR=/tmp/hadoop/pid
export NUTCH_HOME=/opt/nutch
export NUTCH_CONF_DIR=/opt/nutch/conf
}}}
* Load the environment variables:
{{{
$ source /opt/nutch/conf/hadoop-env.sh
}}}
* Note: it is strongly recommended to also write these variables into /etc/bash.bashrc, so they are set in every new shell; see the sketch below.
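A minimal sketch of one way to do that, appending only the export lines (assumes you have sudo rights; adjust the paths to match your own setup):
{{{
$ sudo sh -c 'cat >> /etc/bash.bashrc' << 'EOF'
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/opt/nutch
export HADOOP_CONF_DIR=/opt/nutch/conf
export NUTCH_HOME=/opt/nutch
export NUTCH_CONF_DIR=/opt/nutch/conf
EOF
}}}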
== 3.2 conf/nutch-site.xml ==
* This is the key configuration file; the settings required for this guide are added below. To learn more about the available parameters, see nutch-default.xml.
{{{
$ vim conf/nutch-site.xml
}}}
{{{
#!xml
<configuration>
<property>
 <name>http.agent.name</name>
 <value>nutch</value>
 <description>HTTP 'User-Agent' request header.</description>
</property>
<property>
 <name>http.agent.description</name>
 <value>MyTest</value>
 <description>Further description</description>
</property>
<property>
 <name>http.agent.url</name>
 <value>localhost</value>
 <description>A URL to advertise in the User-Agent header.</description>
</property>
<property>
 <name>http.agent.email</name>
 <value>test@test.org.tw</value>
 <description>An email address</description>
</property>
<property>
 <name>plugin.folders</name>
 <value>/opt/nutch/plugins</value>
 <description>Directories where nutch plugins are located.</description>
</property>
<property>
 <name>plugin.includes</name>
 <value>protocol-(http|httpclient)|urlfilter-regex|parse-(text|html|js|ext|msexcel|mspowerpoint|msword|oo|pdf|rss|swf|zip)|index-(more|basic|anchor)|query-(more|basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
 <description>Regular expression naming plugin directory names</description>
</property>
<property>
 <name>parse.plugin.file</name>
 <value>parse-plugins.xml</value>
 <description>The name of the file that defines the associations between
 content-types and parsers.</description>
</property>
<property>
 <name>db.max.outlinks.per.page</name>
 <value>-1</value>
</property>
<property>
 <name>http.content.limit</name>
 <value>-1</value>
</property>
<property>
 <name>indexer.mergeFactor</name>
 <value>500</value>
 <description>The factor that determines the frequency of Lucene segment
 merges. This must not be less than 2, higher values increase indexing
 speed but lead to increased RAM usage, and increase the number of
 open file handles (which may lead to "Too many open files" errors).
 NOTE: the "segments" here have nothing to do with Nutch segments, they
 are a low-level data unit used by Lucene.</description>
</property>
<property>
 <name>indexer.minMergeDocs</name>
 <value>500</value>
 <description>This number determines the minimum number of Lucene
 Documents buffered in memory between Lucene segment merges. Larger
 values increase indexing speed and increase RAM usage.</description>
</property>
</configuration>
}}}
== 3.3 crawl-urlfilter.txt ==
* Edit the crawl URL filter rules. This file matters: if it is configured badly, the crawl result will be almost empty, which means your search engine will not be able to find any data at all!
{{{
$ vim conf/crawl-urlfilter.txt
}}}
{{{
#!sh
# skip ftp:, & mailto: urls
-^(ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
# accept anything else
+.*
}}}
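If you would rather limit the crawl to a single domain instead of accepting everything, a common pattern (an illustrative sketch, using the nchc.org.tw domain from the URL list in Step 4) is to replace the final `+.*` rule with a domain-specific rule:
{{{
#!sh
# accept only URLs under the nchc.org.tw domain
+^http://([a-z0-9]*\.)*nchc.org.tw/
}}}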
= Step 4: Run Nutch =
== 4.1 Create the URL list ==
{{{
$ mkdir urls
$ echo "http://www.nchc.org.tw" >> ./urls/urls.txt
}}}
== 4.2 Upload the list to HDFS ==
{{{
$ bin/hadoop dfs -put urls urls
}}}
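You can verify that the upload succeeded by listing the directory on HDFS (an optional check):
{{{
$ bin/hadoop dfs -ls urls
}}}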
== 4.3 Run the Nutch crawl ==
* The following command tells Nutch to start working; from here on, the MapReduce jobs will keep it very busy:
{{{
$ bin/nutch crawl urls -dir search -threads 2 -depth 3 -topN 100000
}}}
* The command above prints its progress to stdout. If you would rather read those messages later at your own pace, redirect the output to a log file:
{{{
$ bin/nutch crawl urls -dir search -threads 2 -depth 3 -topN 100000 >& nutch.log
}}}
* While Nutch is running, you can monitor the jobs from a browser on the node1 machine through the [http://localhost:50030 job administration page], the [http://localhost:50070 HDFS administration page], and the [http://localhost:50060 task status page].
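When the crawl completes, its output lives in the `search` directory on HDFS and can be listed at any time (an optional check; the subdirectory names in the comment are what `bin/nutch crawl` normally produces):
{{{
$ bin/hadoop dfs -ls search
# a completed crawl normally contains: crawldb, linkdb, segments, indexes, index
}}}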
Note (important!): if the following error message appears:
{{{
Exception in thread "main" java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.dfs.DistributedFileSystem
}}}
it means that the "copy the library files" step from 2.3 was skipped earlier; copy hadoop-0.18.3*.jar into lib/ and run the crawl again, as shown below.
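For example (a sketch that assumes the directory layout from Step 2):
{{{
$ cp /opt/hadoop/hadoop-0.18.3-*.jar /opt/nutch/lib/
}}}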
= Step 5: Browse the Search Results =
* In Step 4, Nutch took the URLs listed in urls.txt and analyzed them with MapReduce jobs. Once that analysis is finished, the results are viewed through Tomcat. The steps below install and configure your customized search engine.
== 5.1 Install Tomcat ==
* Download Tomcat
{{{
$ cd /opt/
$ wget http://ftp.twaren.net/Unix/Web/apache/tomcat/tomcat-6/v6.0.18/bin/apache-tomcat-6.0.18.tar.gz
}}}
* Extract it
{{{
$ tar -xzvf apache-tomcat-6.0.18.tar.gz
$ mv apache-tomcat-6.0.18 tomcat
}}}
== 5.2 Tomcat server configuration ==
* Modify /opt/tomcat/conf/server.xml to fix the garbled Chinese character problem; see the sketch below.
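A minimal sketch of the change, assuming the default HTTP connector on port 8080: add `URIEncoding="UTF-8"` (and optionally `useBodyEncodingForURI="true"`) to the `<Connector>` element so that query strings are decoded as UTF-8:
{{{
#!xml
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"
           useBodyEncodingForURI="true" />
}}}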
== 5.3 Download the crawl results ==
* First, download the Nutch crawl results stored on HDFS to the local filesystem:
{{{
$ cd /opt/nutch
$ bin/hadoop dfs -get search /opt/search
}}}
== 5.4 Deploy the Nutch search pages to Tomcat ==
* Replace Tomcat's webapps/ROOT with the Nutch search engine web application:
{{{
$ cd /opt/nutch
$ mkdir web
$ cd web
$ jar -xvf ../nutch-1.0.war
$ rm ../nutch-1.0.war
$ mv /opt/tomcat/webapps/ROOT /opt/tomcat/webapps/ROOT-ori
$ cd /opt/nutch
$ mv /opt/nutch/web /opt/tomcat/webapps/ROOT
}}}
== 5.5 Point the search engine at the crawl data ==
* Step 5.4 set up the search engine pages, but those pages are only the user interface. This step links the content to be searched with that interface by adding the `searcher.dir` property inside the existing `configuration` element of the webapp's nutch-site.xml:
{{{
$ vim /opt/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml
}}}
{{{
#!xml
<property>
 <name>searcher.dir</name>
 <value>/opt/search</value>
</property>
}}}
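Before restarting Tomcat, you can confirm that /opt/search really contains the downloaded crawl data (an optional check; the expected subdirectories come from the crawl in Step 4):
{{{
$ ls /opt/search
# expected: crawldb  index  indexes  linkdb  segments
}}}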
== 5.6 Start Tomcat ==
{{{
$ /opt/tomcat/bin/startup.sh
}}}
= Step 6: Enjoy the Results =
Enjoy ! [http://localhost:8080]
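You can also submit a query straight from the address bar (an illustrative example; it assumes the stock Nutch web UI, whose search page is search.jsp, and the search term is arbitrary):
{{{
http://localhost:8080/search.jsp?query=nchc
}}}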