[[PageOutline]]

{{{
#!html
<div style="text-align: center;"><big
style="font-weight: bold;"><big><big>The Complete Nutch Guide</big></big></big></div>
}}}


= Introduction =
 * Although this has been tested before, and many people online have shared their success stories, this guide focuses on:
   * installing Nutch completely, including a fix for the garbled Chinese (character encoding) problem
   * setting Nutch up from a Hadoop point of view
   * building a search engine that indexes not only web pages but also the files they link to (such as PDF and MS Word documents)

= Environment =
 * Directory layout
|| /opt/nutch || Nutch home directory ||
|| /opt/nutch_conf || Nutch configuration files ||
|| /opt/hadoop || Hadoop home directory ||
|| /opt/conf || Hadoop configuration files ||
|| /tmp/hadoop/ [[br]] /tmp/nutch || logs, intermediate files and temporary files ||

= Step 1: Install a Hadoop Cluster =

 * See [wiki:0330Hadoop_Lab3 Hadoop cluster installation] for reference.
 * A single-node setup also works, and in that case installing Nutch directly is even simpler; see [wiki:waue/2009/0406 single-node Nutch installation], but use the configuration files from this guide, as they are more complete.
 * After the Hadoop cluster is installed, /opt should be owned by your user, SSH login between both nodes should work without a password, and Hadoop should run correctly, installed under /opt/hadoop with its configuration in /opt/conf; a quick check is sketched below.
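 * A minimal sketch for that check (node1 and node2 are the hostnames used throughout this guide; adjust them to your own cluster):
{{{
$ ls -ld /opt                      # should be owned by your user
$ ssh node1 hostname               # should not ask for a password
$ ssh node2 hostname
$ /opt/hadoop/bin/hadoop version   # the existing Hadoop install should respond
}}}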

= Step 2: Download and Install =

== 2.1 Download and extract Nutch ==
 * Nutch 1.0 (released 2009/03/28)
{{{
$ cd /opt
$ wget http://ftp.twaren.net/Unix/Web/apache/lucene/nutch/nutch-1.0.tar.gz
$ tar -zxvf nutch-1.0.tar.gz
$ mv nutch-1.0 nutch
}}}
== 2.2 Lay out the Hadoop and Nutch directories ==
{{{
$ cd /opt
$ mv nutch/conf ./nutch_conf
$ cp -rf conf/* nutch_conf
$ cp -rf hadoop/* nutch
}}}
 * After these steps, the Nutch configuration files live under /opt/nutch_conf, the existing Hadoop settings (/opt/conf) are carried over into Nutch's configuration, and the Hadoop binaries inside the Nutch home are the same version as the Hadoop that is already running.
 * This layout separates Nutch from Hadoop, and the programs from their configuration (logs are all written to /tmp). The benefit is that removing Nutch only means deleting its directories, without touching the original Hadoop install.
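 * A quick way to confirm this layout (purely a sanity check; nothing is modified):
{{{
$ ls /opt/nutch_conf | head
$ ls /opt/nutch/bin/hadoop /opt/nutch/bin/nutch
}}}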

= Step 3: Edit the Configuration Files =
 * All configuration files are under /opt/nutch_conf
== 3.1 hadoop-env.sh ==
 * Add the following lines anywhere in the existing hadoop-env.sh
{{{
#!sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/opt/nutch
export HADOOP_CONF_DIR=/opt/nutch_conf
export HADOOP_SLAVES=$HADOOP_CONF_DIR/slaves
export HADOOP_LOG_DIR=/tmp/hadoop/logs
export HADOOP_PID_DIR=/tmp/hadoop/pid
export NUTCH_HOME=/opt/nutch
export NUTCH_CONF_DIR=/opt/nutch_conf
}}}
 * Load the environment settings
{{{
$ source /opt/nutch_conf/hadoop-env.sh
}}}
 * Note: to be safe, it is strongly recommended to also write this line into /etc/bash.bashrc so the settings always take effect!
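 * A minimal way to do that (it simply appends the source line; writing /etc/bash.bashrc needs root privileges):
{{{
$ echo 'source /opt/nutch_conf/hadoop-env.sh' | sudo tee -a /etc/bash.bashrc
}}}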

== 3.2 hadoop-site.xml ==
{{{
#!sh
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://node1:9000/</value>
<description> </description>
</property>
<property>
<name>mapred.job.tracker</name>
<value>node1:9001</value>
<description> </description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/hadoop/hadoop-${user.name}</value>
<description> </description>
</property>
</configuration>
}}}
== 3.3 nutch-site.xml ==
 * This is the most important configuration file; the required entries are listed below. For more information about the available parameters, see nutch-default.xml.
{{{
#!sh
<configuration>
<property>
<name>http.agent.name</name>
<value>nutch</value>
<description>HTTP 'User-Agent' request header. </description>
</property>
<property>
<name>http.agent.description</name>
<value>nutch-crawl</value>
<description>Further description</description>
</property>
<property>
<name>http.agent.url</name>
<value>node1</value>
<description>A URL to advertise in the User-Agent header. </description>
</property>
<property>
<name>http.agent.email</name>
<value>user@nchc.org.tw</value>
<description>An email address
</description>
</property>
<property>
<name>plugin.folders</name>
<value>/opt/nutch/plugins</value>
<description>Directories where nutch plugins are located. </description>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-(http|httpclient)|urlfilter-regex|parse-(text|html|js|ext|msexcel|mspowerpoint|msword|oo|pdf|rss|swf|zip)|index-(more|basic|anchor)|query-(more|basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description> Regular expression naming plugin directory names</description>
</property>
<property>
<name>parse.plugin.file</name>
<value>parse-plugins.xml</value>
<description>The name of the file that defines the associations between
content-types and parsers.</description>
</property>
<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
<description> </description>
</property>
<property>
<name>http.content.limit</name>
<value>-1</value>
</property>
<property>
<name>indexer.mergeFactor</name>
<value>500</value>
<description>The factor that determines the frequency of Lucene segment merges. </description>
</property>
<property>
<name>indexer.minMergeDocs</name>
<value>500</value>
<description>This number determines the minimum number of Lucene Documents buffered in memory between segment merges. </description>
</property>
</configuration>
}}}
== 3.4 slaves ==

 * This file needs no extra configuration; it simply follows your Hadoop cluster layout. The values used in our environment are listed below.
{{{
#!sh
node1
node2
}}}
== 3.5 crawl-urlfilter.txt ==
 * Edit the crawl rules carefully. This file matters because a bad configuration leaves the crawl result nearly empty, which means your search engine will find nothing.
{{{
#!sh
# skip ftp:, & mailto: urls
-^(ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[*!@]
# accept anything else
+.*
}}}
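 * The rules above end with +.* , which accepts every URL that is not skipped. If you only want to crawl a single site, a common variant (shown here for the example site used in step 4.1; substitute your own domain) is to replace that last line with a domain-limited rule:
{{{
#!sh
# accept only pages under nchc.org.tw (use this instead of "+.*")
+^http://([a-z0-9]*\.)*nchc.org.tw/
}}}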

== 3.6 regex-urlfilter.txt ==
 * The official site rarely mentions this file, but crawl-urlfilter.txt holds the rules for crawling an intranet, while regex-urlfilter.txt holds the rules for crawling the internet.

{{{
$ cd /opt/nutch_conf
$ cp regex-urlfilter.txt regex-urlfilter.txt-bek
$ cp crawl-urlfilter.txt regex-urlfilter.txt
}}}

= Step 4: Run Nutch =

 * This assumes Hadoop has already been started and is running; Nutch simply runs on top of that working platform.
 * If your Hadoop is not running yet, issue bin/start-all.sh on the master node (this guide uses node1 as the master). If your environment is completely clean (HDFS has never been formatted), run the following on the master node:
 * either /opt/nutch or /opt/hadoop works as the working directory
{{{
$ cd /opt/nutch
$ bin/hadoop namenode -format
$ bin/start-all.sh
}}}
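 * Once start-all.sh returns, you can check on node1 that the daemons are up with jps (a JDK tool); the exact list depends on whether node1 also acts as a slave:
{{{
$ jps
# expect NameNode, SecondaryNameNode and JobTracker on the master,
# plus DataNode and TaskTracker if node1 is also a slave
}}}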

== 4.1 Edit the URL list ==
{{{
$ mkdir urls
$ vim urls/urls.txt
}}}

{{{
#!sh
http://www.nchc.org.tw
}}}

== 4.2 Upload the URL list to HDFS ==
{{{
$ bin/hadoop dfs -put urls urls
}}}
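 * To confirm the upload, list the directory on HDFS; urls.txt should appear inside:
{{{
$ bin/hadoop dfs -ls urls
}}}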
== 4.3 Run the Nutch crawl ==
 * The following command tells Nutch to start working; from then on, MapReduce jobs will run flat out.
{{{
$ bin/nutch crawl urls -dir search -threads 2 -depth 3 -topN 100000
}}}
 * The command above prints its progress to stdout. If you would rather read those messages later, redirect the output to a log file:
{{{
$ bin/nutch crawl urls -dir search -threads 2 -depth 3 -topN 100000 >& nutch.log
}}}
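 * With the output redirected like this, you can still follow the progress from another terminal:
{{{
$ tail -f nutch.log
}}}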
 * While Nutch is running, you can monitor it from a browser on node1 via the [http://localhost:50030 job management page], the [http://localhost:50070 HDFS management page], and the [http://localhost:50060 task tracker page].

= Step 5: Browse the Search Results =
 * In step 4, Nutch takes the URLs you listed in urls.txt and analyses them through a MapReduce workflow. Once the analysis is done, the results are viewed through Tomcat. The steps below install and configure your customised search engine.

== 5.1 Install Tomcat ==
 * Download Tomcat
{{{
$ cd /opt/
$ wget http://ftp.twaren.net/Unix/Web/apache/tomcat/tomcat-6/v6.0.18/bin/apache-tomcat-6.0.18.tar.gz
}}}

 * Extract it
{{{
$ tar -xzvf apache-tomcat-6.0.18.tar.gz
$ mv apache-tomcat-6.0.18 tomcat
}}}

== 5.2 Configure the Tomcat server ==

 * Edit /opt/tomcat/conf/server.xml to fix the garbled Chinese (character encoding) problem
{{{
#!sh
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443" URIEncoding="UTF-8"
useBodyEncodingForURI="true" />
}}}
== 5.3 Download the crawl results ==

 * First, download Nutch's crawl output from HDFS to the local filesystem
{{{
$ cd /opt/nutch
$ bin/hadoop dfs -get search /opt/search
}}}
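 * A quick look at the downloaded result (for Nutch 1.0 the crawl directory normally contains crawldb, index, indexes, linkdb and segments):
{{{
$ ls /opt/search
}}}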

== 5.4 Deploy the Nutch search pages to Tomcat ==

 * Replace Tomcat's webapps/ROOT with the Nutch search engine web application
{{{
$ cd /opt/nutch
$ mkdir web
$ cd web
$ jar -xvf ../nutch-1.0.war
$ rm ../nutch-1.0.war
$ mv /opt/tomcat/webapps/ROOT /opt/tomcat/webapps/ROOT-ori
$ cd /opt/nutch
$ mv /opt/nutch/web /opt/tomcat/webapps/ROOT
}}}
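 * A quick check that the web application is now in place (the nutch-site.xml path listed here is the same file edited in step 5.5):
{{{
$ ls /opt/tomcat/webapps/ROOT/
$ ls /opt/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml
}}}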
== 5.5 Point the search pages at the crawl data ==
 * Step 5.4 set up the search engine pages, but on their own they are only an interface. This step links the search interface to the content it should search.
{{{
$ vim /opt/tomcat/webapps/ROOT/WEB-INF/classes/nutch-site.xml
}}}

{{{
#!sh
<configuration>
<property>
<name>searcher.dir</name>
<value>/opt/search</value>
</property>
</configuration>
}}}

== 5.6 Start Tomcat ==
{{{
$ /opt/tomcat/bin/startup.sh
}}}
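 * To confirm Tomcat came up, check the startup log or simply hit port 8080 (curl is optional; a browser works just as well):
{{{
$ tail /opt/tomcat/logs/catalina.out
$ curl -I http://localhost:8080/
}}}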

= Step 6: Enjoy the Results =

Enjoy ! [http://localhost:8080]