[[PageOutline]]
= Nutch1.5 + Solr3.6.1 =
== Downloads ==
* [http://apache.stu.edu.tw/nutch/1.5/apache-nutch-1.5-bin.tar.gz Nutch1.5]
* [http://ftp.twaren.net/Unix/Web/apache/lucene/solr/3.6.1/apache-solr-3.6.1.tgz Solr3.6.1]
== Steps ==
=== 0. Environment setup ===
==== Install Java and verify the environment variables ====
{{{
$ vim ~/.bashrc
}}}
Add the following line (or the path to whichever Java version is installed):
{{{
export JAVA_HOME=/usr/lib/jvm/java-6-sun/
}}}
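After reloading `~/.bashrc`, it is worth confirming the variable before going further. A minimal check (a sketch; the JDK path is whatever your system actually has):
{{{
#!sh
# Verify that JAVA_HOME points at a directory containing an executable java
check_java_home() {
  if [ -d "$JAVA_HOME" ] && [ -x "$JAVA_HOME/bin/java" ]; then
    echo "JAVA_HOME OK: $JAVA_HOME"
  else
    echo "Invalid JAVA_HOME: '$JAVA_HOME'" >&2
    return 1
  fi
}
check_java_home || echo "Fix JAVA_HOME before running Nutch."
}}}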
=== 1. Nutch configuration ===
==== Unpack the Nutch distribution ====
{{{
$ tar zxvf apache-nutch-1.5-bin.tar.gz
}}}
* The extracted directory is referred to as _[$NUTCH_HOME]_ below.
==== Verify that Nutch runs ====
* Run the following command:
{{{
$ [$NUTCH_HOME]/bin/nutch
}}}
* Expected output:
{{{
Usage: nutch COMMAND
where COMMAND is one of:
crawl one-step crawler for intranets
readdb read / dump crawl db
mergedb merge crawldb-s, with optional filtering
readlinkdb read / dump link db
inject inject new urls into the database
generate generate new segments to fetch from crawl db
freegen generate new segments to fetch from text files
fetch fetch a segment's pages
parse parse a segment's pages
readseg read / dump segment data
mergesegs merge several segments, with optional filtering and slicing
updatedb update crawl db from segments after fetching
invertlinks create a linkdb from parsed segments
mergelinkdb merge linkdb-s, with optional filtering
solrindex run the solr indexer on parsed segments and linkdb
solrdedup remove duplicates from solr
solrclean remove HTTP 301 and 404 documents from solr
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
domainstats calculate domain statistics from crawldb
webgraph generate a web graph from existing segments
linkrank run a link analysis program on the generated web graph
scoreupdater updates the crawldb with linkrank scores
nodedumper dumps the web graph's node scores
plugin load a plugin and run one of its classes main()
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
}}}
* If you see the usage message above, the runtime environment is working.
==== Set the crawler agent name ====
{{{
$ vim [$NUTCH_HOME]/conf/nutch-site.xml
}}}
* Add the following property inside the `<configuration>` element:
{{{
#!xml
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
}}}
==== Set the URLs to crawl ====
* Create the seed URL list (crawling http://www.nchc.org.tw/tw/ as the example):
{{{
$ mkdir -p [$NUTCH_HOME]/urls
$ echo "http://www.nchc.org.tw/tw/" >> [$NUTCH_HOME]/urls/seed.txt
}}}
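`seed.txt` takes one URL per line, so more start pages can be added the same way. A self-contained sketch using a temporary directory in place of [$NUTCH_HOME] (the example.org entry is purely illustrative):
{{{
#!sh
# Nutch reads every line of every file in the urls/ directory as a seed URL
NUTCH_HOME=$(mktemp -d)           # stand-in for the real [$NUTCH_HOME]
mkdir -p "$NUTCH_HOME/urls"
echo "http://www.nchc.org.tw/tw/" >> "$NUTCH_HOME/urls/seed.txt"
echo "http://example.org/"        >> "$NUTCH_HOME/urls/seed.txt"  # illustrative second seed
wc -l < "$NUTCH_HOME/urls/seed.txt"   # counts the seeds
}}}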
==== Configure the URL filter ====
{{{
$ vim [$NUTCH_HOME]/conf/regex-urlfilter.txt
}}}
* Replace the original rules with the following:
{{{
#!text
# accept anything else
+.
}}}
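`+.` accepts every URL the crawler discovers, which can quickly wander off-site. To keep the crawl inside the target domain, a more restrictive rule set (a sketch for the nchc.org.tw example; not part of the original guide) would look like:
{{{
#!text
# skip image and archive files
-\.(gif|GIF|jpg|JPG|png|PNG|zip|gz)$
# accept only URLs under the target domain
+^http://([a-z0-9]*\.)*nchc\.org\.tw/
# reject everything else
-.
}}}
Rules are applied top-down and the first matching pattern decides, so the catch-all `-.` must come last.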
==== Run the crawl from the command line ====
* Crawl 3 levels deep, fetching at most 5 top-scoring pages per level:
{{{
$ [$NUTCH_HOME]/bin/nutch crawl urls -dir crawl -depth 3 -topN 5
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2012-09-11 16:25:29
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
...(omitted)
}}}
* The following messages indicate that the crawl has finished:
{{{
...(continued)
LinkDb: URL filter: true
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911164453
LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911162645
LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911141825
LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911164356
LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911141743
LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911162554
LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911164253
LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911162748
LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911142510
LinkDb: merging with existing linkdb: crawl/linkdb
LinkDb: finished at 2012-09-11 16:46:29, elapsed: 00:00:41
crawl finished: crawl
}}}
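After the crawl, the crawl database can be inspected with the readdb command from the usage listing above; the status counts show how many pages were actually fetched (paths assume the crawl directory used here):
{{{
$ [$NUTCH_HOME]/bin/nutch readdb crawl/crawldb -stats
$ [$NUTCH_HOME]/bin/nutch readdb crawl/crawldb -dump crawldb-dump
}}}
`-stats` prints totals per URL status; `-dump` writes the full database as text into the given output directory.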
=== 2. Configure Solr ===
==== Unpack Solr ====
{{{
$ tar zxvf apache-solr-3.6.1.tgz
}}}
* The extracted directory is referred to as _[$SOLR_HOME]_ below.
==== Configure schema.xml ====
* As a good habit, back up the original file before modifying it:
{{{
$ mv [$SOLR_HOME]/example/solr/conf/schema.xml [$SOLR_HOME]/example/solr/conf/schema.xml.ori
}}}
* Copy the Nutch schema into Solr:
{{{
$ cp [$NUTCH_HOME]/conf/schema.xml [$SOLR_HOME]/example/solr/conf/
}}}
==== Start Solr ====
{{{
$ cd [$SOLR_HOME]/example
$ java -jar start.jar
}}}
* Open the Solr admin pages in a browser:
{{{
#!text
http://localhost:8983/solr/admin/
http://localhost:8983/solr/admin/stats.jsp
}}}
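You can also check from the command line that the example server came up; Solr's example configuration answers a ping request (assuming the default port 8983):
{{{
$ curl "http://localhost:8983/solr/admin/ping"
}}}
The response should report a status of OK once the core is loaded and queryable.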
==== Extra: adding Chinese word segmentation ====
* Edit schema.xml:
{{{
$ vim [$SOLR_HOME]/example/solr/conf/schema.xml
}}}
* Replace the original `solr.TextField` analyzer with an IKAnalyzer-based field type. A typical definition for IKAnalyzer on Solr 3.x looks like the following (an assumption; the page's original snippet did not survive):
{{{
#!xml
<fieldType name="text" class="solr.TextField">
  <analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>
}}}
* Place the IKAnalyzer jar (version 3.2.8 in this example) under the following path:
{{{
#!text
$SOLR_HOME/example/work/$WEB_PATH/webapp/WEB-INF/lib/
}}}
* To test, enter some sample text on the [http://localhost:8983/solr/admin/analysis.jsp Analysis page]; output like the screenshot below means the analyzer is working.
[[Image(shunfa/2012/0911:Alalysis_Page.png,width=800)]]
=== 3. Push the Nutch index into Solr ===
{{{
$ [$NUTCH_HOME]/bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
SolrIndexer: starting at 2012-09-11 16:59:32
Indexing 99 documents
...
}}}
=== 4. Adjust the Solr configuration ===
* Following the official documentation as-is, queries do not work: submitting a search returns an error such as "undefined field text", so the Solr configuration files need the changes below.
==== 1. Modify fields in schema.xml ====
{{{
$ cd [$SOLR_HOME]/example/solr/conf
$ vim schema.xml
}}}
* Modify the following lines:
{{{
#!text
# 31:
# 80:
}}}
==== 2. Modify solrconfig.xml ====
{{{
$ cd [$SOLR_HOME]/example/solr/conf
$ vim solrconfig.xml
}}}
* Make the following change:
{{{
#!text
Replace every occurrence of text with content
}}}
* Explanation: the bundled schema.xml was written by Nutch against the previous Solr release; in the newer configuration the default search field changed from text to content (see the defaultSearchField entry in schema.xml).
* After this step Solr should answer queries normally; only the related UI still needs fine-tuning.
==== 3. Search UI ====
* Solr's default search UI is http://localhost:8983/solr/browse , but since Nutch's schema.xml has replaced Solr's original one, you would have to rewrite schema.xml again to keep using that UI.
* Search UI for Nutch: the Solr distribution also ships a simple rendering of query results at [$SOLR_HOME]/example/solr/conf/xslt/example.xsl. Append "&wt=xslt&tr=example.xsl" to a search URL to get a plain HTML view of the results, e.g. http://localhost:8983/solr/select/?q=hadoop&start=0&rows=10&wt=xslt&tr=example.xsl
==== 4. Anatomy of the query URL ====
* select/?q=hadoop&start=0&rows=143&wt=xslt&tr=example.xsl
* q: the query string
* start: the offset of the first result returned
* rows: the number of results per page
* wt: the response writer (output format), e.g. json, xml, ...
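Putting the pieces together, the query URL can be composed from those parameters; a small sketch (host, port, and parameter set are the defaults used throughout this page):
{{{
#!sh
# Build a Solr select URL from query, paging, and response-format parameters
solr_query_url() {
  q=$1; start=$2; rows=$3; wt=$4
  echo "http://localhost:8983/solr/select/?q=${q}&start=${start}&rows=${rows}&wt=${wt}"
}
solr_query_url hadoop 0 10 json
}}}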
==== 5. The index files have moved ====
* The crawl command only writes the crawl results into the directory you specify; the actual Lucene-format index is created under $SOLR_HOME/example/solr/data when you run solrindex.
==== 6. Replacing the index files ====
* First empty $SOLR_HOME/example/solr/data
* Run solrindex
* Browse the index contents via the [http://localhost:8983/solr/admin/schema.jsp Schema page]
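The steps above can be wrapped in a small helper. This is a sketch under this guide's assumptions ($SOLR_HOME and $NUTCH_HOME set as above, crawl output in ./crawl); it contains an `rm -rf`, so double-check the path before using it:
{{{
#!sh
# Rebuild the Solr index from scratch: empty the data dir, then run solrindex
reindex() {
  data_dir="$SOLR_HOME/example/solr/data"
  [ -d "$data_dir" ] || { echo "no such data dir: $data_dir" >&2; return 1; }
  rm -rf "$data_dir"/*             # step 1: clear the old index files
  "$NUTCH_HOME/bin/nutch" solrindex http://127.0.0.1:8983/solr/ \
      crawl/crawldb -linkdb crawl/linkdb crawl/segments/*       # step 2: reindex
}
}}}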