Version 11 (modified by shunfa, 12 years ago) (diff) |
---|
Nutch1.5 + Solr3.6.1
下載
Steps
0. 前置環境設定
安裝JAVA,確認環境變數
$ vim ~/.bashrc
加入下列參數(or其他版本的Java路徑)
export JAVA_HOME=/usr/lib/jvm/java-6-sun/
1. Nutch設定
解壓縮nutch安裝包
$ tar zxvf apache-nutch-1.5-bin.tar.gz
- 解壓縮的資料路徑,以下開始以_[$NUTCH_HOME]_表示
確認是否可以執行
- 執行以下指令
$ [$NUTCH_HOME]/bin/nutch
- 執行結果
Usage: nutch COMMAND where COMMAND is one of: crawl one-step crawler for intranets readdb read / dump crawl db mergedb merge crawldb-s, with optional filtering readlinkdb read / dump link db inject inject new urls into the database generate generate new segments to fetch from crawl db freegen generate new segments to fetch from text files fetch fetch a segment's pages parse parse a segment's pages readseg read / dump segment data mergesegs merge several segments, with optional filtering and slicing updatedb update crawl db from segments after fetching invertlinks create a linkdb from parsed segments mergelinkdb merge linkdb-s, with optional filtering solrindex run the solr indexer on parsed segments and linkdb solrdedup remove duplicates from solr solrclean remove HTTP 301 and 404 documents from solr parsechecker check the parser for a given url indexchecker check the indexing filters for a given url domainstats calculate domain statistics from crawldb webgraph generate a web graph from existing segments linkrank run a link analysis program on the generated web graph scoreupdater updates the crawldb with linkrank scores nodedumper dumps the web graph's node scores plugin load a plugin and run one of its classes main() junit runs the given JUnit test or CLASSNAME run the class named CLASSNAME Most commands print help when invoked w/o parameters.
- 若出現以上片段,則執行環境OK!
設定爬取機器人名稱
$ vim [$NUTCH_HOME]/conf/nutch-site.xml
- 加入以下資訊:
<property> <name>http.agent.name</name> <value>My Nutch Spider</value> </property>
設定欲爬取的網址
- 建立網址資料(以爬取http://www.nchc.)
$ mkdir -p [$NUTCH_HOME]/urls $ echo "http://www.nchc.org.tw/tw/" >> [$NUTCH_HOME]/urls/seed.txt
設定filter
$ vim [$NUTCH_HOME]/conf/regex-urlfilter.txt
- 用下列文字取代原始設定
# accept anything else +.
透過指令執行爬取任務
- 深度3層,每層最多抓取五個文件
$ [$NUTCH_HOME]/bin/nutch crawl urls -dir crawl -depth 3 -topN 5 solrUrl is not set, indexing will be skipped... crawl started in: crawl rootUrlDir = urls threads = 10 depth = 3 solrUrl=null topN = 5 Injector: starting at 2012-09-11 16:25:29 Injector: crawlDb: crawl/crawldb Injector: urlDir: urls Injector: Converting injected urls to crawl db entries. ...(略)
- 出現以下訊息,表示已經抓取完成
...(續) LinkDb: URL filter: true LinkDb: internal links will be ignored. LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911164453 LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911162645 LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911141825 LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911164356 LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911141743 LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911162554 LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911164253 LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911162748 LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911142510 LinkDb: merging with existing linkdb: crawl/linkdb LinkDb: finished at 2012-09-11 16:46:29, elapsed: 00:00:41 crawl finished: crawl
2. 配置Solr
解壓縮Solr
$ tar zxvf apache-solr-3.6.1.tgz
- 解壓縮的資料路徑,以下開始以_[$SOLR_HOME]_表示
配置schema.xml
- 保持良好習慣,修改原始檔案前請先備份
$ mv [$SOLR_HOME]/example/solr/conf/schema.xml [$SOLR_HOME]/example/solr/conf/schema.xml.ori
- 將Nutch的配置複製到Solr中
$ cp [$NUTCH_HOME]/conf/schema.xml [$SOLR_HOME]/example/solr/conf/
啟動Solr
$ cd [$SOLR_HOME]/example $ java -jar start.jar
- 用瀏覽器開啟Solr
http://localhost:8983/solr/admin/ http://localhost:8983/solr/admin/stats.jsp
3. 將Nutch 的 index 匯入至 Solr
$ [$NUTCH_HOME]/bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/* SolrIndexer: starting at 2012-09-11 16:59:32 Indexing 99 documents ...
4. 修改Solr設定
- 根據官方文件步驟,無法順利完成"Query",當送出搜尋時,會出現text沒有定義之類的訊息,於是必須修改Solr相關的設定檔
1.修改schema.xml欄位
$ cd [$SOLR_HOME]/example/solr/conf $ vim schema.xml
- 修改行號如下:
# 31: <schema name="nutch" version="1.5"> # 80: <field name="content" type="text" stored="true" indexed="true"/>
2.修改solrconfig.xml欄位
$ cd [$SOLR_HOME]/example/solr/conf $ vim solrconfig.xml
- 修改如下:
把所有的<str name="df">text</str>改成<str name="df">content</str>
- 說明:由於目前的配置是nutch針對solr前一版本所撰寫,新版的配置預設以由text改成content(詳情可見schema.xml中的defaultSearchField項目)
- 完成此一步驟後,你的solr應該可以正常Query,只是相關的UI還需作細部調整。
3.Search UI
- Solr的預設UI為 http://localhost:8389/solr/browse ,但因已經將nutch的schema.xml覆蓋至原本solr的schema.xml,因此如果要繼續使用solr的search UI必須再改寫schema.xml。
- Search UI for Nutch:Solr目錄中,也提供一個簡易的Search Query的結果,路徑為[$SOLR_HOME]/example/conf/xslt/example.xsl,只要網址搜尋後面加上"&wt=xslt&tr=example.xsl"即可看到一個簡易版的html呈現搜尋結果,如: http://localhost:8983/solr/select/?q=hadoop&start=0&rows=10&wt=xslt&tr=example.xsl
4. 網址列的學問
- select/?q=hadoop&start=0&rows=143&wt=xslt&tr=example.xsl
- rows:每頁顯示的數量
- wt: type, 可為json...等
5. index files已經搬家
- 使用crawl指令,只會在你所指定的資料夾產生相對應的爬取結果,至於真正Lucene格式的index會在你下完solrindex產生於$SOLR_HOME/example/solr/data中
6. 置換index files
- 先清空$SOLR_HOME/example/solr/data
- 執行solrindex
- 透過SchemaPage,即可瀏覽index的內容
Attachments (1)
- Alalysis_Page.png (97.0 KB) - added by shunfa 12 years ago.
Download all attachments as: .zip