Version 11 (modified by shunfa)

Nutch 1.5 + Solr 3.6.1

Download

Steps

0. Prerequisite environment setup

Install Java and confirm the environment variables

$ vim ~/.bashrc

Add the following line (or the path to another Java version):

export JAVA_HOME=/usr/lib/jvm/java-6-sun/
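The export above can be sanity-checked with a small sketch (the function name is ours; the path is simply the example from `~/.bashrc` above) that verifies a directory actually contains a JDK:

```shell
# check_java_home PATH: report whether PATH looks like a usable JDK
# (i.e. contains an executable bin/java)
check_java_home() {
  if [ -x "$1/bin/java" ]; then
    echo "OK: $1"
  else
    echo "missing bin/java under $1"
  fi
}

# The path below is the one exported in ~/.bashrc above
check_java_home /usr/lib/jvm/java-6-sun/
```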

1. Nutch setup

Unpack the Nutch tarball

$ tar zxvf apache-nutch-1.5-bin.tar.gz
  • The extracted directory is referred to below as _[$NUTCH_HOME]_

Verify that it runs

  • Run the following command
    $ [$NUTCH_HOME]/bin/nutch
    
  • Expected output
    Usage: nutch COMMAND
    where COMMAND is one of:
      crawl             one-step crawler for intranets
      readdb            read / dump crawl db
      mergedb           merge crawldb-s, with optional filtering
      readlinkdb        read / dump link db
      inject            inject new urls into the database
      generate          generate new segments to fetch from crawl db
      freegen           generate new segments to fetch from text files
      fetch             fetch a segment's pages
      parse             parse a segment's pages
      readseg           read / dump segment data
      mergesegs         merge several segments, with optional filtering and slicing
      updatedb          update crawl db from segments after fetching
      invertlinks       create a linkdb from parsed segments
      mergelinkdb       merge linkdb-s, with optional filtering
      solrindex         run the solr indexer on parsed segments and linkdb
      solrdedup         remove duplicates from solr
      solrclean         remove HTTP 301 and 404 documents from solr
      parsechecker      check the parser for a given url
      indexchecker      check the indexing filters for a given url
      domainstats       calculate domain statistics from crawldb
      webgraph          generate a web graph from existing segments
      linkrank          run a link analysis program on the generated web graph
      scoreupdater      updates the crawldb with linkrank scores
      nodedumper        dumps the web graph's node scores
      plugin            load a plugin and run one of its classes main()
      junit             runs the given JUnit test
     or
      CLASSNAME         run the class named CLASSNAME
    Most commands print help when invoked w/o parameters.
    
  • If you see the listing above, the runtime environment is OK!

Set the crawler's agent name

$ vim [$NUTCH_HOME]/conf/nutch-site.xml
  • Add the following property:
    <property>
     <name>http.agent.name</name>
     <value>My Nutch Spider</value>
    </property>
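The property above can also be written in one shot with a heredoc; a minimal sketch, assuming the stock nutch-site.xml contains no other properties you need to keep (NUTCH_HOME falls back to the current directory here so the result can be inspected before copying it into a real install):

```shell
# Write a minimal conf/nutch-site.xml containing only http.agent.name;
# NUTCH_HOME defaults to "." so this can be tried outside a real install
NUTCH_HOME="${NUTCH_HOME:-.}"
mkdir -p "$NUTCH_HOME/conf"
cat > "$NUTCH_HOME/conf/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
EOF
echo "wrote $NUTCH_HOME/conf/nutch-site.xml"
```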
    

Set the URLs to crawl

  • Create the seed URL list (here, to crawl http://www.nchc.)
    $ mkdir -p [$NUTCH_HOME]/urls
    $ echo "http://www.nchc.org.tw/tw/" >> [$NUTCH_HOME]/urls/seed.txt
    

Configure the URL filter

$ vim [$NUTCH_HOME]/conf/regex-urlfilter.txt
  • Replace the original rules with the following
    # accept anything else
    +.
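The catch-all `+.` accepts every URL Nutch discovers, including off-site links. A stricter alternative (our suggestion, not part of the original walkthrough) limits the crawl to the seed domain; the heredoc below writes such a filter, using the standard Nutch urlfilter syntax (`+` accept, `-` reject). NUTCH_HOME defaults to the current directory for a dry run:

```shell
# Write a regex-urlfilter.txt restricted to the seed domain
NUTCH_HOME="${NUTCH_HOME:-.}"
mkdir -p "$NUTCH_HOME/conf"
cat > "$NUTCH_HOME/conf/regex-urlfilter.txt" <<'EOF'
# skip non-http schemes
-^(file|ftp|mailto):
# accept only pages under the seed host
+^http://www\.nchc\.org\.tw/
# reject everything else
-.
EOF
```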
    

Run the crawl from the command line

  • Crawl to a depth of 3, fetching at most 5 documents per level
    $ [$NUTCH_HOME]/bin/nutch crawl urls -dir crawl -depth 3 -topN 5
    solrUrl is not set, indexing will be skipped...
    crawl started in: crawl
    rootUrlDir = urls
    threads = 10
    depth = 3
    solrUrl=null
    topN = 5
    Injector: starting at 2012-09-11 16:25:29
    Injector: crawlDb: crawl/crawldb
    Injector: urlDir: urls
    Injector: Converting injected urls to crawl db entries.
...(output truncated)
    
  • The following messages indicate the crawl has completed
...(continued)
    LinkDb: URL filter: true
    LinkDb: internal links will be ignored.
    LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911164453
    LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911162645
    LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911141825
    LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911164356
    LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911141743
    LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911162554
    LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911164253
    LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911162748
    LinkDb: adding segment: file:/opt/nutch-1.5.1/crawl/segments/20120911142510
    LinkDb: merging with existing linkdb: crawl/linkdb
    LinkDb: finished at 2012-09-11 16:46:29, elapsed: 00:00:41
    crawl finished: crawl
    

2. Solr setup

Unpack Solr

$ tar zxvf apache-solr-3.6.1.tgz
  • The extracted directory is referred to below as _[$SOLR_HOME]_

Configure schema.xml

  • Good practice: back up the original file before modifying it
    $ mv [$SOLR_HOME]/example/solr/conf/schema.xml [$SOLR_HOME]/example/solr/conf/schema.xml.ori
    
  • Copy Nutch's schema into Solr
    $ cp [$NUTCH_HOME]/conf/schema.xml [$SOLR_HOME]/example/solr/conf/
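The backup-and-copy steps can be wrapped in a small helper (the function name is ours; the paths mirror the two commands above):

```shell
# install_nutch_schema SOLR_HOME NUTCH_HOME:
# back up Solr's example schema.xml, then install Nutch's copy
install_nutch_schema() {
  conf="$1/example/solr/conf"
  mv "$conf/schema.xml" "$conf/schema.xml.ori"
  cp "$2/conf/schema.xml" "$conf/"
}

# Usage (with the placeholders from this guide):
#   install_nutch_schema "$SOLR_HOME" "$NUTCH_HOME"
```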
    

Start Solr

$ cd [$SOLR_HOME]/example
$ java -jar start.jar
  • Open Solr in a browser
    http://localhost:8983/solr/admin/
    http://localhost:8983/solr/admin/stats.jsp
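Before moving on, it helps to confirm Solr is actually listening; a small polling sketch (the function is ours, and it assumes `curl` is installed):

```shell
# wait_for_solr URL [TRIES]: poll URL once per second until it answers
# or TRIES attempts (default 30) are used up
wait_for_solr() {
  url=$1
  tries=${2:-30}
  while [ "$tries" -gt 0 ]; do
    if curl -s -o /dev/null "$url"; then
      echo "up: $url"
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  echo "timed out waiting for $url"
  return 1
}

# e.g. wait_for_solr http://localhost:8983/solr/admin/
```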
    

3. Import the Nutch index into Solr

$ [$NUTCH_HOME]/bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
SolrIndexer: starting at 2012-09-11 16:59:32
Indexing 99 documents
...

4. Adjust the Solr configuration

  • Following the official documentation as-is, the "Query" step does not work: submitting a search returns a message along the lines of "text is not defined", so the related Solr configuration files must be modified

1. Edit the schema.xml fields

$ cd [$SOLR_HOME]/example/solr/conf
$ vim schema.xml
  • Change the following lines (numbers refer to lines in the file):
    # 31: <schema name="nutch" version="1.5">
    # 80: <field name="content" type="text" stored="true" indexed="true"/>
    

2. Edit the solrconfig.xml fields

$ cd [$SOLR_HOME]/example/solr/conf
$ vim solrconfig.xml
  • Make the following change:
    Replace every <str name="df">text</str> with <str name="df">content</str>
    
  • Explanation: the current configuration was written by Nutch against the previous Solr version; in the new release the default changed from text to content (for details, see the defaultSearchField entry in schema.xml)
  • After completing this step, your Solr should answer queries correctly; only the related UI still needs fine-tuning.
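The replacement can be scripted with sed instead of editing by hand. A demonstration on a throwaway sample file (run the same sed against `$SOLR_HOME/example/solr/conf/solrconfig.xml` for real; `-i.bak` keeps a backup copy):

```shell
# Demonstrate the df substitution on a sample line
printf '%s\n' '<str name="df">text</str>' > /tmp/solrconfig-demo.xml
sed -i.bak 's|<str name="df">text</str>|<str name="df">content</str>|g' /tmp/solrconfig-demo.xml
cat /tmp/solrconfig-demo.xml   # -> <str name="df">content</str>
```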

3. Search UI

4. Understanding the query URL

  • select/?q=hadoop&start=0&rows=143&wt=xslt&tr=example.xsl
    • rows: number of results returned per page
    • wt: response type; can be json, etc.
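Assembled as a shell helper, the parameters above look like this (the helper and the localhost host/port are assumptions; it simply concatenates the documented query-string fields):

```shell
# solr_query_url Q START ROWS WT: build a /select URL from its parts
solr_query_url() {
  echo "http://localhost:8983/solr/select/?q=$1&start=$2&rows=$3&wt=$4"
}

solr_query_url hadoop 0 143 json
# -> http://localhost:8983/solr/select/?q=hadoop&start=0&rows=143&wt=json
```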

5. The index files have moved

  • The crawl command only produces the crawl results in the directory you specify; the actual Lucene-format index is generated under $SOLR_HOME/example/solr/data once you run solrindex

6. Replacing the index files

  • First empty $SOLR_HOME/example/solr/data
  • Run solrindex
  • You can then browse the index contents through the SchemaPage
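Emptying the data directory is destructive, so a guard helps; in this sketch (the function and its safety check are ours) the helper refuses to wipe any path that does not end in `solr/data`:

```shell
# clear_index DIR: remove the contents of a Solr data directory,
# but only if DIR actually ends in solr/data
clear_index() {
  case $1 in
    */solr/data)
      rm -rf "$1"/* 2>/dev/null || true   # tolerate an already-empty dir
      echo "cleared $1"
      ;;
    *)
      echo "refusing to clear $1 (not a solr/data directory)"
      return 1
      ;;
  esac
}

# e.g. clear_index "$SOLR_HOME/example/solr/data", then re-run solrindex
```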
