Version 1 (modified by shunfa, 12 years ago)

Nutch 1.5 + Solr 3.6.1

Download

Steps

0. Prerequisites

Install Java and check the environment variables

$ vim ~/.bashrc

Add the following line (or the path of whichever Java version is installed)

export JAVA_HOME=/usr/lib/jvm/java-6-sun/
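
A quick way to confirm the variable points at a usable Java installation is sketched below. The helper name java_home_ok is hypothetical (it is not part of Nutch or the JDK), and the check only assumes that a valid JAVA_HOME contains an executable bin/java.

```shell
# Sketch: check that a directory looks like a usable JAVA_HOME.
# java_home_ok is a hypothetical helper, not part of Nutch or the JDK.
java_home_ok() {
    # Succeeds only when the given directory holds an executable bin/java.
    # ${1%/} drops a trailing slash so "/usr/lib/jvm/java-6-sun/" also works.
    [ -n "$1" ] && [ -x "${1%/}/bin/java" ]
}

if java_home_ok "${JAVA_HOME:-}"; then
    echo "JAVA_HOME OK: $JAVA_HOME"
else
    echo "JAVA_HOME is unset or has no executable bin/java" >&2
fi
```

Run it in a new shell (or after reloading ~/.bashrc) so that the exported value is actually in effect.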

1. Nutch setup

Extract the Nutch package

$ tar zxvf apache-nutch-1.5-bin.tar.gz
  • The extracted directory is referred to below as _[$NUTCH_HOME]_
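
If it is unclear which directory the archive unpacks into, the name can be read from the tarball itself. The sketch below assumes the archive has a single top-level directory; tarball_root is a hypothetical helper, and exporting NUTCH_HOME as a real shell variable is a convenience that this page does not otherwise rely on.

```shell
# tarball_root is a hypothetical helper: it prints the top-level directory
# stored in a gzip-compressed tarball, without extracting anything.
# Assumption: every entry in the archive lives under one top-level directory.
tarball_root() {
    # Take the first listed entry and cut everything from the first slash on.
    tar tzf "$1" | sed -n '1s|/.*||p'
}

archive=apache-nutch-1.5-bin.tar.gz
if [ -f "$archive" ]; then
    # Only runs when the archive is in the current directory.
    NUTCH_HOME="$PWD/$(tarball_root "$archive")"
    export NUTCH_HOME
    echo "NUTCH_HOME=$NUTCH_HOME"
fi
```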

Verify that Nutch runs

  • Run the following command
    $ [$NUTCH_HOME]/bin/nutch
    
  • Output
    Usage: nutch COMMAND
    where COMMAND is one of:
      crawl             one-step crawler for intranets
      readdb            read / dump crawl db
      mergedb           merge crawldb-s, with optional filtering
      readlinkdb        read / dump link db
      inject            inject new urls into the database
      generate          generate new segments to fetch from crawl db
      freegen           generate new segments to fetch from text files
      fetch             fetch a segment's pages
      parse             parse a segment's pages
      readseg           read / dump segment data
      mergesegs         merge several segments, with optional filtering and slicing
      updatedb          update crawl db from segments after fetching
      invertlinks       create a linkdb from parsed segments
      mergelinkdb       merge linkdb-s, with optional filtering
      solrindex         run the solr indexer on parsed segments and linkdb
      solrdedup         remove duplicates from solr
      solrclean         remove HTTP 301 and 404 documents from solr
      parsechecker      check the parser for a given url
      indexchecker      check the indexing filters for a given url
      domainstats       calculate domain statistics from crawldb
      webgraph          generate a web graph from existing segments
      linkrank          run a link analysis program on the generated web graph
      scoreupdater      updates the crawldb with linkrank scores
      nodedumper        dumps the web graph's node scores
      plugin            load a plugin and run one of its classes main()
      junit             runs the given JUnit test
     or
      CLASSNAME         run the class named CLASSNAME
    Most commands print help when invoked w/o parameters.
    
  • If the output above appears, the environment is working.

Set the crawler agent name

$ vim [$NUTCH_HOME]/conf/nutch-site.xml
  • Add the following property inside the <configuration> element:
    <property>
     <name>http.agent.name</name>
     <value>My Nutch Spider</value>
    </property>
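
For reference, a complete minimal file containing only this property might look like the sketch below. write_nutch_site is a hypothetical helper, and the example deliberately writes to a scratch directory so that an existing [$NUTCH_HOME]/conf/nutch-site.xml is not overwritten by accident.

```shell
# write_nutch_site is a hypothetical helper that writes a minimal
# nutch-site.xml into the directory given as $1, overwriting any existing file.
write_nutch_site() {
    cat > "$1/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
EOF
}

# Example: write into a scratch directory first and review the result
# before copying it to [$NUTCH_HOME]/conf/.
conf_dir=$(mktemp -d)
write_nutch_site "$conf_dir"
cat "$conf_dir/nutch-site.xml"
```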
    

Set the URLs to crawl

  • Create the seed URL list (crawling http://www.nchc. as the example)
    $ mkdir -p [$NUTCH_HOME]/urls
    $ echo "http://www.nchc.org.tw/tw/" >> [$NUTCH_HOME]/urls/seed.txt
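
Because >> appends, running the echo command twice leaves the same URL in seed.txt twice. A minimal sketch of an append that skips URLs already present is shown below; add_seed is a hypothetical helper, and the example writes to a scratch directory rather than to [$NUTCH_HOME]/urls.

```shell
# add_seed is a hypothetical helper: append a URL to a seed file
# only if that exact line is not already there.
# $1 = seed file, $2 = URL
add_seed() {
    mkdir -p "$(dirname "$1")"
    touch "$1"
    # -x: match the whole line, -F: treat the URL as a fixed string, -q: quiet
    grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# Example: adding the same URL twice still leaves a single line in the file.
seed_file=$(mktemp -d)/urls/seed.txt
add_seed "$seed_file" "http://www.nchc.org.tw/tw/"
add_seed "$seed_file" "http://www.nchc.org.tw/tw/"
cat "$seed_file"
```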
    

Configure the URL filter

$ vim [$NUTCH_HOME]/conf/regex-urlfilter.txt
  • Replace the original setting with the following
    # accept anything else
    +.
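
Rules in regex-urlfilter.txt start with + (accept) or - (reject), and the first rule whose pattern matches a URL decides its fate; a URL that matches no rule is ignored. The sketch below imitates that ordering so a rule set can be tried against sample URLs before crawling. first_match_filter is a hypothetical helper, and it uses grep -E as an approximation of the Java regular expressions that Nutch applies, so complex patterns may behave differently.

```shell
# first_match_filter is a hypothetical helper that mimics the rule order
# in regex-urlfilter.txt: the first rule whose pattern matches decides.
# $1 = rules file, $2 = URL. Exit status 0 = accept, 1 = reject.
# Approximation: patterns are evaluated with grep -E, not Java regex.
first_match_filter() {
    while IFS= read -r rule; do
        case $rule in
            ''|'#'*) continue ;;          # skip blank lines and comments
        esac
        sign=${rule%"${rule#?}"}          # first character: + or -
        pattern=${rule#?}                 # rest of the line: the regex
        if printf '%s\n' "$2" | grep -Eq -- "$pattern"; then
            if [ "$sign" = "+" ]; then return 0; else return 1; fi
        fi
    done < "$1"
    return 1                              # no rule matched: URL is ignored
}

# Example: with the accept-everything rule, the seed URL passes.
rules=$(mktemp)
printf '%s\n' '# accept anything else' '+.' > "$rules"
first_match_filter "$rules" "http://www.nchc.org.tw/tw/" && echo accepted
```

Replacing the +. rule with a narrower pattern such as +^http://([a-z0-9]*\.)*nchc.org.tw/ would keep the crawl inside one domain; that pattern is an illustration, not a setting from this page.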
    

Run the crawl from the command line

  • Crawl to a depth of 3, fetching at most five documents per level
    $ [$NUTCH_HOME]/bin/nutch crawl urls -dir crawl -depth 3 -topN 5
    solrUrl is not set, indexing will be skipped...
    crawl started in: crawl
    rootUrlDir = urls
    threads = 10
    depth = 3
    solrUrl=null
    topN = 5
    Injector: starting at 2012-09-11 16:25:29
    Injector: crawlDb: crawl/crawldb
    Injector: urlDir: urls
    Injector: Converting injected urls to crawl db entries.
    ...(略)
    
  • The following message indicates that the crawl has finished
