wiki:shunfa/2012/0911

Context Navigation

← Previous Version
View Latest Version
Next Version →

Version 1 (modified by shunfa, 13 years ago) (diff)
--

Nutch1.5 + Solr3.6.1
1. 下載
2. Steps
  1. 0. 前置環境設定
    1. 安裝JAVA，確認環境變數
  2. 1. Nutch設定

Nutch1.5 + Solr3.6.1

下載

Steps

0. 前置環境設定

安裝JAVA，確認環境變數

$ vim ~/.bashrc

加入下列參數(or其他版本的Java路徑)

export JAVA_HOME=/usr/lib/jvm/java-6-sun/

1. Nutch設定

解壓縮nutch安裝包

$ tar zxvf apache-nutch-1.5-bin.tar.gz

解壓縮的資料路徑，以下開始以_[$NUTCH_HOME]_表示

確認是否可以執行

執行以下指令
```
$ [$NUTCH_HOME]/bin/nutch
```

執行結果

Usage: nutch COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets
  readdb            read / dump crawl db
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  solrindex         run the solr indexer on parsed segments and linkdb
  solrdedup         remove duplicates from solr
  solrclean         remove HTTP 301 and 404 documents from solr
  parsechecker      check the parser for a given url
  indexchecker      check the indexing filters for a given url
  domainstats       calculate domain statistics from crawldb
  webgraph          generate a web graph from existing segments
  linkrank          run a link analysis program on the generated web graph
  scoreupdater      updates the crawldb with linkrank scores
  nodedumper        dumps the web graph's node scores
  plugin            load a plugin and run one of its classes main()
  junit             runs the given JUnit test
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

若出現以上片段，則執行環境OK!

設定爬取機器人名稱

$ vim [$NUTCH_HOME]/conf/nutch-site.xml

加入以下資訊：

<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>

設定欲爬取的網址

建立網址資料(以爬取http://www.nchc.)

$ mkdir -p [$NUTCH_HOME]/urls
$ echo "http://www.nchc.org.tw/tw/" >> [$NUTCH_HOME]/urls/seed.txt

設定filter

$ vim [$NUTCH_HOME]/conf/regex-urlfilter.txt

用下列文字取代原始設定
```
# accept anything else
+.
```

透過指令執行爬取任務

深度3層，每層最多抓取五個文件

$ [$NUTCH_HOME]/bin/nutch crawl urls -dir crawl -depth 3 -topN 5
solrUrl is not set, indexing will be skipped...
crawl started in: crawl
rootUrlDir = urls
threads = 10
depth = 3
solrUrl=null
topN = 5
Injector: starting at 2012-09-11 16:25:29
Injector: crawlDb: crawl/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
...(略)

出現以下訊息，表示已經抓取完成

Attachments (1)

Alalysis_Page.png (97.0 KB) - added by shunfa 13 years ago.

Download all attachments as: .zip

Download in other formats:

Plain Text