=== 5/16 ===
* Thanks to sunni for the pointers: Nutch now builds successfully in Eclipse.
1. File ==> New ==> Project ==> Java Project ==> Next ==> Project name (set to nutch0.9) ==> Contents ==> Create project from existing source (select the directory where Nutch is stored) ==> Finish.
2. At this point 366 errors appear. Apply the fix found online: download two jars ( [http://nutch.cvs.sourceforge.net/*checkout*/nutch/nutch/src/plugin/parse-mp3/lib/jid3lib-0.5.1.jar jid3lib-0.5.1.jar] and [http://nutch.cvs.sourceforge.net/*checkout*/nutch/nutch/src/plugin/parse-rtf/lib/rtf-parser.jar rtf-parser.jar] ) and put them in nutch-0.9's lib folder. Then in Eclipse, right-click nutch0.9 ==> Properties ==> Java Build Path ==> Libraries ==> Add External JARs... ==> select the two jars just downloaded ==> OK.
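The two jar downloads in step 2 can be scripted instead of fetched by hand; a sketch, assuming wget is available and the current directory is the nutch-0.9 root (URLs taken from the step above):

```shell
# Fetch the two parser jars into nutch-0.9/lib (URLs from step 2).
# Quoted because the paths contain '*' characters.
wget -O lib/jid3lib-0.5.1.jar 'http://nutch.cvs.sourceforge.net/*checkout*/nutch/nutch/src/plugin/parse-mp3/lib/jid3lib-0.5.1.jar'
wget -O lib/rtf-parser.jar 'http://nutch.cvs.sourceforge.net/*checkout*/nutch/nutch/src/plugin/parse-rtf/lib/rtf-parser.jar'
```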
3. A pile of errors still remains. The fix: in Eclipse, right-click nutch0.9 ==> Properties ==> Java Build Path ==> Source ==> delete all the folder entries and add only nutch/conf.
4. All the errors are now gone. Next, edit nutch-site.xml, crawl-urlfilter.txt, hadoop-site.xml, and hadoop-env.sh under nutch/conf, create urls/urls.txt under nutch/, and write the URLs to be crawled into urls.txt.
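For reference, a minimal sketch of the nutch-site.xml edit in step 4 (the property name comes from the stock Nutch config; the agent value here is a placeholder):

```xml
<!-- conf/nutch-site.xml: Nutch refuses to fetch unless an agent name is set -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- placeholder value; use your own crawler's name -->
    <value>MyTestSpider</value>
  </property>
</configuration>
```

In crawl-urlfilter.txt, the stock `+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/` line is edited so the pattern matches the site(s) listed in urls/urls.txt.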
5. Menu Run > "Run..." ==> create "New" for "Java Application"
* set Main class = org.apache.nutch.crawl.Crawl
* on tab Arguments:
* Program Arguments = urls -dir crawl -depth 3 -topN 50
* VM arguments: -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
* click on "Run"
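The Eclipse launch configuration above mirrors Nutch's own one-shot crawl tool; a roughly equivalent invocation from the shell, assuming the working directory is the Nutch root:

```shell
# One-shot crawl via Nutch 0.9's crawl tool: reads seed URLs from urls/,
# writes the crawl database and indexes under crawl/
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
```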