Version 4 (modified by waue, 15 years ago)
Related links
Installation
- Download the latest protocol-smb archive and extract it; assume the extracted folder is named $pro-smb-dir
https://issues.apache.org/jira/secure/attachment/12442365/protocol-smb-dist.zip
- Copy the protocol-smb folder found in $pro-smb-dir/build/plugins/ (it contains three files: jcifs-1.3.0.jar, plugin.xml, protocol-smb.jar)
to $nutch_home/plugin/
- Edit $nutch_home/conf/nutch-site.xml
    <property>
      <name>plugin.includes</name>
      <value>protocol-smb| other plugins...</value>
      <description> </description>
    </property>
- Copy $pro-smb-dir/conf/smb.properties to $nutch_home/conf/ and set its values
- URLs take the form smb://server/share
- Run the nutch crawl
    #!/bin/bash
    # The crawl depth is passed as the first argument.
    crawl_dep=$1
    echo "$1"

    # Report whether the previous command succeeded; stop the script if it failed.
    function debug_echo () {
      if [ $? -eq 0 ]; then
        echo "$1 finished "
      else
        echo "$1 is error"
        exit 1
      fi
    }

    source /opt/nutchez/nutch/conf/hadoop-env.sh
    debug_echo "import hadoop-env.sh"

    echo "delete search (local,hdfs) and urls (hdfs) "
    rm -rf /home/nutchuser/nutchez/search
    /opt/nutchez/nutch/bin/hadoop dfs -rmr urls search
    /opt/nutchez/nutch/bin/hadoop dfs -put /home/nutchuser/nutchez/urls urls
    #
    /opt/nutchez/nutch/bin/nutch crawl urls -dir search -depth "$crawl_dep" -topN 5000 -threads 1000
    debug_echo "nutch crawl"
    #
    /opt/nutchez/nutch/bin/hadoop dfs -get search /home/nutchuser/nutchez/search
    debug_echo "download search"
    #
    /opt/nutchez/tomcat/bin/shutdown.sh
    /opt/nutchez/tomcat/bin/startup.sh
    debug_echo "tomcat restart"
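The plugin.includes value from the nutch-site.xml step is, as far as I understand it, treated by nutch as a regular expression that must match a plugin id in full; that matching rule is an assumption here, not something these notes confirm. Under that assumption, the sketch below shows why whitespace around the | matters. The class name PluginIncludesCheck and the plugin ids other than protocol-smb are illustrative only.

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    // Returns true if the plugin id matches the whole plugin.includes expression.
    // Assumption: nutch applies the value as a full-match regular expression.
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Illustrative value; the real list of other plugins depends on the installation.
        String includes = "protocol-smb|protocol-http|parse-(text|html)";
        String[] ids = {"protocol-smb", "protocol-http", "parse-html", "protocol-ftp"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(includes, id));
        }
        // A space after the pipe becomes part of the second alternative,
        // so "protocol-http" no longer matches it.
        System.out.println(isIncluded("protocol-smb| protocol-http", "protocol-http"));
    }
}
```

If the matching rule holds, writing the value without spaces around each | avoids silently excluding the plugins listed after protocol-smb.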
Problems encountered
2010-05-27 14:07:19,417 WARN org.apache.nutch.crawl.Injector: Skipping smb://140.110.138.179/share:java.net.MalformedURLException: unknown protocol: smb
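The warning comes from java.net.URL, which rejects any scheme that has no registered stream handler. A minimal sketch for reproducing it outside nutch, using only the JDK; the class name SmbUrlCheck and the placeholder URL smb://server/share are illustrative:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class SmbUrlCheck {
    // Returns null if java.net.URL accepts the string, otherwise the exception message.
    static String urlError(String spec) {
        try {
            new URL(spec);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Pass the seed URL as the first argument; the default is a placeholder.
        String spec = args.length > 0 ? args[0] : "smb://server/share";
        String error = urlError(spec);
        System.out.println(error == null ? "accepted: " + spec : "rejected: " + error);
    }
}
```

Run in a JVM with no smb handler available, this reports the same "unknown protocol: smb" message as the Injector. Running it with the same classpath and JVM options as nutch may help tell whether a handler is visible there.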
- Tried the following fixes:
  a) A short-term solution is to install the JCIFS jar library found in the protocol-smb folder into JDKHOME/jre/lib/ext and (or) JREHOME/lib/ext.
  b) After completing step a), if the exception is still thrown, set the system property by passing the following argument to the JVM:
      -Djava.protocol.handler.pkgs=jcifs
  c) You can also set the property in your code. For example, if you start crawling with org.apache.nutch.crawl.Crawl, add the following two lines. This has the same effect as b).
      public static void main(String args[]) throws Exception {
          System.setProperty("java.protocol.handler.pkgs", "jcifs");
          new java.util.PropertyPermission("java.protocol.handler.pkgs", "read, write");
          //and so on
      }
  Also you can visit the FAQ page: http://jcifs.samba.org/src/docs/faq.html
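I also brute-forced it: I copied jcifs.jar into jre/lib/ext/ and nutch/lib/, and added -Djava.protocol.handler.pkgs=jcifs to the command that launches nutch.
The warning was still not resolved, so the seed URL was skipped and the crawl had no entry point.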
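One way to narrow this down is to check, in a JVM started with the same options as nutch, whether the handler class implied by java.protocol.handler.pkgs can actually be loaded. As documented for java.net.URL, each package prefix in that property is combined with the protocol into a class name of the form prefix.protocol.Handler. The sketch below applies that rule; the class name HandlerLookupCheck is illustrative, and checking only the system class loader is an assumption, since nutch or Hadoop may load classes through a different one.

```java
public class HandlerLookupCheck {
    // Class name derived from a package prefix and a protocol,
    // following the naming rule documented for java.net.URL.
    static String handlerClassName(String packagePrefix, String protocol) {
        return packagePrefix + "." + protocol + ".Handler";
    }

    // Returns true if the system class loader can find the class.
    // Assumption: the system class loader is the relevant one.
    static boolean canLoad(String className) {
        try {
            Class.forName(className, false, ClassLoader.getSystemClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String pkgs = System.getProperty("java.protocol.handler.pkgs", "");
        System.out.println("java.protocol.handler.pkgs = " + pkgs);
        // The property is a list of package prefixes separated by vertical bars.
        for (String prefix : pkgs.split("\\|")) {
            String trimmed = prefix.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String name = handlerClassName(trimmed, "smb");
            System.out.println(name + " loadable: " + canLoad(name));
        }
    }
}
```

With -Djava.protocol.handler.pkgs=jcifs the name checked is jcifs.smb.Handler. If it is reported as not loadable, the jar is not visible to that class loader even though the property is set.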
So I went to http://jcifs.samba.org/src/docs/faq.html
and wrote the following program to test the jcifs project itself: http://jcifs.samba.org/src/jcifs-1.3.14.jar
    import java.net.MalformedURLException;

    import jcifs.smb.NtlmPasswordAuthentication;
    import jcifs.smb.SmbException;
    import jcifs.smb.SmbFile;

    public class test {
        public static void main(String[] args) throws MalformedURLException, SmbException {
            String domain = "WORKSTATION";
            String username = "waue";
            String password = "cccccc";
            String server = "140.110.138.179";
            String share = "share";
            String directory = ".";

            // Credentials are prepared but only used by the commented-out constructor below.
            NtlmPasswordAuthentication auth = new NtlmPasswordAuthentication(domain, username, password);
            String smburl = String.format("smb://%s/%s/%s/", server, share, directory);
            // SmbFile file = new SmbFile(smburl, auth);
            SmbFile file = new SmbFile(smburl);

            // List the entries of the shared directory.
            SmbFile[] files = file.listFiles();
            System.err.println("file : ");
            for (SmbFile fi : files) {
                System.err.println(fi.getName());
            }
        }
    }
Result
    file : 
    【影片】/
    人月神話.pdf
    其他/
    【音樂】/
    test.txt
    【軟體】/
    【照片】/
    【遊戲】/
This shows that jcifs works on my machine, so the problem lies between protocol-smb and nutch.
Conclusion
- protocol-smb has not yet been successfully integrated with nutch