wiki:waue/2010/0526

Version 4 (modified by waue, 14 years ago)

--

Related links

Installation

  1. Download the latest protocol-smb archive and unzip it; the extracted folder is referred to below as $pro-smb-dir

https://issues.apache.org/jira/secure/attachment/12442365/protocol-smb-dist.zip

  1. Copy the protocol-smb folder (containing three files: jcifs-1.3.0.jar, plugin.xml and protocol-smb.jar) from $pro-smb-dir/build/plugins/ to $nutch_home/plugin/

  1. Edit $nutch_home/conf/nutch-site.xml so that plugin.includes lists protocol-smb
<property>
<name>plugin.includes</name>
<value>protocol-smb| other plugins...</value>
<description>
</description>
</property>
  1. Copy $pro-smb-dir/conf/smb.properties to $nutch_home/conf/ and set its values
  1. URLs take the form smb://server/share
  1. Run the Nutch crawl
    #!/bin/bash
    # The crawl depth is taken from the first argument
    crawl_dep=$1
    echo "$crawl_dep"
    # Report on the command that ran just before this call; stop on failure
    function debug_echo () {
      if [ $? -eq 0 ]; then
          echo "$1 finished"
      else
          echo "$1 failed" >&2
          exit 1
      fi
    }
    source /opt/nutchez/nutch/conf/hadoop-env.sh
    debug_echo "import hadoop-env.sh"
    echo "delete search (local,hdfs) and urls (hdfs) "
    rm -rf /home/nutchuser/nutchez/search
    /opt/nutchez/nutch/bin/hadoop dfs -rmr urls search
    /opt/nutchez/nutch/bin/hadoop dfs -put /home/nutchuser/nutchez/urls urls
    # Crawl the seed URLs to the requested depth
    /opt/nutchez/nutch/bin/nutch crawl urls -dir search -depth "$crawl_dep" -topN 5000 -threads 1000
    debug_echo "nutch crawl"
    # Copy the crawl output from HDFS to the local disk
    /opt/nutchez/nutch/bin/hadoop dfs -get search /home/nutchuser/nutchez/search
    debug_echo "download search"
    # Restart Tomcat so the search front end picks up the new index
    /opt/nutchez/tomcat/bin/shutdown.sh
    /opt/nutchez/tomcat/bin/startup.sh
    debug_echo "tomcat restart"
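
As far as I know, Nutch treats the plugin.includes value as a regular expression that must match a plugin id in full, so stray whitespace inside an alternative changes what is selected. The sketch below is plain Java, not Nutch code; the class name PluginIncludesCheck and the sample values are hypothetical, and it only illustrates how such a value can be sanity-checked before a crawl.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: tests whether a plugin.includes
// value, read as a regular expression, matches a plugin id in full.
class PluginIncludesCheck {
    static boolean selects(String includesValue, String pluginId) {
        return Pattern.compile(includesValue).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Sample value made up for illustration.
        String value = "protocol-smb|urlfilter-regex|parse-(text|html)";
        System.out.println(selects(value, "protocol-smb")); // true
        // The space after '|' becomes part of the alternative, so the id no longer matches.
        System.out.println(selects("protocol-http| protocol-smb", "protocol-smb")); // false
    }
}
```

If the second form is needed, removing the whitespace around `|` restores the match.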
    

Problems encountered

2010-05-27 14:07:19,417 WARN org.apache.nutch.crawl.Injector: Skipping smb://140.110.138.179/share:java.net.MalformedURLException: unknown protocol: smb
  • Tried the following suggested fixes:
    a) A short-term solution is to install the JCIFS jar
    found in the protocol-smb folder into
    JDKHOME/jre/lib/ext and/or JREHOME/lib/ext.
    
    b) If the exception is still thrown after completing step a),
    set the system property by passing the following argument
    to the JVM:
    
    -Djava.protocol.handler.pkgs=jcifs
    
    c) You can also set the property in your code. For example, if
    you start crawling with org.apache.nutch.crawl.Crawl,
    add the following two lines; this has the same effect as b):
    public static void main(String args[]) throws Exception {
    System.setProperty("java.protocol.handler.pkgs", "jcifs");
    new java.util.PropertyPermission("java.protocol.handler.pkgs","read, write");
    //and so on
    
    You can also visit the FAQ page: http://jcifs.samba.org/src/docs/faq.html
    

I also took the brute-force route: I copied jcifs.jar into both jre/lib/ext/ and nutch/lib/, and added -Djava.protocol.handler.pkgs=jcifs to the command that launches Nutch.

The warning still appeared, however, so the injector kept skipping the smb:// URL and the crawl had no entry point.
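
The warning comes from java.net.URL itself: when no stream handler is registered for a scheme, the constructor throws MalformedURLException. The following minimal sketch reproduces that with the standard library only, assuming a JVM where no smb handler is installed. SmbUrlProbe and StubHandler are hypothetical names; the stub is not the jcifs handler and cannot open connections, it only shows that the same URL parses once a handler is supplied.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

// Hypothetical probe class for illustration only.
class SmbUrlProbe {
    // Returns null if the URL parses, otherwise the exception message.
    static String tryParse(String spec) {
        try {
            new URL(spec);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    // Stand-in for a real handler: lets the URL parse, refuses to connect.
    static class StubHandler extends URLStreamHandler {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            throw new IOException("stub handler: no SMB support");
        }
    }

    // Parses the URL with an explicitly supplied handler instead of a registered one.
    static URL parseWithHandler(String spec) throws MalformedURLException {
        return new URL(null, spec, new StubHandler());
    }

    public static void main(String[] args) throws Exception {
        // With no smb handler installed this prints "unknown protocol: smb".
        System.out.println(tryParse("smb://server/share"));
        // With a handler supplied the same URL parses; prints "server".
        System.out.println(parseWithHandler("smb://server/share").getHost());
    }
}
```

Running the first call inside the same JVM and classpath that Nutch uses is a quick way to see whether the jcifs handler is actually visible there.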

So I went back to http://jcifs.samba.org/src/docs/faq.html

and wrote the following program to test the jcifs library on its own, using http://jcifs.samba.org/src/jcifs-1.3.14.jar


import java.net.MalformedURLException;

import jcifs.smb.NtlmPasswordAuthentication;
import jcifs.smb.SmbException;
import jcifs.smb.SmbFile;

public class test {

  /**
   * Lists the entries of an SMB share using jcifs alone, without Nutch.
   *
   * @param args unused
   * @throws MalformedURLException if the smb:// URL cannot be parsed
   * @throws SmbException if the share cannot be listed
   */
  public static void main(String[] args) throws MalformedURLException, SmbException {
    String domain = "WORKSTATION";
    String username = "waue";
    String password = "cccccc";
    String server = "140.110.138.179";
    String share = "share";
    String directory = ".";

    // Credentials are prepared here but not used by the run shown below.
    NtlmPasswordAuthentication auth = new NtlmPasswordAuthentication(domain,
        username, password);
    String smburl = String.format("smb://%s/%s/%s/", server, share, directory);
    // Authenticated variant: SmbFile file = new SmbFile(smburl, auth);
    SmbFile file = new SmbFile(smburl);
    SmbFile[] files = file.listFiles();
    System.err.println("file : ");
    for (SmbFile fi : files) {
      System.err.println(fi.getName());
    }
  }
}

Output

file : 
【影片】/
人月神話.pdf
其他/
【音樂】/
test.txt
【軟體】/
【照片】/
【遊戲】/

This shows that jcifs works on my machine, so the problem lies in the interaction between protocol-smb and Nutch.
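
One thing that may be worth checking on the Nutch side is whether the JVM that actually runs the injector sees both the java.protocol.handler.pkgs property and the handler class. For a package prefix listed in that property, the JVM looks for a class named prefix.protocol.Handler, which for jcifs and smb is jcifs.smb.Handler. The sketch below is a hypothetical diagnostic (HandlerLookupCheck is not part of Nutch or jcifs) and assumes the handler would have to be visible to the system class loader.

```java
// Hypothetical diagnostic for illustration only.
class HandlerLookupCheck {
    // Class name derived from a java.protocol.handler.pkgs entry and a protocol.
    static String handlerClassName(String pkg, String protocol) {
        return pkg + "." + protocol + ".Handler";
    }

    // True if the system class loader can find the class (without initializing it).
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ClassLoader.getSystemClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String pkgs = System.getProperty("java.protocol.handler.pkgs");
        System.out.println("java.protocol.handler.pkgs = " + pkgs);
        String name = handlerClassName("jcifs", "smb");
        System.out.println(name + " loadable: " + isLoadable(name));
    }
}
```

If the property prints as null or the class is not loadable in that JVM, the settings applied on the command line are not reaching the process that parses the seed URLs.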

Conclusion

  • protocol-smb has not yet been integrated with Nutch successfully