Version 2 (modified by waue, 14 years ago)
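Changes in Nutch 1.2
Nutch 1.2 differs from Nutch 1.0 in many ways: Lucene has been updated, and the way the search webapp is tied to the index store has changed. Through trial and error I arrived at the following procedure, which appears to work:
News:
- Further testing shows that, whatever the folder under /opt/tomcat/webapps/ is called, only the crawl directory inside the folder from which /opt/tomcat/bin/startup.sh is run gets picked up as the search index. So no matter how many folders exist under /opt/tomcat/webapps/, searches only reach the crawl directory in the folder where you ran startup.sh.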
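Prerequisites
Assume the index store has already been built by running bin/nutch crawl against http://www.nchc.org.tw/tw/ and then downloaded to the local machine at ~/kkk. (So kkk/ contains index, indexes, segments, crawldb, and linkdb.)
Tomcat is installed in /opt/tomcat/
Nutch is installed in /opt/nutch/
Assume we are creating a search page named 0311test.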
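Before copying the index store into the webapp, it can help to confirm that it has the layout described above. The following is a minimal sketch, not part of Nutch: check_crawl_dir is a hypothetical helper name, and it only tests that the five subdirectories named in the prerequisites exist, not that their contents are valid.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Nutch): report which of the
# subdirectories listed in the prerequisites are missing from a crawl dir.
check_crawl_dir() {
    dir=$1
    missing=""
    for sub in crawldb linkdb segments index indexes; do
        if [ ! -d "$dir/$sub" ]; then
            missing="$missing $sub"
        fi
    done
    if [ -n "$missing" ]; then
        # Prints e.g. "missing: indexes" and returns non-zero.
        echo "missing:$missing"
        return 1
    fi
    echo "ok"
}

# Example call, assuming the index store is at ~/kkk as above:
#   check_crawl_dir "$HOME/kkk"
```

The function only defines a check; nothing runs until it is called with a directory.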
Steps
    /opt/tomcat/bin/catalina.sh stop
    mkdir /opt/tomcat/webapps/0311test/
    cp /opt/nutch/nutch-1.2.war /opt/tomcat/webapps/0311test
    cd /opt/tomcat/webapps/0311test/
    jar xvf ./nutch-1.2.war
    rm nutch-1.2.war; cp -rf ~/kkk ./crawl
    /opt/tomcat/bin/catalina.sh start
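According to the official tutorial (http://wiki.apache.org/nutch/NutchTutorial), the trick is that the directory you are in when you run /opt/tomcat/bin/catalina.sh start must contain a folder named crawl; only then does the Nutch search map correctly to the index store.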
Then visit: http://localhost:8080/0311test
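Because the working directory at startup decides which crawl folder is used, one way to avoid starting Tomcat from the wrong place is a small wrapper. This is a sketch under stated assumptions: start_in_webapp_dir is a hypothetical helper name, and it simply refuses to run the given command unless the target directory contains crawl.

```shell
#!/bin/sh
# Hypothetical helper: run a start command from inside a webapp directory,
# but only if that directory contains a crawl folder.
start_in_webapp_dir() {
    webapp_dir=$1
    shift
    if [ ! -d "$webapp_dir/crawl" ]; then
        echo "no crawl directory in $webapp_dir" >&2
        return 1
    fi
    # Use a subshell so the caller's working directory is left unchanged.
    ( cd "$webapp_dir" && "$@" )
}

# Example call, using the paths from the steps above:
#   start_in_webapp_dir /opt/tomcat/webapps/0311test /opt/tomcat/bin/catalina.sh start
```

The command to run is passed as the remaining arguments, so the same wrapper works with startup.sh or catalina.sh start.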
NutchBean verification
The official site describes a way to verify the index store with NutchBean. The original text (http://wiki.apache.org/nutch/NutchTutorial) says only:
Simplest way to verify the integrity of your crawl is to launch NutchBean from command line: bin/nutch org.apache.nutch.searcher.NutchBean apache where apache is the search term (note that NutchBean will only search pages in the crawl directory, so if you named the crawl directory something else, NutchBean will not find any results). After you have verified that the above command returns results you can proceed to setting up the web interface.
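The trick, however, is to run it as:
bin/nutch org.apache.nutch.searcher.NutchBean [search term] [index directory on HDFS]
So when you run this program, the four Hadoop daemons must already be running and the index store you want to search must already be on HDFS; otherwise the search finds nothing.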
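A quick way to check the first condition is to compare the output of the JDK's jps tool against the daemons you expect. The sketch below is hypothetical: missing_daemons is an illustrative name, and the four daemon names in the example call (NameNode, DataNode, JobTracker, TaskTracker) are an assumption about what the "four daemons" are on this setup, since the text above does not name them.

```shell
#!/bin/sh
# Hypothetical helper: read "PID ClassName" lines (the format jps prints)
# on stdin and report which of the expected names are absent.
missing_daemons() {
    running=$(awk '{print $2}')
    missing=""
    for name in "$@"; do
        # -x matches whole lines only, so SecondaryNameNode does not
        # count as NameNode; -F treats the name as a fixed string.
        if ! printf '%s\n' "$running" | grep -qxF "$name"; then
            missing="$missing $name"
        fi
    done
    # Print the missing names (empty line if none) and
    # return success only when nothing is missing.
    echo "${missing# }"
    [ -z "$missing" ]
}

# Example call (daemon names are an assumption, adjust to your cluster):
#   jps | missing_daemons NameNode DataNode JobTracker TaskTracker
```

This only confirms that the processes exist; it does not check that the index store has been uploaded to HDFS, which the hadoop dfs -ls listing shows.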
    waue@u1004:/opt/tomcat/webapps/0311test$ /opt/nutch/bin/hadoop dfs -ls
    Found 14 items
    drwxr-xr-x - waue supergroup 0 2010-11-24 19:26 /user/waue/crawlbek
    drwxr-xr-x - waue supergroup 0 2010-11-25 09:47 /user/waue/ftp1
    drwxr-xr-x - waue supergroup 0 2010-11-26 15:55 /user/waue/t-hfil2
    drwxr-xr-x - waue supergroup 0 2010-11-25 18:01 /user/waue/t-hfilter
    drwxr-xr-x - waue supergroup 0 2010-11-26 16:13 /user/waue/url
    waue@u1004:/opt/tomcat/webapps/0311test$ /opt/nutch/bin/nutch org.apache.nutch.searcher.NutchBean nchc crawlbek
    Total hits: 249
    0 20101124184700/http://www.nchc.org.tw/en/
     ... Reserved|Resolution 1024 * 768| webmaster@nchc.narl.org.tw Latest Update ... th ~ December 10 th , 2010@ NCHC, Taiwan More Southeast Asia International ...
    1 20101124184929/http://www.nchc.org.tw/en/e_paper/
     ... to Cloud Computing Issue 19:NCHC Establishes a Cloud ... HPC Research - The NCHC’s All New GPU Cluster ...
    2 20101124184929/http://www.nchc.org.tw/en/about/publication/message/2010_spring.php
     ... Collaborative Research Applied Sciences ::: About NCHC Home » About NCHC » Publications » NCHC Newsletter » NCHC Newsletter Spring, 2010, Issue NO ...
    3 20101124184929/http://www.nchc.org.tw/en/about/
     ... Collaborative Research Applied Sciences ::: About NCHC Home » About NCHC With Taiwan's most bountiful ... Reserved|Resolution 1024 * 768| webmaster@
    4 20101124184929/http://www.nchc.org.tw/en/about/job.php
     ... Collaborative Research Applied Sciences ::: About NCHC Home » About NCHC » Jobs at NCHC If you would like to ...
    5 20101124185839/http://www.nchc.org.tw/en/about/publication/message/
     ... Collaborative Research Applied Sciences ::: About NCHC Home » About NCHC » Publications » NCHC Newsletter NCHC Newsletter 2009 Spring 2009Summer 2009 ...
    6 20101124184929/http://bioinfo.nchc.org.tw/
     Bioinformatics Knowledge Database 國網中心生物知識庫與生物計算服務 ...
    7 20101124184929/http://ecogrid.nchc.org.tw/
     ... were picked as show cases. NCHC Ecogrid team provided the ... into database in NCHC, consumers can query by a ...
    8 20101124184929/http://www.nchc.org.tw/en/research/list.php
     ... Director History Publications Jobs at NCHC Driving Directions HPC Services Educational ... Wed, November 24, 2010 ::: About NCHC Areas of Service ...
    9 20101124184929/http://accta.nchc.org.tw/en/
     ACCTA | Login | Home | 中文 | To protect your account and password security, please click ...