- Backup of the ca log
- Based on yesterday's crawl results, things did not go as expected: although the total number of jobs was higher than before, very little data can actually be searched, to say nothing of document and PDF files, none of which were crawled at all.
How can I recover an aborted fetch process?

Well, you cannot. However, you have two choices to proceed. One is to mark the aborted segment as finished and continue from it:

% touch /index/segments/2005somesegment/fetcher.done
% bin/nutch updatedb /index/db/ /index/segments/2005somesegment/
% bin/nutch generate /index/db/ /index/segments/2005somesegment/
% bin/nutch fetch /index/segments/2005somesegment

How can I fetch only some sites at a time?

# Use -topN to limit the total number of pages fetched.
# Use -numFetchers to generate multiple small segments.
# Now you can generate new segments. You may want to use -adddays so that bin/nutch generate puts all the URLs into the new fetchlist again; add more than 7 days if you did not run updatedb.
# Or send the process a Unix STOP signal. You should be able to index the already-fetched part of the segment, and later send a CONT signal to resume the process. Do not turn off your computer in between! :)

How many concurrent threads should I use?

Raise the open-file limit to 65535:

% ulimit -n 65535

How can I force fetcher to use a custom nutch config?

# Create a new sub-directory under $NUTCH_HOME/conf, like conf/myconfig.
# Copy these files from $NUTCH_HOME/conf to the new directory: common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml, regex-normalize.xml, regex-urlfilter.txt.
# Modify the nutch-default.xml to suit your needs.
# Set the NUTCH_CONF_DIR environment variable to point to the directory you created.
# Run $NUTCH_HOME/bin/nutch so that it picks up the NUTCH_CONF_DIR environment variable. Check the command output for the lines where the configs are loaded, to make sure they are really loaded from your custom directory.

bin/nutch generate generates an empty fetchlist, what can I do?

Call bin/nutch generate with -adddays 30 (if you haven't changed the default settings) to make generate think the time has come. After generate you can call bin/nutch fetch.

How can I fetch pages that require authentication?

See HttpAuthenticationSchemes.

Is it possible to change the list of common words without crawling everything again?

Yes. The list of common words is used only when indexing and searching, and not during the other steps.

How do I index my local file system?

1) crawl-urlfilter.txt needs a change to allow file: URLs while not following http: ones, otherwise it either won't index anything, or it will jump off your disk onto web sites. Change this line:

-^(file|ftp|mailto|https):

to this:

-^(http|ftp|mailto|https):

2) crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it is probably OK:

# accept anything else
+.*

3) By default the file plugin is disabled. nutch-site.xml needs to be modified to enable this plugin. Add an entry like this:

<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>

Fetched documents are truncated - what is happening?

By default, the size of the documents downloaded by Nutch is limited to 65536 bytes. To allow Nutch to download larger files (via HTTP), modify nutch-site.xml and add an entry like this:

<property>
<name>http.content.limit</name>
<value>150000</value>
</property>

If you do not want to limit the size of downloaded documents, set http.content.limit to a negative value:

<property>
<name>http.content.limit</name>
<value>-1</value>
</property>

How can I find out/display the size and MIME type of the hits that a search returns?

Enable the index-more and query-more plugins in plugin.includes:

<property>
<name>plugin.includes</name>
<value>...|index-more|...|query-more|...</value>
...
</property>
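To make the generate/fetch options above concrete, here is a minimal sketch of one throttled generate/fetch/updatedb round, reusing the /index/db and /index/segments paths from the examples above; the segment name 2008somesegment is a placeholder for whatever generate actually creates, and option behaviour can differ between Nutch versions:

% bin/nutch generate /index/db/ /index/segments/ -topN 1000 -numFetchers 4
% bin/nutch fetch /index/segments/2008somesegment
% bin/nutch updatedb /index/db/ /index/segments/2008somesegment/
% bin/nutch generate /index/db/ /index/segments/ -topN 1000 -adddays 30

To pause a running fetcher instead, send it a STOP signal and later resume it with CONT (pid stands for the fetcher's process id):

% kill -STOP pid
% kill -CONT pid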
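The custom-config steps can be scripted roughly as follows; conf/myconfig is just the example name from the steps above:

% mkdir $NUTCH_HOME/conf/myconfig
% cd $NUTCH_HOME/conf
% cp common-terms.utf8 mime-types.* nutch-conf.xsl nutch-default.xml regex-normalize.xml regex-urlfilter.txt myconfig/
% export NUTCH_CONF_DIR=$NUTCH_HOME/conf/myconfig
% $NUTCH_HOME/bin/nutch

Run your usual nutch command on the last line and check its output to confirm that the configs really are loaded from the custom directory.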
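The nutch-site.xml fragments above can be collected into one override file. A minimal sketch, assuming conf/nutch-site.xml is otherwise empty (merge by hand if not); the combined plugin list is illustrative, and old 0.7-era configs used <nutch-conf> as the root element instead of <configuration>:

% cat > $NUTCH_HOME/conf/nutch-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
<property>
<name>plugin.includes</name>
<value>protocol-file|protocol-http|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)</value>
</property>
<property>
<name>http.content.limit</name>
<value>-1</value>
</property>
</configuration>
EOF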
Nutch doesn't crawl relative URLs? Some pages are not indexed but my regex file and everything else is okay - what is going on?

The crawl tool has a default limit of 100 outlinks per page that are fetched. To remove the limit, set db.max.outlinks.per.page to -1:

<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
<description></description>
</property>
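After raising the limit and re-crawling, you can sanity-check the db from the shell; a sketch reusing the /index/db path from the earlier examples (-stats exists in both the old and new readdb tools, though its output differs by version):

% bin/nutch readdb /index/db/ -stats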
- Some of the information above is useful, but it will not necessarily solve the problems we hit. At the very least we can look at crawl.log to see which URLs it fetched and indexed.
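For example (a sketch; it assumes fetcher log lines contain the word "fetching" before each URL, which may vary by Nutch version):

% grep fetching crawl.log        # which URLs were fetched
% grep -i indexing crawl.log     # which documents went into the index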