Changes between Version 3 and Version 4 of waue/2009/0408


Timestamp: Apr 8, 2009, 10:23:01 AM
Author: waue
 * Based on yesterday's crawl results, things did not go as expected: although the total number of jobs is higher than before, very little of the data can be searched, not to mention that document and PDF files were not crawled at all.
 * Once this problem is thoroughly solved today, I will post the complete data.
{{{
How can I recover an aborted fetch process?
Well, you cannot. However, you have two choices to proceed:
% touch /index/segments/2005somesegment/fetcher.done
% bin/nutch updatedb /index/db/ /index/segments/2005somesegment/
% bin/nutch generate /index/db/ /index/segments/2005somesegment/
% bin/nutch fetch /index/segments/2005somesegment

How can I fetch only some sites at a time?
# Use -topN to limit the total number of pages fetched.
# Use -numFetchers to generate multiple small segments.
# Now you can either generate new segments. You may want to use -adddays to allow bin/nutch generate to put all the urls into the new fetchlist again; add more than 7 days if you did not run an updatedb.
# Or send the process a Unix STOP signal. You should be able to index the already-fetched part of the segment. Then later send a CONT signal to the process. Do not turn off your computer in between! :)
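Taken together, a limited round might look like this (the paths and segment name are placeholders in the style of the commands above):

```shell
# Generate a small fetchlist: at most 1000 pages, split into 4 segments;
# -adddays 8 makes pages due again even if no updatedb was run
# (the default re-fetch interval is 7 days).
bin/nutch generate /index/db/ /index/segments/ -topN 1000 -numFetchers 4 -adddays 8

# Fetch one of the freshly generated segments.
bin/nutch fetch /index/segments/2005somesegment
```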

How many concurrent threads should I use?
Raise the open-file limit to 65535 first (ulimit -n 65535) before using a large number of threads, since each fetcher thread holds open sockets and file descriptors.
How can I force fetcher to use a custom nutch-config?
# Create a new sub-directory under $NUTCH_HOME/conf, like conf/myconfig
# Copy these files from $NUTCH_HOME/conf to the new directory: common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml, regex-normalize.xml, regex-urlfilter.txt
# Modify nutch-default.xml to suit your needs
# Set the NUTCH_CONF_DIR environment variable to point to the directory you created
# Run $NUTCH_HOME/bin/nutch so that it picks up the NUTCH_CONF_DIR environment variable. Check the command output for the lines where the configs are loaded, to verify they are really loaded from your custom directory.
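As a concrete sketch of the steps above (the directory name myconfig and the segment path are just examples):

```shell
# Steps 1-2: create the directory and copy the listed config files.
mkdir $NUTCH_HOME/conf/myconfig
cd $NUTCH_HOME/conf
cp common-terms.utf8 mime-types.* nutch-conf.xsl nutch-default.xml \
   regex-normalize.xml regex-urlfilter.txt myconfig/

# Step 3: edit myconfig/nutch-default.xml as needed. Then steps 4-5:
export NUTCH_CONF_DIR=$NUTCH_HOME/conf/myconfig
$NUTCH_HOME/bin/nutch fetch /index/segments/2005somesegment
```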

bin/nutch generate generates an empty fetchlist; what can I do?
Call bin/nutch generate with -adddays 30 (if you haven't changed the default settings) to make generate think the time for a re-fetch has come...
After generate you can call bin/nutch fetch.
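Put together (the segment name is a placeholder):

```shell
# Pretend 30 extra days have passed so the pages are due for fetching again.
bin/nutch generate /index/db/ /index/segments/ -adddays 30

# Then fetch the segment that generate just created.
bin/nutch fetch /index/segments/2005somesegment
```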

How can I fetch pages that require authentication?
See HttpAuthenticationSchemes.

Is it possible to change the list of common words without crawling everything again?
Yes. The list of common words is used only when indexing and searching, not during other steps.

How do I index my local file system?
1) crawl-urlfilter.txt needs a change to allow file: URLs while not following http: ones, otherwise it either won't index anything, or it'll jump off your disk onto web sites.
Change this line:
            -^(file|ftp|mailto|https):
to this:
            -^(http|ftp|mailto|https):
2) crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it's probably ok:
            # accept anything else
            +.*
3) By default the "file" plugin is disabled. nutch-site.xml needs to be modified to enable this plugin. Add an entry like this:
        <property>
                <name>plugin.includes</name>
                <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
        </property>
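With the filter and plugin changes in place, a local-disk crawl can then be started roughly like this (the seed directory and file: URL are made-up examples):

```shell
# Seed the crawl with a local directory; the file: scheme is now
# allowed by the modified crawl-urlfilter.txt.
mkdir urls
echo "file:///home/waue/docs/" > urls/seeds.txt

# One-shot crawl using the modified config (directory names are placeholders).
bin/nutch crawl urls -dir crawl-local -depth 3
```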