Changes between Version 3 and Version 4 of waue/2009/0408


Timestamp: Apr 8, 2009, 10:23:01 AM
Author: waue
 * Based on yesterday's crawl results, things did not go as expected: although the total number of jobs is higher than before, very little of the data can be searched, not to mention that document and PDF files were not crawled at all.
 * Once this problem is thoroughly solved today, I will post the complete data.
{{{
How can I recover an aborted fetch process?
Well, you cannot. However, you have two choices to proceed:
% touch /index/segments/2005somesegment/fetcher.done
% bin/nutch updatedb /index/db/ /index/segments/2005somesegment/
% bin/nutch generate /index/db/ /index/segments/2005somesegment/
% bin/nutch fetch /index/segments/2005somesegment

How can I fetch only some sites at a time?
# Use -topN to limit the total number of pages fetched.
# Use -numFetchers to generate multiple small segments.
# Now you can either generate new segments. You may want to use -adddays to allow bin/nutch generate to put all the urls into the new fetchlist again; add more than 7 days if you did not run an updatedb.
# Or send the process a Unix STOP signal. You should be able to index the already-fetched part of the segment. Then later send a CONT signal to the process. Do not turn off your computer in between! :)
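Taken together, a limited round might look like this (the paths and segment name are placeholders in the style of the commands above):

```shell
# Generate a small fetchlist: at most 1000 pages, split into 4 segments;
# -adddays 8 makes pages due again even if no updatedb was run
# (the default re-fetch interval is 7 days).
bin/nutch generate /index/db/ /index/segments/ -topN 1000 -numFetchers 4 -adddays 8

# Fetch one of the freshly generated segments.
bin/nutch fetch /index/segments/2005somesegment
```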

How many concurrent threads should I use?
Raise the open-file limit to 65535 first (ulimit -n 65535) before using a large number of threads, since each fetcher thread holds open sockets and file descriptors.
How can I force fetcher to use a custom nutch-config?
# Create a new sub-directory under $NUTCH_HOME/conf, like conf/myconfig
# Copy these files from $NUTCH_HOME/conf to the new directory: common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml, regex-normalize.xml, regex-urlfilter.txt
# Modify nutch-default.xml to suit your needs
# Set the NUTCH_CONF_DIR environment variable to point to the directory you created
# Run $NUTCH_HOME/bin/nutch so that it picks up the NUTCH_CONF_DIR environment variable. Check the command output for the lines where the configs are loaded, to verify they are really loaded from your custom directory.
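As a concrete sketch of the steps above (the directory name myconfig and the segment path are just examples):

```shell
# Steps 1-2: create the directory and copy the listed config files.
mkdir $NUTCH_HOME/conf/myconfig
cd $NUTCH_HOME/conf
cp common-terms.utf8 mime-types.* nutch-conf.xsl nutch-default.xml \
   regex-normalize.xml regex-urlfilter.txt myconfig/

# Step 3: edit myconfig/nutch-default.xml as needed. Then steps 4-5:
export NUTCH_CONF_DIR=$NUTCH_HOME/conf/myconfig
$NUTCH_HOME/bin/nutch fetch /index/segments/2005somesegment
```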

bin/nutch generate generates an empty fetchlist; what can I do?
Call bin/nutch generate with -adddays 30 (if you haven't changed the default settings) to make generate think the time for a re-fetch has come...
After generate you can call bin/nutch fetch.
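Put together (the segment name is a placeholder):

```shell
# Pretend 30 extra days have passed so the pages are due for fetching again.
bin/nutch generate /index/db/ /index/segments/ -adddays 30

# Then fetch the segment that generate just created.
bin/nutch fetch /index/segments/2005somesegment
```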

How can I fetch pages that require authentication?
See HttpAuthenticationSchemes.

Is it possible to change the list of common words without crawling everything again?
Yes. The list of common words is used only when indexing and searching, not during other steps.

How do I index my local file system?
1) crawl-urlfilter.txt needs a change to allow file: URLs while not following http: ones, otherwise it either won't index anything, or it'll jump off your disk onto web sites.
Change this line:
            -^(file|ftp|mailto|https):
to this:
            -^(http|ftp|mailto|https):
2) crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it's probably ok:
            # accept anything else
            +.*
3) By default the "file" plugin is disabled. nutch-site.xml needs to be modified to enable this plugin. Add an entry like this:
        <property>
                <name>plugin.includes</name>
                <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
        </property>
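With the filter and plugin changes in place, a local-disk crawl can then be started roughly like this (the seed directory and file: URL are made-up examples):

```shell
# Seed the crawl with a local directory; the file: scheme is now
# allowed by the modified crawl-urlfilter.txt.
mkdir urls
echo "file:///home/waue/docs/" > urls/seeds.txt

# One-shot crawl using the modified config (directory names are placeholders).
bin/nutch crawl urls -dir crawl-local -depth 3
```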