How can I fetch only some sites at a time?
# Use -topN to limit the total number of pages fetched in one round.
# Use -numFetchers to generate multiple small segments.
# Now you can either generate new segments. You may want to use -adddays so that bin/nutch generate puts all the URLs into the new fetchlist again; add more than 7 days if you have not run an updatedb. (See the command sketch after this list.)
# Or send the process a Unix STOP signal. You should be able to index the part of the segment that has already been fetched, and later send a CONT signal to resume the process. Do not turn off your computer in between! :)
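For example, a minimal sketch of both approaches; the crawl/crawldb and crawl/segments paths, the -topN value, and the segment name are only placeholders for your own setup:

bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -numFetchers 4
bin/nutch fetch crawl/segments/<newly generated segment>

kill -STOP <pid of the running fetcher>   (pause fetching and index what is already there)
kill -CONT <pid of the running fetcher>   (resume fetching later)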

How many concurrent threads should I use?
Raise the per-process open-file limit to 65535 (ulimit -n 65535) so that a large number of fetcher threads does not run out of file descriptors.
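The number of fetcher threads itself is set with the fetcher.threads.fetch property, for example in nutch-site.xml; a minimal sketch, where 100 is only an illustrative value:

<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
</property>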

How can I force fetcher to use custom nutch-config?
# Create a new sub-directory under $NUTCH_HOME/conf, like conf/myconfig
# Copy these files from $NUTCH_HOME/conf to the new directory: common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml, regex-normalize.xml, regex-urlfilter.txt
# Modify the nutch-default.xml to suit your needs
# Set the NUTCH_CONF_DIR environment variable to point to the directory you created
# Run $NUTCH_HOME/bin/nutch so that it picks up the NUTCH_CONF_DIR environment variable. Check the command output for the lines where the config files are loaded and verify that they really come from your custom directory. (A shell sketch of these steps follows below.)
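A minimal sketch of the steps above, assuming a Bourne-style shell; conf/myconfig is just an example name:

mkdir $NUTCH_HOME/conf/myconfig
cd $NUTCH_HOME/conf
cp common-terms.utf8 mime-types.* nutch-conf.xsl nutch-default.xml regex-normalize.xml regex-urlfilter.txt myconfig/
export NUTCH_CONF_DIR=$NUTCH_HOME/conf/myconfig
$NUTCH_HOME/bin/nutch fetch <segment>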

bin/nutch generate generates an empty fetchlist, what can I do?
Call bin/nutch generate with -adddays 30 (if you haven't changed the default settings) to make generate think the time has come to refetch the URLs.
After generate you can call bin/nutch fetch, for example as sketched below.
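For example, assuming the crawl database and segments live under crawl/ (adjust the paths to your own layout):

bin/nutch generate crawl/crawldb crawl/segments -adddays 30
bin/nutch fetch crawl/segments/<newly generated segment>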

How can I fetch pages that require Authentication?
See HttpAuthenticationSchemes.

Is it possible to change the list of common words without crawling everything again?
Yes. The list of common words is used only when indexing and searching, not during the other steps.

How do I index my local file system?
1) crawl-urlfilter.txt needs a change to allow file: URLs while not following http: ones, otherwise it either won't index anything, or it'll jump off your disk onto web sites.
Change this line:
-^(file|ftp|mailto|https):
to this:
-^(http|ftp|mailto|https):
2) crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment, it's probably OK:
# accept anything else
+.*
3) By default the file plugin is disabled. nutch-site.xml needs to be modified to enable this plugin. Add an entry like this:
<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
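With those three changes in place, a minimal sketch of a local crawl; the urls/ seed directory, the file: path, and the crawl options are only illustrative values:

mkdir urls
echo 'file:///home/user/documents/' > urls/seeds.txt
bin/nutch crawl urls -dir crawl -depth 2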