http.agent.name
user
Agent name sent in the HTTP 'User-Agent' request header. Must not be empty.
http.agent.description
MyTest
Further description of the crawler; this text is used in the User-Agent header.
http.agent.url
localhost
A URL to advertise in the User-Agent header.
http.agent.email
you@yous
An email address to advertise in the HTTP 'From' request header and the User-Agent header.
plugin.folders
/opt/nutchez/nutch/plugins
Directories where Nutch plugins are located.
plugin.includes
protocol-http|urlfilter-regex|parse-(text|html|ext|msexcel|mspowerpoint|msword|oo|pdf|rss|zip)|index-(more|basic|anchor)|query-(more|basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded.
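To see which plugin directories the plugin.includes value above would select, the expression can be tried against a few names. This is a minimal sketch, assuming the whole directory name must match the expression; the candidate names and the helper is_included are hypothetical and exist only for illustration.

```python
import re

# Value copied from the plugin.includes property above.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|"
    "parse-(text|html|ext|msexcel|mspowerpoint|msword|oo|pdf|rss|zip)|"
    "index-(more|basic|anchor)|query-(more|basic|site|url)|"
    "response-(json|xml)|summary-basic|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_dir: str, pattern: str = PLUGIN_INCLUDES) -> bool:
    """Assumption: a plugin is enabled only if its full directory name matches."""
    return re.fullmatch(pattern, plugin_dir) is not None


# Hypothetical directory names, not taken from the configuration.
candidates = ["protocol-http", "parse-pdf", "parse-js", "protocol-ftp", "index-basic"]
enabled = [name for name in candidates if is_included(name)]
print(enabled)
```

Under that assumption, parse-js and protocol-ftp are left out because neither appears in the alternation, so adding a plugin means extending the expression rather than only dropping a directory into plugin.folders.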
parse.plugin.file
parse-plugins.xml
The name of the file that defines the associations between
content-types and parsers.
db.max.outlinks.per.page
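-1
The maximum number of outlinks processed for a page. If this value is nonnegative (>=0), at most that many outlinks are processed; otherwise, all outlinks are processed.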
http.content.limit
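-1
The length limit for content downloaded over HTTP, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation is applied.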
indexer.mergeFactor
500
The factor that determines the frequency of Lucene segment
merges. This must not be less than 2. Higher values increase indexing
speed but also increase RAM usage and the number of open file handles
(which may lead to "Too many open files" errors).
NOTE: the "segments" here have nothing to do with Nutch segments; they
are a low-level data unit used by Lucene.
indexer.minMergeDocs
500
This number determines the minimum number of Lucene
Documents buffered in memory between Lucene segment merges. Larger
values increase indexing speed and increase RAM usage.
db.ignore.external.links
false
If true, outlinks leading from a page to external hosts
will be ignored. This is an effective way to limit the crawl to
the initially injected hosts without creating complex URLFilters.
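The effect of db.ignore.external.links can be sketched as a host comparison between a page and its outlinks. This is an illustration only: the function filter_outlinks and the example.org URLs are hypothetical, and the exact comparison Nutch performs may differ from the plain hostname equality assumed here.

```python
from urllib.parse import urlsplit


def filter_outlinks(page_url: str, outlinks: list[str], ignore_external: bool) -> list[str]:
    """Keep every outlink, or only same-host outlinks when ignore_external is true."""
    if not ignore_external:
        return list(outlinks)
    page_host = urlsplit(page_url).hostname
    # Assumption: "external" means the outlink's hostname differs from the page's.
    return [link for link in outlinks if urlsplit(link).hostname == page_host]


# Hypothetical page and outlinks.
page = "http://example.org/index.html"
links = [
    "http://example.org/about.html",
    "http://other.example.com/",
    "http://example.org/docs/",
]
kept = filter_outlinks(page, links, ignore_external=True)
print(kept)
```

With the value false, as configured above, all three links would be kept and the crawl can leave the injected hosts.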
file.content.limit
1000000
The length limit for downloaded content, in bytes.
If this value is nonnegative (>=0), content longer than it will be truncated;
otherwise, no truncation is applied.
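The truncation rule described for file.content.limit (and used with -1 for http.content.limit above) can be written out as follows. This is a minimal sketch of the stated rule, not Nutch's own code; apply_content_limit is a hypothetical helper.

```python
def apply_content_limit(content: bytes, limit: int) -> bytes:
    """Truncate content to `limit` bytes when limit >= 0; a negative limit disables truncation."""
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    return content


sample = b"0123456789"
truncated = apply_content_limit(sample, 4)    # nonnegative limit: cut to 4 bytes
unlimited = apply_content_limit(sample, -1)   # negative limit: returned unchanged
print(truncated, unlimited)
```

Note that a limit of 0 is nonnegative, so under this rule it discards all content rather than meaning "no limit".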