http.agent.name
    value: user
    description: HTTP 'User-Agent' request header.

http.agent.description
    value: MyTest
    description: Further description

http.agent.url
    value: localhost
    description: A URL to advertise in the User-Agent header.

http.agent.email
    value: you@yous
    description: An email address

plugin.folders
    value: /opt/nutch/plugins
    description: Directories where nutch plugins are located.

plugin.includes
    value: protocol-http|urlfilter-regex|parse-(text|html|ext|msexcel|mspowerpoint|msword|oo|pdf|rss|zip)|index-(more|basic|anchor)|query-(more|basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
    description: Regular expression naming plugin directory names

parse.plugin.file
    value: parse-plugins.xml
    description: The name of the file that defines the associations between content-types and parsers.

db.max.outlinks.per.page
    value: -1

http.content.limit
    value: -1

indexer.mergeFactor
    value: 500
    description: The factor that determines the frequency of Lucene segment merges. This must not be less than 2. Higher values increase indexing speed but also increase RAM usage and the number of open file handles (which may lead to "Too many open files" errors). NOTE: the "segments" here have nothing to do with Nutch segments; they are a low-level data unit used by Lucene.

indexer.minMergeDocs
    value: 500
    description: This number determines the minimum number of Lucene Documents buffered in memory between Lucene segment merges. Larger values increase indexing speed and increase RAM usage.

db.ignore.external.links
    value: false
    description: If true, outlinks leading from a page to external hosts will be ignored. This is an effective way to limit the crawl to include only initially injected hosts, without creating complex URLFilters.

file.content.limit
    value: 1000000
    description: The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, content is not truncated at all.
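
Two of the settings above describe rules that are easier to check with a small experiment: plugin.includes is a regular expression over plugin directory names, and file.content.limit truncates content only when the limit is nonnegative. The following Python sketch illustrates both rules. It is not Nutch code: the function names `is_plugin_included` and `apply_content_limit` are hypothetical, and the sketch assumes the expression must match the whole directory name rather than a prefix of it.

```python
import re

# Value of plugin.includes, copied from the configuration above.
PLUGIN_INCLUDES = (
    r"protocol-http|urlfilter-regex|"
    r"parse-(text|html|ext|msexcel|mspowerpoint|msword|oo|pdf|rss|zip)|"
    r"index-(more|basic|anchor)|query-(more|basic|site|url)|"
    r"response-(json|xml)|summary-basic|scoring-opic|"
    r"urlnormalizer-(pass|regex|basic)"
)


def is_plugin_included(directory_name: str, pattern: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the plugin directory name is selected by the pattern.

    Assumption: the whole name must match, so re.fullmatch is used.
    """
    return re.fullmatch(pattern, directory_name) is not None


def apply_content_limit(content: bytes, limit: int) -> bytes:
    """Apply the rule described for file.content.limit.

    A nonnegative limit truncates longer content to `limit` bytes;
    a negative limit (such as -1) leaves the content untouched.
    """
    if limit >= 0 and len(content) > limit:
        return content[:limit]
    return content


if __name__ == "__main__":
    for name in ("parse-pdf", "index-basic", "parse-js", "protocol-httpclient"):
        print(name, is_plugin_included(name))
    print(len(apply_content_limit(b"x" * 2_000_000, 1_000_000)))
    print(len(apply_content_limit(b"x" * 2_000_000, -1)))
```

Under the whole-name assumption, a directory such as protocol-httpclient is not selected by this expression, even though it begins with protocol-http. With the value -1, as set for http.content.limit above, the truncation rule leaves content at its full length.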