file.content.limit 65536 The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than the limit will be truncated; otherwise, no truncation is applied.
file.content.ignored true If true, no file content will be saved during fetch. This is probably what we want most of the time, since file:// URLs are meant to be local and we can always use them directly at the parsing and indexing stages. Otherwise file contents will be saved. NOT IMPLEMENTED YET.
http.agent.name HTTP 'User-Agent' request header. MUST NOT be empty: please set this to a single word uniquely related to your organization. NOTE: You should also check the related properties http.robots.agents, http.agent.description, http.agent.url, http.agent.email and http.agent.version, and set their values appropriately.
http.robots.agents * The agent strings we'll look for in robots.txt files, comma-separated, in decreasing order of precedence. You should put the value of http.agent.name as the first agent name, and keep the default * at the end of the list. E.g.: BlurflDev,Blurfl,*
http.robots.403.allow true Some servers return HTTP status 403 (Forbidden) if /robots.txt doesn't exist. This should probably mean that we are allowed to crawl the site nonetheless. If this is set to false, then such sites will be treated as forbidden.
http.agent.description Further description of our bot; this text is used in the User-Agent header. It appears in parentheses after the agent name.
http.agent.url A URL to advertise in the User-Agent header. This will appear in parentheses after the agent name. Custom dictates that this should be a URL of a page explaining the purpose and behavior of this crawler.
http.agent.email An email address to advertise in the HTTP 'From' request header and User-Agent header. A good practice is to mangle this address (e.g. 'info at example dot com') to avoid spamming.
http.agent.version Nutch-1.0 A version string to advertise in the User-Agent header.
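As a small illustration of the http.robots.agents advice above (agent name first, the default * last), the following sketch builds the comma-separated value. The helper name build_robots_agents and its parameters are hypothetical and not part of Nutch; it only shows one way to assemble the string.

```python
def build_robots_agents(agent_name, extra_agents=()):
    """Build a value for http.robots.agents.

    Hypothetical helper (not a Nutch API): puts agent_name first,
    then any extra agent strings, and keeps '*' at the end.
    """
    if not agent_name or not agent_name.strip():
        # http.agent.name MUST NOT be empty, per the description above.
        raise ValueError("http.agent.name must not be empty")
    agents = [agent_name.strip()]
    for name in extra_agents:
        name = name.strip()
        # Skip blanks, duplicates, and '*' (which is appended last).
        if name and name != "*" and name not in agents:
            agents.append(name)
    agents.append("*")
    return ",".join(agents)


print(build_robots_agents("BlurflDev", ["Blurfl"]))  # BlurflDev,Blurfl,*
```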
http.agent.host Name or IP address of the host on which the Nutch crawler would be running. Currently this is used by the 'protocol-httpclient' plugin.
http.timeout 10000 The default network timeout, in milliseconds.
http.max.delays 100 The number of times a thread will delay when trying to fetch a page. Each time it finds that a host is busy, it will wait fetcher.server.delay. After http.max.delays attempts, it will give up on the page for now.
http.content.limit 65536 The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than the limit will be truncated; otherwise, no truncation is applied.
http.proxy.host The proxy hostname. If empty, no proxy is used.
http.proxy.port The proxy port.
http.proxy.username Username for proxy. This will be used by 'protocol-httpclient' if the proxy server requests basic, digest and/or NTLM authentication. To use this, 'protocol-httpclient' must be present in the value of the 'plugin.includes' property. NOTE: For NTLM authentication, do not prefix the username with the domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
http.proxy.password Password for proxy. This will be used by 'protocol-httpclient' if the proxy server requests basic, digest and/or NTLM authentication. To use this, 'protocol-httpclient' must be present in the value of the 'plugin.includes' property.
http.proxy.realm Authentication realm for proxy. Do not define a value if a realm is not required or authentication should take place for any realm. NTLM does not use the notion of realms; for NTLM authentication, specify the domain name as the value of this property. To use this, 'protocol-httpclient' must be present in the value of the 'plugin.includes' property.
http.auth.file httpclient-auth.xml Authentication configuration file for the 'protocol-httpclient' plugin.
http.verbose false If true, HTTP will log more verbosely.
http.redirect.max 0 The maximum number of redirects the fetcher will follow when trying to fetch a page.
If set to negative or 0, the fetcher won't immediately follow redirected URLs; instead it will record them for later fetching.
http.useHttp11 false NOTE: at the moment this works only for protocol-httpclient. If true, use HTTP 1.1; if false, use HTTP 1.0.
ftp.username anonymous FTP login username.
ftp.password anonymous@example.com FTP login password.
ftp.content.limit 65536 The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than the limit will be truncated; otherwise, no truncation is applied. Caution: the classical FTP RFCs never define partial transfer and, in fact, some FTP servers out there do not handle client-side forced close-down very well. Our implementation tries its best to handle such situations smoothly.
ftp.timeout 60000 Default timeout for the FTP client socket, in milliseconds. Please also see ftp.keep.connection below.
ftp.server.timeout 100000 An estimate of FTP server idle time, in milliseconds. Typically it is 120000 milliseconds for many FTP servers out there. Better be conservative here. Together with ftp.timeout, it is used to decide if we need to delete (annihilate) the current ftp.client instance and force another ftp.client instance to start anew. This is necessary because a fetcher thread may not be able to obtain the next request from the queue in time (due to idleness) before our FTP client times out or the remote server disconnects. Used only when ftp.keep.connection is true (please see below).
ftp.keep.connection false Whether to keep the FTP connection. Useful if crawling the same host again and again. When set to true, it avoids connection, login and dir list parser setup for subsequent URLs. If it is set to true, however, you must make sure (roughly) that: (1) ftp.timeout is less than ftp.server.timeout, and (2) ftp.timeout is larger than (fetcher.threads.fetch * fetcher.server.delay). Otherwise there will be too many "delete client because idled too long" messages in thread logs.
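The two rough conditions for ftp.keep.connection can be checked with a few lines of arithmetic. This is only a sketch: the function name is hypothetical, and it assumes that fetcher.server.delay (given in seconds) must be converted to milliseconds before being compared with ftp.timeout, which the description does not state explicitly.

```python
def check_ftp_keep_connection(ftp_timeout_ms, ftp_server_timeout_ms,
                              threads_fetch, server_delay_s):
    """Return a list of violated conditions (empty if both hold).

    Hypothetical helper, not part of Nutch. Assumes the product
    fetcher.threads.fetch * fetcher.server.delay is in seconds and
    converts it to milliseconds for the comparison.
    """
    problems = []
    # Condition (1): ftp.timeout < ftp.server.timeout
    if not ftp_timeout_ms < ftp_server_timeout_ms:
        problems.append("ftp.timeout should be less than ftp.server.timeout")
    # Condition (2): ftp.timeout > fetcher.threads.fetch * fetcher.server.delay
    queue_wait_ms = threads_fetch * server_delay_s * 1000
    if not ftp_timeout_ms > queue_wait_ms:
        problems.append("ftp.timeout should be larger than "
                        "fetcher.threads.fetch * fetcher.server.delay")
    return problems


# Default values from this file: 60000 ms, 100000 ms, 10 threads, 5.0 s.
print(check_ftp_keep_connection(60000, 100000, 10, 5.0))  # []
```

With the defaults, 10 threads times a 5.0 second delay is 50000 ms, which is below the 60000 ms ftp.timeout, so both conditions hold.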
ftp.follow.talk false Whether to log the dialogue between our client and the remote server. Useful for debugging.
db.default.fetch.interval 30 (DEPRECATED) The default number of days between re-fetches of a page.
db.fetch.interval.default 2592000 The default number of seconds between re-fetches of a page (30 days).
db.fetch.interval.max 7776000 The maximum number of seconds between re-fetches of a page (90 days). After this period every page in the db will be re-tried, no matter what its status is.
db.fetch.schedule.class org.apache.nutch.crawl.DefaultFetchSchedule The implementation of fetch schedule. DefaultFetchSchedule simply adds the original fetchInterval to the last fetch time, regardless of page changes.
db.fetch.schedule.adaptive.inc_rate 0.4 If a page is unmodified, its fetchInterval will be increased by this rate. This value should not exceed 0.5, otherwise the algorithm becomes unstable.
db.fetch.schedule.adaptive.dec_rate 0.2 If a page is modified, its fetchInterval will be decreased by this rate. This value should not exceed 0.5, otherwise the algorithm becomes unstable.
db.fetch.schedule.adaptive.min_interval 60.0 Minimum fetchInterval, in seconds.
db.fetch.schedule.adaptive.max_interval 31536000.0 Maximum fetchInterval, in seconds (365 days). NOTE: this is limited by db.fetch.interval.max. Pages with a fetchInterval larger than db.fetch.interval.max will be fetched anyway.
db.fetch.schedule.adaptive.sync_delta true If true, try to synchronize with the time of page change by shifting the next fetchTime by a fraction (sync_rate) of the difference between the last modification time and the last fetch time.
db.fetch.schedule.adaptive.sync_delta_rate 0.3 See sync_delta for description. This value should not exceed 0.5, otherwise the algorithm becomes unstable.
db.update.additions.allowed true If true, updatedb will add newly discovered URLs; if false, only already existing URLs in the CrawlDb will be updated and no new URLs will be added.
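To make the db.fetch.schedule.adaptive.* settings more concrete, here is a minimal sketch of how inc_rate, dec_rate, min_interval and max_interval could interact. It assumes that "increased by this rate" means multiplying the interval by (1 + inc_rate) and "decreased by this rate" means multiplying by (1 - dec_rate); the function name is hypothetical, and the sync_delta adjustment is left out.

```python
def next_fetch_interval(interval, modified,
                        inc_rate=0.4, dec_rate=0.2,
                        min_interval=60.0, max_interval=31536000.0):
    """Return the next fetchInterval in seconds.

    Hypothetical sketch, not the Nutch implementation. Assumes a
    multiplicative update and clamps the result to
    [min_interval, max_interval]. sync_delta is not modelled.
    """
    if modified:
        interval *= (1.0 - dec_rate)   # page changed: re-fetch sooner
    else:
        interval *= (1.0 + inc_rate)   # page unchanged: re-fetch later
    return max(min_interval, min(interval, max_interval))


thirty_days = 2592000.0  # db.fetch.interval.default
print(next_fetch_interval(thirty_days, modified=False))
print(next_fetch_interval(thirty_days, modified=True))
```

Under these assumptions a 30-day interval grows to roughly 42 days for an unmodified page and shrinks to roughly 24 days for a modified one.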
db.ignore.internal.links true If true, when adding new links to a page, links from the same host are ignored. This is an effective way to limit the size of the link database, keeping only the highest quality links.
db.ignore.external.links false If true, outlinks leading from a page to external hosts will be ignored. This is an effective way to limit the crawl to include only initially injected hosts, without creating complex URLFilters.
db.score.injected 1.0 The score of new pages added by the injector.
db.score.link.external 1.0 The score factor for new pages added due to a link from another host relative to the referencing page's score. Scoring plugins may use this value to affect initial scores of external links.
db.score.link.internal 1.0 The score factor for pages added due to a link from the same host, relative to the referencing page's score. Scoring plugins may use this value to affect initial scores of internal links.
db.score.count.filtered false The score value passed to newly discovered pages is calculated as a fraction of the original page score divided by the number of outlinks. If this option is false, only the outlinks that passed URLFilters will count; if it's true, then all outlinks will count.
db.max.inlinks 10000 Maximum number of Inlinks per URL to be kept in LinkDb. If "invertlinks" finds more inlinks than this number, only the first N inlinks will be stored, and the rest will be discarded.
db.max.outlinks.per.page 100 The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.
db.max.anchor.length 100 The maximum number of characters permitted in an anchor.
db.fetch.retry.max 3 The maximum number of times a URL that has encountered recoverable errors is generated for fetch.
db.signature.class org.apache.nutch.crawl.MD5Signature The default implementation of a page signature.
Signatures created with this implementation will be used for duplicate detection and removal.
db.signature.text_profile.min_token_len 2 Minimum token length to be included in the signature.
db.signature.text_profile.quant_rate 0.01 Profile frequencies will be rounded down to a multiple of QUANT = (int)(QUANT_RATE * maxFreq), where maxFreq is the maximum token frequency. If maxFreq > 1 then QUANT will be at least 2, which means that for longer texts tokens with frequency 1 will always be discarded.
generate.max.per.host -1 The maximum number of URLs per host in a single fetchlist. -1 if unlimited.
generate.max.per.host.by.ip false If false, same host names are counted. If true, hosts' IP addresses are resolved and the same IPs are counted. WARNING: When set to true, Generator will create a lot of DNS lookup requests, rapidly. This may cause a DOS attack on remote DNS servers, not to mention increased external traffic and latency. For these reasons, when using this option it is required that a local caching DNS be used.
generate.update.crawldb false For highly-concurrent environments, where several generate/fetch/update cycles may overlap, setting this to true ensures that generate will create different fetchlists even without intervening updatedb runs, at the cost of running an additional job to update CrawlDB. If false, running generate twice without an intervening updatedb will generate identical fetchlists.
fetcher.server.delay 5.0 The number of seconds the fetcher will delay between successive requests to the same server.
fetcher.server.min.delay 0.0 The minimum number of seconds the fetcher will delay between successive requests to the same server. This value is applicable ONLY if fetcher.threads.per.host is greater than 1 (i.e. the host blocking is turned off).
fetcher.max.crawl.delay 30 If the Crawl-Delay in robots.txt is set to greater than this value (in seconds) then the fetcher will skip this page, generating an error report.
If set to -1 the fetcher will never skip such pages and will wait the amount of time retrieved from robots.txt Crawl-Delay, however long that might be.
fetcher.threads.fetch 10 The number of FetcherThreads the fetcher should use. This also determines the maximum number of requests that are made at once (each FetcherThread handles one connection).
fetcher.threads.per.host 1 This number is the maximum number of threads that should be allowed to access a host at one time.
fetcher.threads.per.host.by.ip true If true, then the fetcher will count threads by the IP address to which the URL's host name resolves. If false, only the host name will be used. NOTE: this should be set to the same value as "generate.max.per.host.by.ip"; the default settings are different only for reasons of backward compatibility.
fetcher.verbose false If true, fetcher will log more verbosely.
fetcher.parse true If true, fetcher will parse content.
fetcher.store.content true If true, fetcher will store content.
indexer.score.power 0.5 Determines the power of link analysis scores. Each page's boost is set to score^scorePower, where score is its link analysis score and scorePower is the value of this parameter. This is compiled into indexes, so, when this is changed, pages must be re-indexed for it to take effect.
indexer.max.title.length 100 The maximum number of characters of a title that are indexed.
indexer.max.tokens 10000 The maximum number of tokens that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. Note that this effectively truncates large documents, excluding from the index tokens that occur further in the document. If you know your source documents are large, be sure to set this value high enough to accommodate the expected size. If you set it to -1, then the only limit is your memory, but you should anticipate an OutOfMemoryError.
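The indexer.score.power formula (boost = score^scorePower) is easy to try out numerically. The sketch below only evaluates that formula; the function name is hypothetical and this is not how Nutch or Lucene expose it.

```python
def index_boost(score, score_power=0.5):
    """Boost derived from a link analysis score: score ** score_power.

    Hypothetical helper illustrating the formula in the
    indexer.score.power description; 0.5 is the default shown there.
    """
    return score ** score_power


# With the default power of 0.5 the boost is the square root of the score,
# so large scores are dampened: a score of 100 gives a boost of about 10.
for score in (1.0, 4.0, 100.0):
    print(score, index_boost(score))
```

Lower powers flatten the differences between pages, while a power of 1.0 would use the raw score as the boost.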
indexer.mergeFactor 50 The factor that determines the frequency of Lucene segment merges. This must not be less than 2. Higher values increase indexing speed but lead to increased RAM usage, and increase the number of open file handles (which may lead to "Too many open files" errors). NOTE: the "segments" here have nothing to do with Nutch segments; they are a low-level data unit used by Lucene.
indexer.minMergeDocs 50 This number determines the minimum number of Lucene Documents buffered in memory between Lucene segment merges. Larger values increase indexing speed and increase RAM usage.
indexer.maxMergeDocs 2147483647 This number determines the maximum number of Lucene Documents to be merged into a new Lucene segment. Larger values increase batch indexing speed and reduce the number of Lucene segments, which reduces the number of open file handles; however, this also decreases incremental indexing performance.
indexer.termIndexInterval 128 Determines the fraction of terms which Lucene keeps in RAM when searching, to facilitate random access. Smaller values use more memory but make searches somewhat faster. Larger values use less memory but make searches somewhat slower.
indexingfilter.order The order by which index filters are applied. If empty, all available index filters (as dictated by properties plugin-includes and plugin-excludes above) are loaded and applied in system defined order. If not empty, only named filters are loaded and applied in the given order. For example, if this property has value: org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter then BasicIndexingFilter is applied first, and MoreIndexingFilter second. Filter ordering might have an impact on the result if one filter depends on the output of another filter.
analysis.common.terms.file common-terms.utf8 The name of a file containing a list of common terms that should be indexed in n-grams.
searcher.dir crawl Path to root of crawl.
This directory is searched (in order) for either the file search-servers.txt, containing a list of distributed search servers, or the directory "index" containing merged indexes, or the directory "segments" containing segment indexes.
searcher.filter.cache.size 16 Maximum number of filters to cache. Filters can accelerate certain field-based queries, like language, document format, etc. Each filter requires one bit of RAM per page. So, with a 10 million page index, a cache size of 16 consumes two bytes per page, or 20MB.
searcher.filter.cache.threshold 0.05 Filters are cached when their term is matched by more than this fraction of pages. For example, with a threshold of 0.05, and 10 million pages, the term must match more than 1/20, or 500,000 pages. So, if out of 10 million pages, 50% of pages are in English, and 2% are in Finnish, then, with a threshold of 0.05, searches for "lang:en" will use a cached filter, while searches for "lang:fi" will score all 200,000 Finnish documents.
searcher.hostgrouping.rawhits.factor 2.0 A factor that is used to determine the number of raw hits initially fetched, before host grouping is done.
searcher.summary.context 5 The number of context terms to display preceding and following matching terms in a hit summary.
searcher.summary.length 20 The total number of terms to display in a hit summary.
searcher.max.hits -1 If positive, search stops after this many hits are found. Setting this to small, positive values (e.g., 1000) can make searches much faster. With a sorted index, the quality of the hits suffers little.
searcher.max.time.tick_count -1 If a positive value is defined here, limit search time for every request to this number of elapsed ticks (see the tick_length property below). The total maximum time for any search request will then be limited to tick_count * tick_length milliseconds. When search time is exceeded, partial results will be returned, and the total number of hits will be estimated.
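The arithmetic behind searcher.filter.cache.size and searcher.filter.cache.threshold can be reproduced with a short sketch. The function names are hypothetical; the only facts used are those stated in the descriptions (one bit per page per filter, and caching when a term matches more than the threshold fraction of pages).

```python
def filter_cache_bytes(cache_size, num_pages):
    """RAM used by a full filter cache, at one bit per page per filter."""
    return cache_size * num_pages // 8


def is_filter_cached(matching_pages, num_pages, threshold=0.05):
    """True if the term matches more than `threshold` of all pages."""
    return matching_pages > threshold * num_pages


pages = 10_000_000
print(filter_cache_bytes(16, pages))          # 20000000 bytes, i.e. 20MB
english = pages * 50 // 100                   # 5,000,000 pages
finnish = pages * 2 // 100                    # 200,000 pages
print(is_filter_cached(english, pages))       # True
print(is_filter_cached(finnish, pages))       # False
```

For 10 million pages the 0.05 threshold corresponds to 500,000 pages, so the English filter (5,000,000 pages) is cached while the Finnish one (200,000 pages) is not.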
searcher.max.time.tick_length 200 The number of milliseconds between ticks. Larger values reduce the timer granularity (precision). Smaller values bring more overhead.
searcher.num.handlers 10 The number of handlers for the distributed search server.
searcher.max.hits.per.page 1000 The maximum number of hits to show per page. -1 if unlimited. If the number of hits requested by the user (via the hitsPerPage parameter in the query string) is more than the value specified in this property, then this value is used as the number of hits per page.
urlnormalizer.order org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer Order in which normalizers will run. If any of these isn't activated it will be silently skipped. If other normalizers not on the list are activated, they will run in random order after the ones specified here are run.
urlnormalizer.regex.file regex-normalize.xml Name of the config file used by the RegexUrlNormalizer class.
urlnormalizer.loop.count 1 Optionally loop through normalizers several times, to make sure that all transformations have been performed.
mime.types.file tika-mimetypes.xml Name of the file in CLASSPATH containing filename extension and magic sequence to mime types mapping information.
mime.type.magic true Defines if the mime content type detector uses magic resolution.
plugin.folders plugins Directories where Nutch plugins are located. Each element may be a relative or absolute path. If absolute, it is used as is. If relative, it is searched for on the classpath.
plugin.auto-activation true Defines whether plugins that are not activated by the plugin.includes and plugin.excludes properties must be automatically activated if they are needed by some activated plugins.
plugin.includes protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic) Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need to include at least the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. In order to use HTTPS please enable protocol-httpclient, but be aware of possible intermittent problems with the underlying commons-httpclient library.
plugin.excludes Regular expression naming plugin directory names to exclude.
parse.plugin.file parse-plugins.xml The name of the file that defines the associations between content-types and parsers.
parser.character.encoding.default windows-1252 The character encoding to fall back to when no other information is available.
encodingdetector.charset.min.confidence -1 An integer between 0-100 indicating the minimum confidence value for charset auto-detection. Any negative value disables auto-detection.
parser.caching.forbidden.policy content If a site (or a page) requests through its robot metatags that it should not be shown as cached content, apply this policy. Currently three keywords are recognized: "none" ignores any "noarchive" directives. "content" doesn't show the content, but shows summaries (snippets). "all" doesn't show either content or summaries.
parser.html.impl neko HTML Parser implementation. Currently the following keywords are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
parser.html.form.use_action false If true, the HTML parser will collect URLs from form action attributes. This may lead to undesirable behavior (submitting empty forms during the next fetch cycle). If false, form action attributes will be ignored.
parser.html.outlinks.ignore_tags Comma separated list of HTML tags, from which outlinks shouldn't be extracted.
Nutch takes links from: a, area, form, frame, iframe, script, link, img. If you add any of those tags here, links from it won't be taken. Default is an empty list. A reasonable value for most people would probably be "img,script,link".
urlfilter.domain.file domain-urlfilter.txt Name of file on CLASSPATH containing either top level domains or hostnames used by the urlfilter-domain (DomainURLFilter) plugin.
urlfilter.regex.file regex-urlfilter.txt Name of file on CLASSPATH containing regular expressions used by the urlfilter-regex (RegexURLFilter) plugin.
urlfilter.automaton.file automaton-urlfilter.txt Name of file on CLASSPATH containing regular expressions used by the urlfilter-automaton (AutomatonURLFilter) plugin.
urlfilter.prefix.file prefix-urlfilter.txt Name of file on CLASSPATH containing URL prefixes used by the urlfilter-prefix (PrefixURLFilter) plugin.
urlfilter.suffix.file suffix-urlfilter.txt Name of file on CLASSPATH containing URL suffixes used by the urlfilter-suffix (SuffixURLFilter) plugin.
urlfilter.order The order by which URL filters are applied. If empty, all available URL filters (as dictated by properties plugin-includes and plugin-excludes above) are loaded and applied in system defined order. If not empty, only named filters are loaded and applied in the given order. For example, if this property has value: org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter then RegexURLFilter is applied first, and PrefixURLFilter second. Since all filters are AND'ed, filter ordering does not have an impact on the end result, but it may have performance implications, depending on the relative expensiveness of the filters.
scoring.filter.order The order in which scoring filters are applied. This may be left empty (in which case all available scoring filters will be applied in the order defined in plugin-includes and plugin-excludes), or a space separated list of implementation classes.
extension.clustering.hits-to-cluster 100 Number of snippets retrieved for the clustering extension if the clustering extension is available and the user requested results to be clustered.
extension.clustering.extension-name Use the specified online clustering extension. If empty, the first available extension will be used. The "name" here refers to an 'id' attribute of the 'implementation' element in the plugin descriptor XML file.
extension.ontology.extension-name Use the specified online ontology extension. If empty, the first available extension will be used. The "name" here refers to an 'id' attribute of the 'implementation' element in the plugin descriptor XML file.
extension.ontology.urls URLs of owl files, separated by spaces, such as http://www.example.com/ontology/time.owl http://www.example.com/ontology/space.owl http://www.example.com/ontology/wine.owl Or file:/ontology/time.owl file:/ontology/space.owl file:/ontology/wine.owl You have to make sure each URL is valid. By default, there is no owl file, so query refinement based on ontology is silently ignored.
query.url.boost 4.0 Used as a boost for url field in Lucene query.
query.anchor.boost 2.0 Used as a boost for anchor field in Lucene query.
query.title.boost 1.5 Used as a boost for title field in Lucene query.
query.host.boost 2.0 Used as a boost for host field in Lucene query.
query.phrase.boost 1.0 Used as a boost for phrase in Lucene query. Multiplied by the boost for the field the phrase is matched in.
query.cc.boost 0.0 Used as a boost for cc field in Lucene query.
query.type.boost 0.0 Used as a boost for type field in Lucene query.
query.site.boost 0.0 Used as a boost for site field in Lucene query.
query.tag.boost 1.0 Used as a boost for tag field in Lucene query.
lang.ngram.min.length 1 The minimum size of ngrams to use to identify language (must be between 1 and lang.ngram.max.length).
The larger the range between lang.ngram.min.length and lang.ngram.max.length, the better the identification, but the slower it is.
lang.ngram.max.length 4 The maximum size of ngrams to use to identify language (must be between lang.ngram.min.length and 4). The larger the range between lang.ngram.min.length and lang.ngram.max.length, the better the identification, but the slower it is.
lang.analyze.max.length 2048 The maximum number of bytes of data to use to identify the language (0 means full content analysis). The larger this value, the better the analysis, but the slower it is.
query.lang.boost 0.0 Used as a boost for lang field in Lucene query.
hadoop.job.history.user.location ${hadoop.log.dir}/history/user Hadoop 0.17.x comes with a default setting to create user logs inside the output path of the job. This breaks some Hadoop classes, which expect the output to contain only part-XXXXX files. This setting changes the output to a subdirectory of the regular log directory.
search.response.default.type xml The default response type returned if none is specified.
search.response.default.lang en The default response language if none is specified.
search.response.default.numrows 10 The default number of rows to return if none is specified.
search.response.default.dedupfield site The default dedup field if none is specified.
search.response.default.numdupes 1 The default number of duplicates returned if none is specified.
searcher.response.maxage 86400 The max age of a response, in seconds. Used in caching headers.
searcher.response.prettyprint true Whether the response output should be pretty-printed. Setting this to true makes debugging easier; false removes unneeded spaces and gives better throughput.