* Original article: [http://zolomon.javaeye.com/blog/378871]
* The configuration file covered here is the [http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.7/conf/nutch-default.xml Nutch 0.7] version.
* Notes:
 * This is primarily a translation (mainly of nutch-default.xml),
 * plus the author's own experience using Nutch,
 * plus material from the FAQ on the official Nutch wiki at http://wiki.apache.org/nutch/FAQ,
 * combined with earlier community write-ups on the Nutch configuration files.

nutch-default.xml:

{{{
#!xml
<property>
  <name>http.agent.name</name>
  <value>NutchCVS</value>
  <description>Our HTTP 'User-Agent' request header.</description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>NutchCVS,Nutch,*</value>
  <description>The agent strings we look for in robots.txt files, comma-separated, in decreasing order of precedence.</description>
</property>

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) when /robots.txt does not exist. This usually means we may still crawl the site. If this property is set to false, we treat such a site as forbidden and do not crawl it.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Nutch</value>
  <description>Also used in the User-Agent header. A further description of our bot; it appears in parentheses after the agent name.</description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://lucene.apache.org/nutch/bot.html</value>
  <description>Also used in the User-Agent header. It appears after the agent name and is simply a URL advertising information about the crawler.</description>
</property>

<property>
  <name>http.agent.email</name>
  <value>nutch-agent@lucene.apache.org</value>
  <description>An email address to advertise in the HTTP 'From' request header and the User-Agent header.</description>
</property>

<property>
  <name>http.agent.version</name>
  <value>0.7.2</value>
  <description>A version string to advertise in the User-Agent header.</description>
</property>

<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>http.max.delays</name>
  <value>3</value>
  <description>The number of times a page fetch may be deferred. Each time it finds that a host is busy, Nutch waits for fetcher.server.delay. After http.max.delays deferrals have occurred, the fetch of that page is abandoned.</description>
</property>

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes. If this value is non-negative (>=0), content longer than the limit is truncated; otherwise no truncation is applied.</description>
</property>

<property>
  <name>http.proxy.host</name>
  <value></value>
  <description>The proxy hostname. If empty, no proxy is used.</description>
</property>

<property>
  <name>http.proxy.port</name>
  <value></value>
  <description>The proxy port.</description>
</property>

<property>
  <name>http.verbose</name>
  <value>false</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>

<property>
  <name>http.redirect.max</name>
  <value>3</value>
  <description>The maximum number of redirects to follow when fetching. If a page has more redirects than this, the fetcher gives up on it and moves on to the next page.</description>
</property>

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes. If this value is non-zero, content longer than the limit is truncated; otherwise (zero or negative) nothing is truncated.</description>
</property>

<property>
  <name>file.content.ignored</name>
  <value>true</value>
  <description>If true, no file content is stored during fetching. This is usually what we want, since a file:// URL normally means the file is local and we can fetch and index it directly. Otherwise the file content is stored. !! NOT IMPLEMENTED YET !!</description>
</property>

<property>
  <name>ftp.username</name>
  <value>anonymous</value>
  <description>The FTP login username.</description>
</property>

<property>
  <name>ftp.password</name>
  <value>anonymous@example.com</value>
  <description>The FTP login password.</description>
</property>

<property>
  <name>ftp.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes. If this value is greater than zero, content longer than the limit is truncated; otherwise (zero or negative) nothing is truncated. Caution: the classic FTP RFCs never provided for partial transfers and, in practice, some FTP servers cannot handle a forced client-side close. We have tried hard to handle this case so that things run smoothly.</description>
</property>

<property>
  <name>ftp.timeout</name>
  <value>60000</value>
  <description>The default FTP client socket timeout, in milliseconds. See also the ftp.keep.connection property below.</description>
</property>

<property>
  <name>ftp.server.timeout</name>
  <value>100000</value>
  <description>An estimate of the FTP server's idle time, in milliseconds. For most FTP servers 120000 ms is typical; it is best to be conservative here. Together with ftp.timeout, it is used to decide whether we need to delete (kill) the current ftp.client instance and force a restart of another one. This is necessary because a fetcher thread may not issue its next request in time (it may sit idle) before the remote server disconnects the FTP client for idling too long. Only used when ftp.keep.connection (see below) is true.</description>
</property>

<property>
  <name>ftp.keep.connection</name>
  <value>false</value>
  <description>Whether to keep the FTP connection open. Useful when fetching from the same host over and over again. If set to true, it avoids repeating the connection, login and directory-list parser setup for subsequent URLs. If set to true, however, you must (roughly) make sure that: (1) ftp.timeout is smaller than ftp.server.timeout, and (2) ftp.timeout is larger than (fetcher.threads.fetch * fetcher.server.delay). Otherwise many "delete client because idled too long" messages will appear in the thread logs.</description>
</property>

<property>
  <name>ftp.follow.talk</name>
  <value>false</value>
  <description>Whether to log the dialogue between our client and the remote server. Useful for debugging.</description>
</property>

<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>The default number of days between re-fetches of a page.</description>
</property>

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding links for a new page, links from the same host are ignored. This is a very effective way to limit the size of the link database, keeping only the highest-quality links.</description>
</property>

<property>
  <name>db.score.injected</name>
  <value>1.0</value>
  <description>The score of new pages added by the injector.</description>
</property>

<property>
  <name>db.score.link.external</name>
  <value>1.0</value>
  <description>The score factor for new pages added due to a link from another host, relative to the referencing page's score.</description>
</property>

<property>
  <name>db.score.link.internal</name>
  <value>1.0</value>
  <description>The score factor for pages added due to a link from the same host, relative to the referencing page's score.</description>
</property>

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks per page that we will parse.</description>
</property>

<property>
  <name>db.max.anchor.length</name>
  <value>100</value>
  <description>The maximum length of anchor text.</description>
</property>

<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>The maximum number of retries when fetching a page.</description>
</property>

<property>
  <name>fetchlist.score.by.link.count</name>
  <value>true</value>
  <description>If true, set page scores on fetchlist entries based on log(number of anchors), instead of using original page scores. This results in prioritization of pages with many incoming links.</description>
</property>
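<!--
  A minimal sketch (not part of nutch-default.xml) of how the HTTP settings above are
  usually overridden: instead of editing this file, the same property elements are placed
  in conf/nutch-site.xml, which takes precedence. The agent name and the raised content
  limit below are hypothetical example values, not defaults or recommendations.

  <property>
    <name>http.agent.name</name>
    <value>MyTestCrawler</value>
    <description>Example: identify your own crawler instead of the stock NutchCVS string.</description>
  </property>

  <property>
    <name>http.content.limit</name>
    <value>262144</value>
    <description>Example: allow pages up to 256 KB instead of the default 64 KB.</description>
  </property>
-->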
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between successive requests to the same server.</description>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>The number of fetcher threads to use at once. This also determines the maximum number of requests that are made at once (each FetcherThread handles one connection).</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>The maximum number of threads allowed to fetch from a single host at once.</description>
</property>

<property>
  <name>fetcher.verbose</name>
  <value>false</value>
  <description>If true, the fetcher will log more verbosely.</description>
</property>

<property>
  <name>parser.threads.parse</name>
  <value>10</value>
  <description>The number of parser threads ParseSegment should use at once.</description>
</property>

<property>
  <name>io.sort.factor</name>
  <value>100</value>
  <description>The number of streams to merge at once while sorting files. This determines the number of open file handles.</description>
</property>

<property>
  <name>io.sort.mb</name>
  <value>100</value>
  <description>The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.</description>
</property>

<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
  <description>The size of buffer for use in sequence files. The size of this buffer should probably be a multiple of hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>local</value>
  <description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
</property>

<property>
  <name>ndfs.name.dir</name>
  <value>/tmp/nutch/ndfs/name</value>
  <description>Determines where on the local filesystem the NDFS name node should store the name table.</description>
</property>

<property>
  <name>ndfs.data.dir</name>
  <value>/tmp/nutch/ndfs/data</value>
  <description>Determines where on the local filesystem an NDFS data node should store its blocks.</description>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:8010</value>
  <description>The host and port that the MapReduce job tracker runs at.</description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/tmp/nutch/mapred/local</value>
  <description>The local directory where MapReduce stores temporary files related to tasks and jobs.</description>
</property>

<property>
  <name>indexer.score.power</name>
  <value>0.5</value>
  <description>Determines the power of link analysis scores. Each page's boost is set to score^scorePower, where score is its link analysis score and scorePower is the value of this parameter. This is compiled into indexes, so, when this is changed, pages must be re-indexed for it to take effect.</description>
</property>

<property>
  <name>indexer.boost.by.link.count</name>
  <value>true</value>
  <description>When true, scores for a page are multiplied by the log of the number of incoming links to the page.</description>
</property>

<property>
  <name>indexer.max.title.length</name>
  <value>100</value>
  <description>The maximum number of characters of a title that are indexed.</description>
</property>

<property>
  <name>indexer.max.tokens</name>
  <value>10000</value>
  <description>The maximum number of tokens that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. Note that this effectively truncates large documents, excluding from the index tokens that occur further in the document. If you know your source documents are large, be sure to set this value high enough to accommodate the expected size. If you set it to Integer.MAX_VALUE, then the only limit is your memory, but you should anticipate an OutOfMemoryError.</description>
</property>

<property>
  <name>indexer.mergeFactor</name>
  <value>50</value>
  <description>The factor that determines the frequency of Lucene segment merges. This must not be less than 2; higher values increase indexing speed but lead to increased RAM usage, and increase the number of open file handles (which may lead to "Too many open files" errors). NOTE: the "segments" here have nothing to do with Nutch segments; they are a low-level data unit used by Lucene.</description>
</property>

<property>
  <name>indexer.minMergeDocs</name>
  <value>50</value>
  <description>This number determines the minimum number of Lucene Documents buffered in memory between Lucene segment merges. Larger values increase indexing speed and increase RAM usage.</description>
</property>

<property>
  <name>indexer.maxMergeDocs</name>
  <value>2147483647</value>
  <description>This number determines the maximum number of Lucene Documents to be merged into a new Lucene segment. Larger values increase indexing speed and reduce the number of Lucene segments, which reduces the number of open file handles; however, this also increases RAM usage during indexing.</description>
</property>
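<!--
  A hedged tuning sketch (not part of nutch-default.xml): the fetcher and indexer
  properties above are also typically overridden in conf/nutch-site.xml. The values
  below are illustrative only; as described above, more fetch threads raise crawl
  throughput, and buffering more Documents speeds indexing at the cost of RAM.

  <property>
    <name>fetcher.threads.fetch</name>
    <value>20</value>
    <description>Example: run 20 concurrent fetch threads instead of the default 10.</description>
  </property>

  <property>
    <name>indexer.minMergeDocs</name>
    <value>500</value>
    <description>Example: buffer more Lucene Documents in RAM between segment merges.</description>
  </property>
-->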
<property>
  <name>indexer.termIndexInterval</name>
  <value>128</value>
  <description>Determines the fraction of terms which Lucene keeps in RAM when searching, to facilitate random access. Smaller values use more memory but make searches somewhat faster. Larger values use less memory but make searches somewhat slower.</description>
</property>

<property>
  <name>analysis.common.terms.file</name>
  <value>common-terms.utf8</value>
  <description>The name of a file containing a list of common terms that should be indexed in n-grams.</description>
</property>

<property>
  <name>searcher.dir</name>
  <value>.</value>
  <description>Path to the root of the index directories. This directory is searched (in order) for either the file search-servers.txt, containing a list of distributed search servers, or the directory "index" containing merged indexes, or the directory "segments" containing segment indexes.</description>
</property>

<property>
  <name>searcher.filter.cache.size</name>
  <value>16</value>
  <description>Maximum number of filters to cache. Filters can accelerate certain field-based queries, like language, document format, etc. Each filter requires one bit of RAM per page. So, with a 10 million page index, a cache size of 16 consumes two bytes per page, or 20MB.</description>
</property>

<property>
  <name>searcher.filter.cache.threshold</name>
  <value>0.05</value>
  <description>Filters are cached when their term is matched by more than this fraction of pages. For example, with a threshold of 0.05 and 10 million pages, the term must match more than 1/20 of the pages, or 500,000 pages. So, if out of 10 million pages 50% are in English and 2% are in Finnish, then, with a threshold of 0.05, searches for "lang:en" will use a cached filter, while searches for "lang:fi" will score all 200,000 Finnish documents.</description>
</property>

<property>
  <name>searcher.hostgrouping.rawhits.factor</name>
  <value>2.0</value>
  <description>A factor that is used to determine the number of raw hits initially fetched, before host grouping is done.</description>
</property>

<property>
  <name>searcher.summary.context</name>
  <value>5</value>
  <description>The number of context terms to display preceding and following matching terms in a hit summary.</description>
</property>

<property>
  <name>searcher.summary.length</name>
  <value>20</value>
  <description>The total number of terms to display in a hit summary.</description>
</property>

<property>
  <name>urlnormalizer.class</name>
  <value>org.apache.nutch.net.BasicUrlNormalizer</value>
  <description>Name of the class used to normalize URLs.</description>
</property>

<property>
  <name>urlnormalizer.regex.file</name>
  <value>regex-normalize.xml</value>
  <description>Name of the config file used by the RegexUrlNormalizer class.</description>
</property>

<property>
  <name>mime.types.file</name>
  <value>mime-types.xml</value>
  <description>Name of a file on the CLASSPATH containing filename-extension and magic-sequence to MIME type mapping information.</description>
</property>

<property>
  <name>mime.type.magic</name>
  <value>true</value>
  <description>Defines if the MIME content type detector uses magic resolution.</description>
</property>

<property>
  <name>ipc.client.timeout</name>
  <value>10000</value>
  <description>Defines the timeout for IPC calls in milliseconds.</description>
</property>

<property>
  <name>plugin.folders</name>
  <value>plugins</value>
  <description>Directories where Nutch plugins are located. Each element may be a relative or absolute path. If absolute, it is used as is. If relative, it is searched for on the classpath.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need to include at least the nutch-extensionpoints plugin. By default Nutch includes just crawling of HTML and plain text via HTTP, and the basic indexing and search plugins.</description>
</property>

<property>
  <name>plugin.excludes</name>
  <value></value>
  <description>Regular expression naming plugin directory names to exclude.</description>
</property>

<property>
  <name>parser.character.encoding.default</name>
  <value>windows-1252</value>
  <description>The character encoding to fall back to when no other information is available.</description>
</property>

<property>
  <name>parser.html.impl</name>
  <value>neko</value>
  <description>HTML parser implementation. Currently the following keywords are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.</description>
</property>
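<!--
  A sketch (not part of nutch-default.xml) of widening plugin.includes in
  conf/nutch-site.xml, for example to also crawl FTP sites and parse PDF files.
  The extra plugin names (protocol-ftp, parse-pdf) are assumed to be present under
  the plugins directory of a 0.7 install; check your own plugins folder before
  relying on them.

  <property>
    <name>plugin.includes</name>
    <value>nutch-extensionpoints|protocol-(http|ftp)|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
    <description>Example: additionally enable the FTP protocol plugin and the PDF parser.</description>
  </property>
-->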
<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
  <description>Name of the file on the CLASSPATH containing regular expressions used by the urlfilter-regex (RegexURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.prefix.file</name>
  <value>prefix-urlfilter.txt</value>
  <description>Name of the file on the CLASSPATH containing URL prefixes used by the urlfilter-prefix (PrefixURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.order</name>
  <value></value>
  <description>The order in which URL filters are applied. If empty, all available URL filters (as dictated by the plugin.includes and plugin.excludes properties above) are loaded and applied in system-defined order. If not empty, only the named filters are loaded and applied in the given order. For example, if this property has the value: org.apache.nutch.net.RegexURLFilter org.apache.nutch.net.PrefixURLFilter then RegexURLFilter is applied first, and PrefixURLFilter second. Since all filters are AND'ed, filter ordering does not affect the end result, but it may have performance implications, depending on the relative expensiveness of the filters.</description>
</property>

<property>
  <name>extension.clustering.hits-to-cluster</name>
  <value>100</value>
  <description>Number of snippets retrieved for the clustering extension, if the clustering extension is available and the user requested results to be clustered.</description>
</property>

<property>
  <name>extension.clustering.extension-name</name>
  <value></value>
  <description>Use the specified online clustering extension. If empty, the first available extension will be used. The "name" here refers to the 'id' attribute of the 'implementation' element in the plugin descriptor XML file.</description>
</property>

<property>
  <name>extension.ontology.extension-name</name>
  <value></value>
  <description>Use the specified online ontology extension. If empty, the first available extension will be used. The "name" here refers to the 'id' attribute of the 'implementation' element in the plugin descriptor XML file.</description>
</property>

<property>
  <name>extension.ontology.urls</name>
  <value></value>
  <description>URLs of OWL files, separated by spaces, such as http://www.example.com/ontology/time.owl http://www.example.com/ontology/space.owl http://www.example.com/ontology/wine.owl or file:/ontology/time.owl file:/ontology/space.owl file:/ontology/wine.owl. You have to make sure each URL is valid. By default there is no OWL file, so query refinement based on ontology is silently ignored.</description>
</property>

<property>
  <name>query.url.boost</name>
  <value>4.0</value>
  <description>Used as a boost for the url field in the Lucene query.</description>
</property>

<property>
  <name>query.anchor.boost</name>
  <value>2.0</value>
  <description>Used as a boost for the anchor field in the Lucene query.</description>
</property>

<property>
  <name>query.title.boost</name>
  <value>1.5</value>
  <description>Used as a boost for the title field in the Lucene query.</description>
</property>

<property>
  <name>query.host.boost</name>
  <value>2.0</value>
  <description>Used as a boost for the host field in the Lucene query.</description>
</property>

<property>
  <name>query.phrase.boost</name>
  <value>1.0</value>
  <description>Used as a boost for phrases in the Lucene query. Multiplied by the boost of the field in which the phrase is matched.</description>
</property>

<property>
  <name>lang.ngram.min.length</name>
  <value>1</value>
  <description>The minimum size of n-grams used to identify the language (must be between 1 and lang.ngram.max.length). The larger the range between lang.ngram.min.length and lang.ngram.max.length, the better the identification, but the slower it is.</description>
</property>

<property>
  <name>lang.ngram.max.length</name>
  <value>4</value>
  <description>The maximum size of n-grams used to identify the language (must be between lang.ngram.min.length and 4). The larger the range between lang.ngram.min.length and lang.ngram.max.length, the better the identification, but the slower it is.</description>
</property>

<property>
  <name>lang.analyze.max.length</name>
  <value>2048</value>
  <description>The maximum number of bytes of data used to identify the language (0 means analyze the full content). The larger this value, the better the analysis, but the slower it is.</description>
</property>
}}}
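As a rough usage sketch: local overrides normally go into conf/nutch-site.xml rather than into nutch-default.xml itself, using the same property elements; properties defined there take precedence. The root element name below follows the 0.7-era config files, so check the header of your own nutch-default.xml, and treat the concrete values (refetch interval, title boost) as placeholders rather than recommendations.

{{{
#!xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides, loaded after nutch-default.xml -->
<nutch-conf>

  <property>
    <name>db.default.fetch.interval</name>
    <value>7</value>
    <description>Example: re-fetch pages weekly instead of every 30 days.</description>
  </property>

  <property>
    <name>query.title.boost</name>
    <value>3.0</value>
    <description>Example: weight title matches more heavily than the 1.5 default.</description>
  </property>

</nutch-conf>
}}}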