Version 1 (modified by waue, 15 years ago)
- This configuration file is for Nutch 0.7.
- Notes:
- The material below is primarily a translation (mainly of nutch-default.xml),
- supplemented with the author's own experience using Nutch,
- plus content from the FAQ on the official Nutch wiki (http://wiki.apache.org/nutch/FAQ),
- combined with earlier community write-ups on the Nutch configuration file.
nutch-default.xml :
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<!--begin First, some notes describing how this document is meant to be used. begin-->
<!-- Do not modify this file directly. Instead, copy the entries you need (here "entry" means everything between <property> and </property>, minus the variable parts such as the <value> content) into nutch-site.xml and change their values there. If nutch-site.xml does not exist, create it. -->
<!--end The rendered style of a Nutch configuration file depends on which xsl stylesheet is specified, so do not be surprised if differently styled copies of this file appear online. The styles themselves are not described here; see the xsl files shipped in the Nutch archive. end-->

<!--begin Root element of the Nutch configuration file. begin-->
<nutch-conf>

<!--begin The properties are grouped into sections, each configuring one area, so the structure is easy to scan: to change something, go straight to the relevant section. For example, the "HTTP properties" block below holds the HTTP-related settings; FTP-related, searcher-related, and other settings follow later. begin-->

<!-- HTTP properties -->

<property>
  <name>http.agent.name</name>
  <value>NutchCVS</value>
  <description>Our HTTP 'User-Agent' request header.</description>
</property>
<!--end The author is not entirely sure what this property is for, but it is one of the three mandatory properties in the Nutch 1.0 configuration file. It may be used by Apache to collect information about Nutch users. end-->

<property>
  <name>http.robots.agents</name>
  <value>NutchCVS,Nutch,*</value>
  <description>The agent strings we will look for in robots.txt files, comma-separated, in decreasing order of precedence.</description>
</property>
<!--end Reading robots.txt is part of the search-engine protocol: our crawler honors the rules a site states in its robots.txt. For details on robots.txt, see http://www.robotstxt.org/. end-->

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) when /robots.txt does not exist. This usually means we may still crawl the site. If this property is set to false, we treat such a site as forbidden and do not crawl it.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Nutch</value>
  <description>Also used in the User-Agent header: a further description of the bot. This string appears in parentheses after the agent name.</description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://lucene.apache.org/nutch/bot.html</value>
  <description>Also used in the User-Agent header: a URL, appearing after the agent name, that advertises the crawler.</description>
</property>

<property>
  <name>http.agent.email</name>
  <value>nutch-agent@lucene.apache.org</value>
  <description>An advertised email address, used in the HTTP 'From' request header and in the User-Agent header.</description>
</property>

<property>
  <name>http.agent.version</name>
  <value>0.7.2</value>
  <description>A version string advertised in the User-Agent header.</description>
</property>

<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>http.max.delays</name>
  <value>3</value>
  <description>The number of times a fetch of a page may be postponed. Each time Nutch finds a host busy, it delays for fetcher.server.delay. After http.max.delays postponements, the fetch of that page is given up.</description>
</property>

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The maximum length of downloaded content, in bytes. If the value is non-negative (>=0), anything beyond it is truncated; otherwise nothing is truncated.</description>
</property>
<!--end "Download" here does not mean manually clicking a download link for, say, an attachment, as beginners sometimes assume. Whenever we visit a web page, that page is downloaded over the network before the browser can display it, so opening or visiting any page always involves downloading it. end-->

<property>
  <name>http.proxy.host</name>
  <value></value>
  <description>The proxy hostname. If empty, no proxy is used.</description>
</property>

<property>
  <name>http.proxy.port</name>
  <value></value>
  <description>The proxy port.</description>
</property>

<property>
  <name>http.verbose</name>
  <value>false</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>
<!--end The exact effect is unclear and remains to be tried out; roughly, if the value is true, HTTP activity is logged at great length. end-->

<property>
  <name>http.redirect.max</name>
  <value>3</value>
  <description>The maximum number of redirects to follow when fetching. If a page has more redirects than this, the fetcher gives up on it and tries the next page.</description>
</property>

<!-- FILE properties -->

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The maximum length of downloaded content, in bytes. If the value is nonzero, anything beyond it is truncated; otherwise (zero or negative), nothing is truncated.</description>
</property>

<property>
  <name>file.content.ignored</name>
  <value>true</value>
  <description>If true, no file content is stored during fetch. This is usually what we want, since a file:// URL typically means the file is local and we can fetch and index it directly. Otherwise, file content is stored. !! NOT IMPLEMENTED YET !!</description>
</property>

<!-- FTP properties -->

<property>
  <name>ftp.username</name>
  <value>anonymous</value>
  <description>The FTP login username.</description>
</property>

<property>
  <name>ftp.password</name>
  <value>anonymous@example.com</value>
  <description>The FTP login password.</description>
</property>

<property>
  <name>ftp.content.limit</name>
  <value>65536</value>
  <description>The maximum length of file content, in bytes. If the value is positive, anything beyond it is truncated; otherwise (zero or negative), nothing is truncated. Caveat: the classical FTP RFCs never provide for partial transfers and, in practice, some FTP servers cannot handle the client closing the connection forcibly. We try hard to handle such situations so that things run smoothly.</description>
</property>

<property>
  <name>ftp.timeout</name>
  <value>60000</value>
  <description>The default FTP client socket timeout, in milliseconds. See also the ftp.keep.connection property below.</description>
</property>

<property>
  <name>ftp.server.timeout</name>
  <value>100000</value>
  <description>An estimate of the FTP server's idle time, in milliseconds. For most FTP servers 120000 ms is typical. It is better to be conservative here. Together with ftp.timeout, this is used to decide whether we need to delete (kill) the current ftp.client instance and force a restart of another one. That is necessary because a fetcher thread may not issue the next request in time (it may sit idle) before the remote FTP server disconnects it on timeout. Only used when ftp.keep.connection (see below) is true.</description>
</property>

<property>
  <name>ftp.keep.connection</name>
  <value>false</value>
  <description>Whether to keep the FTP connection open. Very useful when fetching from the same host over and over again. If true, it avoids repeating the connection, login, and directory-list parser setup for subsequent URLs. If set to true, you must make sure that: (1) ftp.timeout is less than ftp.server.timeout; (2) ftp.timeout is greater than (fetcher.threads.fetch * fetcher.server.delay). Otherwise the thread logs will fill with "delete client because idled too long" messages.</description>
</property>

<property>
  <name>ftp.follow.talk</name>
  <value>false</value>
  <description>Whether to log the dialogue between our client and the remote server. Useful for debugging.</description>
</property>

<!-- web db properties -->

<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>The default number of days between re-fetches of a page.</description>
</property>

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding links for a new page, links from the same host are ignored. This is an effective way to limit the size of the link database, keeping only the highest-quality link.</description>
</property>
<!--end This property has a strong effect on the quality of the result pages the search engine presents. end-->

<property>
  <name>db.score.injected</name>
  <value>1.0</value>
  <description>The score that new pages added by the injector receive.</description>
</property>

<property>
  <name>db.score.link.external</name>
  <value>1.0</value>
  <description>The score factor for new pages added due to a link from another host, relative to the referencing page's score.</description>
</property>

<property>
  <name>db.score.link.internal</name>
  <value>1.0</value>
  <description>The score factor for pages added due to a link from the same host, relative to the referencing page's score.</description>
</property>

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks we will parse from a single page.</description>
</property>

<property>
  <name>db.max.anchor.length</name>
  <value>100</value>
  <description>The maximum length of an anchor.</description>
</property>

<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>The maximum number of fetch retries.</description>
</property>

<!-- fetchlist tool properties -->

<property>
  <name>fetchlist.score.by.link.count</name>
  <value>true</value>
  <description>If true, set page scores on fetchlist entries based on log(number of anchors), instead of using original page scores. This results in prioritization of pages with many incoming links.</description>
</property>

<!-- fetcher properties -->

<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between successive requests to the same server.</description>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>The number of fetch threads used at once. This also determines the maximum number of requests that are made at once (each FetcherThread handles one connection).</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>The maximum number of threads allowed to fetch from the same host at once.</description>
</property>

<property>
  <name>fetcher.verbose</name>
  <value>false</value>
  <description>If true, the fetcher logs more verbosely.</description>
</property>

<!-- parser properties -->

<property>
  <name>parser.threads.parse</name>
  <value>10</value>
  <description>The number of parse threads ParseSegment should use at once.</description>
</property>

<!-- i/o properties -->

<property>
  <name>io.sort.factor</name>
  <value>100</value>
  <description>The number of streams to merge at once while sorting files. This determines the number of open file handles.</description>
</property>

<property>
  <name>io.sort.mb</name>
  <value>100</value>
  <description>The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.</description>
</property>

<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
  <description>The size of the buffer used in sequence files. It should probably be a multiple of the hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations.</description>
</property>

<!-- file system properties -->

<property>
  <name>fs.default.name</name>
  <value>local</value>
  <description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
</property>

<property>
  <name>ndfs.name.dir</name>
  <value>/tmp/nutch/ndfs/name</value>
  <description>Determines where on the local filesystem the NDFS name node should store the name table.</description>
</property>

<property>
  <name>ndfs.data.dir</name>
  <value>/tmp/nutch/ndfs/data</value>
  <description>Determines where on the local filesystem an NDFS data node should store its blocks.</description>
</property>

<!-- map/reduce properties -->

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:8010</value>
  <description>The host and port that the MapReduce job tracker runs at.</description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/tmp/nutch/mapred/local</value>
  <description>The local directory where MapReduce stores temporary files related to tasks and jobs.</description>
</property>

<!-- indexer properties -->

<property>
  <name>indexer.score.power</name>
  <value>0.5</value>
  <description>Determines the power of link analysis scores. Each page's boost is set to <i>score<sup>scorePower</sup></i> where <i>score</i> is its link analysis score and <i>scorePower</i> is the value of this parameter. This is compiled into indexes, so, when this is changed, pages must be re-indexed for it to take effect.</description>
</property>

<property>
  <name>indexer.boost.by.link.count</name>
  <value>true</value>
  <description>When true, scores for a page are multiplied by the log of the number of incoming links to the page.</description>
</property>

<property>
  <name>indexer.max.title.length</name>
  <value>100</value>
  <description>The maximum number of characters of a title that are indexed.</description>
</property>

<property>
  <name>indexer.max.tokens</name>
  <value>10000</value>
  <description>The maximum number of tokens that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. Note that this effectively truncates large documents, excluding from the index tokens that occur further in the document. If you know your source documents are large, be sure to set this value high enough to accommodate the expected size. If you set it to Integer.MAX_VALUE, then the only limit is your memory, but you should anticipate an OutOfMemoryError.</description>
</property>

<property>
  <name>indexer.mergeFactor</name>
  <value>50</value>
  <description>The factor that determines the frequency of Lucene segment merges. This must not be less than 2; higher values increase indexing speed but lead to increased RAM usage, and increase the number of open file handles (which may lead to "Too many open files" errors). NOTE: the "segments" here have nothing to do with Nutch segments; they are a low-level data unit used by Lucene.</description>
</property>

<property>
  <name>indexer.minMergeDocs</name>
  <value>50</value>
  <description>This number determines the minimum number of Lucene Documents buffered in memory between Lucene segment merges. Larger values increase indexing speed and increase RAM usage.</description>
</property>

<property>
  <name>indexer.maxMergeDocs</name>
  <value>2147483647</value>
  <description>This number determines the maximum number of Lucene Documents to be merged into a new Lucene segment. Larger values increase indexing speed and reduce the number of Lucene segments, which reduces the number of open file handles; however, this also increases RAM usage during indexing.</description>
</property>

<property>
  <name>indexer.termIndexInterval</name>
  <value>128</value>
  <description>Determines the fraction of terms which Lucene keeps in RAM when searching, to facilitate random access. Smaller values use more memory but make searches somewhat faster. Larger values use less memory but make searches somewhat slower.</description>
</property>

<!-- analysis properties -->

<property>
  <name>analysis.common.terms.file</name>
  <value>common-terms.utf8</value>
  <description>The name of a file containing a list of common terms that should be indexed in n-grams.</description>
</property>

<!-- searcher properties -->

<property>
  <name>searcher.dir</name>
  <value>.</value>
  <description>Path to the root of the index directories. This directory is searched (in order) for either the file search-servers.txt, containing a list of distributed search servers, or the directory "index", containing merged indexes, or the directory "segments", containing segment indexes.</description>
</property>

<property>
  <name>searcher.filter.cache.size</name>
  <value>16</value>
  <description>Maximum number of filters to cache. Filters can accelerate certain field-based queries, like language, document format, etc. Each filter requires one bit of RAM per page. So, with a 10 million page index, a cache size of 16 consumes two bytes per page, or 20MB.</description>
</property>

<property>
  <name>searcher.filter.cache.threshold</name>
  <value>0.05</value>
  <description>Filters are cached when their term is matched by more than this fraction of pages. For example, with a threshold of 0.05 and 10 million pages, the term must match more than 1/20 of them, or 500,000 pages. So, if out of 10 million pages 50% are in English and 2% are in Finnish, then, with a threshold of 0.05, searches for "lang:en" will use a cached filter, while searches for "lang:fi" will score all 200,000 Finnish documents.</description>
</property>

<property>
  <name>searcher.hostgrouping.rawhits.factor</name>
  <value>2.0</value>
  <description>A factor that is used to determine the number of raw hits initially fetched, before host grouping is done.</description>
</property>

<property>
  <name>searcher.summary.context</name>
  <value>5</value>
  <description>The number of context terms to display preceding and following matching terms in a hit summary.</description>
</property>

<property>
  <name>searcher.summary.length</name>
  <value>20</value>
  <description>The total number of terms to display in a hit summary.</description>
</property>

<!-- URL normalizer properties -->

<property>
  <name>urlnormalizer.class</name>
  <value>org.apache.nutch.net.BasicUrlNormalizer</value>
  <description>Name of the class used to normalize URLs.</description>
</property>

<property>
  <name>urlnormalizer.regex.file</name>
  <value>regex-normalize.xml</value>
  <description>Name of the config file used by the RegexUrlNormalizer class.</description>
</property>

<!-- mime properties -->

<property>
  <name>mime.types.file</name>
  <value>mime-types.xml</value>
  <description>Name of a file on the CLASSPATH containing the mapping from filename extensions and magic sequences to mime types.</description>
</property>

<property>
  <name>mime.type.magic</name>
  <value>true</value>
  <description>Defines whether the mime content type detector uses magic resolution.</description>
</property>

<!-- ipc properties -->

<property>
  <name>ipc.client.timeout</name>
  <value>10000</value>
  <description>Defines the timeout for IPC calls, in milliseconds.</description>
</property>

<!-- plugin properties -->

<property>
  <name>plugin.folders</name>
  <value>plugins</value>
  <description>Directories where Nutch plugins are located. Each element may be a relative or absolute path. If absolute, it is used as is. If relative, it is searched for on the classpath.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need to include at least the nutch-extensionpoints plugin. By default Nutch includes just crawling of HTML and plain text via HTTP, plus the basic indexing and search plugins.</description>
</property>

<property>
  <name>plugin.excludes</name>
  <value></value>
  <description>Regular expression naming plugin directory names to exclude.</description>
</property>

<property>
  <name>parser.character.encoding.default</name>
  <value>windows-1252</value>
  <description>The character encoding to fall back to when no other information is available.</description>
</property>

<property>
  <name>parser.html.impl</name>
  <value>neko</value>
  <description>HTML parser implementation. Currently the following keywords are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.</description>
</property>

<!-- urlfilter plugin properties -->

<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
  <description>Name of a file on the CLASSPATH containing the regular expressions used by the urlfilter-regex (RegexURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.prefix.file</name>
  <value>prefix-urlfilter.txt</value>
  <description>Name of a file on the CLASSPATH containing the URL prefixes used by the urlfilter-prefix (PrefixURLFilter) plugin.</description>
</property>

<property>
  <name>urlfilter.order</name>
  <value></value>
  <description>The order in which URL filters are applied. If empty, all available URL filters (as dictated by the plugin.includes and plugin.excludes properties above) are loaded and applied in system-defined order. If not empty, only the named filters are loaded and applied, in the given order. For example, if this property has the value: org.apache.nutch.net.RegexURLFilter org.apache.nutch.net.PrefixURLFilter then RegexURLFilter is applied first and PrefixURLFilter second. Since all filters are AND'ed, filter ordering does not affect the end result, but it may have performance implications, depending on the relative expensiveness of the filters.</description>
</property>

<!-- clustering extension properties -->

<property>
  <name>extension.clustering.hits-to-cluster</name>
  <value>100</value>
  <description>Number of snippets retrieved for the clustering extension, if the clustering extension is available and the user requested that results be clustered.</description>
</property>

<property>
  <name>extension.clustering.extension-name</name>
  <value></value>
  <description>Use the specified online clustering extension. If empty, the first available extension will be used. The "name" here refers to the 'id' attribute of the 'implementation' element in the plugin descriptor XML file.</description>
</property>

<!-- ontology extension properties -->

<property>
  <name>extension.ontology.extension-name</name>
  <value></value>
  <description>Use the specified online ontology extension. If empty, the first available extension will be used. The "name" here refers to the 'id' attribute of the 'implementation' element in the plugin descriptor XML file.</description>
</property>

<property>
  <name>extension.ontology.urls</name>
  <value> </value>
  <description>URLs of OWL files, separated by spaces, such as http://www.example.com/ontology/time.owl http://www.example.com/ontology/space.owl http://www.example.com/ontology/wine.owl or file:/ontology/time.owl file:/ontology/space.owl file:/ontology/wine.owl. You have to make sure each URL is valid. By default there is no OWL file, so query refinement based on ontology is silently ignored.</description>
</property>

<!-- query-basic plugin properties -->

<property>
  <name>query.url.boost</name>
  <value>4.0</value>
  <description>Used as a boost for the url field in a Lucene query.</description>
</property>

<property>
  <name>query.anchor.boost</name>
  <value>2.0</value>
  <description>Used as a boost for the anchor field in a Lucene query.</description>
</property>

<property>
  <name>query.title.boost</name>
  <value>1.5</value>
  <description>Used as a boost for the title field in a Lucene query.</description>
</property>

<property>
  <name>query.host.boost</name>
  <value>2.0</value>
  <description>Used as a boost for the host field in a Lucene query.</description>
</property>

<property>
  <name>query.phrase.boost</name>
  <value>1.0</value>
  <description>Used as a boost for a phrase in a Lucene query. Multiplied by the boost of the field the phrase is matched in.</description>
</property>

<!-- language-identifier plugin properties -->

<property>
  <name>lang.ngram.min.length</name>
  <value>1</value>
  <description>The minimum size of the n-grams used to identify the language (must be between 1 and lang.ngram.max.length). The larger the range between lang.ngram.min.length and lang.ngram.max.length, the better the identification, but the slower it is.</description>
</property>

<property>
  <name>lang.ngram.max.length</name>
  <value>4</value>
  <description>The maximum size of the n-grams used to identify the language (must be between lang.ngram.min.length and 4). The larger the range between lang.ngram.min.length and lang.ngram.max.length, the better the identification, but the slower it is.</description>
</property>

<property>
  <name>lang.analyze.max.length</name>
  <value>2048</value>
  <description>The maximum number of bytes of data used to identify the language (0 means full content analysis). The larger this value, the better the analysis, but the slower it is.</description>
</property>

</nutch-conf>
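
As the header comments note, nutch-default.xml should never be edited in place; overrides go in nutch-site.xml. A minimal sketch of such an override file, using the Nutch 0.7 `<nutch-conf>` root element seen above; the agent name and raised content limit below are illustrative placeholder values, not recommendations:

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: only the properties being overridden need to appear here.
     Anything not listed falls back to the value in nutch-default.xml. -->
<nutch-conf>
  <!-- Identify your crawler; "MyNutchBot" is a hypothetical example name. -->
  <property>
    <name>http.agent.name</name>
    <value>MyNutchBot</value>
  </property>
  <!-- Example override: allow pages up to 128 KB before truncation,
       instead of the 65536-byte default. -->
  <property>
    <name>http.content.limit</name>
    <value>131072</value>
  </property>
</nutch-conf>
```

Place the file in the same conf directory as nutch-default.xml; values defined here take precedence over the defaults documented above.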