Changes between Initial Version and Version 1 of waue/2010/0525


Timestamp:
May 25, 2010, 4:33:59 PM (14 years ago)
Author:
waue
     3 * Original article: [http://zolomon.javaeye.com/blog/378871]
     4
     5 * This configuration file is the [http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.7/conf/nutch-default.xml Nutch 0.7] version
     6
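     7 * Notes:
     8   * This is mainly a translation (chiefly of nutch-default.xml),
     9   * plus the author's own experience using Nutch,
     10   * plus material from the FAQ on the official Nutch wiki (http://wiki.apache.org/nutch/FAQ),
     11   * combined with earlier explanations of Nutch configuration files by other users.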
     12
     13 nutch-default.xml :
     14
     15{{{
     16#!xml
     17<?xml version="1.0"?>
     18<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
     19
     20
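     21<!--begin  First, some notes describing how this file should be used  begin-->
     22<!-- Do not modify this file directly. Instead, copy the entries you need into nutch-site.xml and change their values there. ("Entry" here means the content between <property></property>, or more precisely everything except variable parts such as <value></value>.) If nutch-site.xml does not exist, create it yourself. -->
     23<!--end  nutch-site.xml can be styled in several ways; pointing it at a different xsl selects a different style, so do not be surprised if Nutch configuration files found online look different. What each xsl produces is not described here; see the xsl files shipped in the Nutch archive.  end-->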
     24
     25
     26<!--begin Root element of the Nutch configuration file begin-->
     27<nutch-conf>
     28
     29
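     30<!--begin  The properties in the Nutch configuration file are organized in blocks, each configuring one group of properties, so the structure is easy to follow; to change something, go straight to the relevant block and look for the property there. For example, the HTTP properties block below holds the http-related settings, and later blocks hold the ftp-related settings, the searcher-related settings, and so on.  begin-->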
     31<!-- HTTP properties -->
     32
     33<property>
     34  <name>http.agent.name</name>
     35  <value>NutchCVS</value>
     36  <description>Our HTTP 'User-Agent' request header.</description>
     37</property>
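     38<!--end  The author is not entirely sure what this property is for, but it is one of the three required properties in the Nutch 1.0 configuration file. It may be used by Apache to collect information about Nutch users.  end-->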
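<!-- Illustrative sketch, not part of the original file: since this file should not be edited directly, an override of http.agent.name would go into nutch-site.xml. The value MyCrawler below is a hypothetical placeholder, not a recommended name.
<property>
  <name>http.agent.name</name>
  <value>MyCrawler</value>
</property>
-->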
     39
     40
     41
     42<property>
     43  <name>http.robots.agents</name>
     44  <value>NutchCVS,Nutch,*</value>
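     45  <description>The agent strings we look for in robots.txt files; there may be several,
     46  comma-separated, in decreasing order of precedence.</description>
     47</property>
     48<!--end  Reading robots.txt is part of the protocol that search engines follow, and our crawler honors the requirements stated in robots.txt. For more on robots.txt, see http://www.robotstxt.org/  end-->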
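<!-- Illustrative sketch for nutch-site.xml (assumption: the crawler has been given the hypothetical name MyCrawler). Listing that name first and the wildcard last would make robots.txt rules addressed to MyCrawler take precedence over the generic rules:
<property>
  <name>http.robots.agents</name>
  <value>MyCrawler,*</value>
</property>
-->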
     49
     50<property>
     51  <name>http.robots.403.allow</name>
     52  <value>true</value>
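     53  <description>Some servers return HTTP status 403 (Forbidden) when /robots.txt does not exist. This should usually mean that we may still crawl the site. If this property is set to false, we treat such a site as forbidden and do not crawl it.</description>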
     54</property>
     55
     56<property>
     57  <name>http.agent.description</name>
     58  <value>Nutch</value>
     59  <description>Also used in the User-Agent header. A further description of our bot; it (the string in this value) appears in parentheses after agent.name.
     60  </description>
     61</property>
     62
     63<property>
     64  <name>http.agent.url</name>
     65  <value>http://lucene.apache.org/nutch/bot.html</value>
     66  <description>Also used in the User-Agent header. It (the string in this value) appears in the string after agent.name; it is simply a URL to advertise.
     67  </description>
     68</property>
     69
     70<property>
     71  <name>http.agent.email</name>
     72  <value>nutch-agent@lucene.apache.org</value>
     73  <description>An email address to advertise in the HTTP 'From' request header and the User-Agent header.</description>
     74</property>
     75
     76<property>
     77  <name>http.agent.version</name>
     78  <value>0.7.2</value>
     79  <description>A version string to advertise in the User-Agent header.</description>
     80</property>
     81
     82<property>
     83  <name>http.timeout</name>
     84  <value>10000</value>
     85  <description>The default network timeout, in milliseconds.</description>
     86</property>
     87
     88<property>
     89  <name>http.max.delays</name>
     90  <value>3</value>
     91  <description>The number of times a page fetch may be deferred. Each time Nutch finds that a host is busy, it waits for fetcher.server.delay. After http.max.delays such deferrals, the fetch gives up on the page.</description>
     92</property>
     93
     94<property>
     95  <name>http.content.limit</name>
     96  <value>65536</value>
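     97  <description>The length limit for downloaded content, in bytes.
     98  If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, nothing is truncated.
     99  </description>
     100</property>
     101<!--end  "Download" here does not mean manually clicking to download a piece of software. Some beginners mistake this "download" for the case where a page offers a downloadable item (such as an attachment). What we mean is that whenever we visit a web page, the page is downloaded from the network so that it can be viewed in the browser; every time a page is opened or visited, that page is downloaded once.  end-->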
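<!-- Illustrative sketch for nutch-site.xml: going by the description above, a negative value should turn truncation off, while a larger positive value raises the cap. 1048576 (1 MiB) is an arbitrary example figure, not a tested recommendation.
<property>
  <name>http.content.limit</name>
  <value>1048576</value>
</property>
-->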
     102
     103<property>
     104  <name>http.proxy.host</name>
     105  <value></value>
     106  <description>The proxy hostname. If empty, no proxy is used.</description>
     107</property>
     108
     109<property>
     110  <name>http.proxy.port</name>
     111  <value></value>
     112  <description>The proxy port.</description>
     113</property>
     114
     115<property>
     116  <name>http.verbose</name>
     117  <value>false</value>
     118  <description>If true, HTTP will log more verbosely.</description>
     119</property>
     120<!--end  The exact effect is unclear and needs further testing. Roughly: if this value is true, HTTP activity is logged in much greater detail. end-->
     121
     122<property>
     123  <name>http.redirect.max</name>
     124  <value>3</value>
     125  <description>The maximum number of redirects to follow when fetching. If a page has more redirects than this, the fetcher moves on to the next page (abandoning this one).</description>
     126</property>
     127
     128<!-- FILE properties -->
     129
     130<property>
     131  <name>file.content.limit</name>
     132  <value>65536</value>
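     133  <description>The length limit for downloaded content, in bytes.
     134  If this value is greater than zero, content longer than it will be truncated; otherwise (zero or negative), nothing is truncated.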
     135  </description>
     136</property>
     137
     138<property>
     139  <name>file.content.ignored</name>
     140  <value>true</value>
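     141  <description>If true, no file content is stored during fetch.
     142  This is usually what we want, because a file:// URL normally means the file is local and we can always use it directly when parsing and indexing. Otherwise (if not true), file content is stored.
     143  !! NOT IMPLEMENTED YET !!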
     144  </description>
     145</property>
     146
     147<!-- FTP properties -->
     148
     149<property>
     150  <name>ftp.username</name>
     151  <value>anonymous</value>
     152  <description>The ftp login username.</description>
     153</property>
     154
     155<property>
     156  <name>ftp.password</name>
     157  <value>anonymous@example.com</value>
     158  <description>The ftp login password.</description>
     159</property>
     160
     161<property>
     162  <name>ftp.content.limit</name>
     163  <value>65536</value>
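     164  <description>The length limit for file content, in bytes.
     165  If this value is greater than zero, content longer than it will be truncated; otherwise (zero or negative), nothing is truncated. Caution: the classical
     166  ftp RFCs never defined partial transfer and, in fact, some ftp servers do not handle a forced close-down on the client side well.
     167  We try our best to handle such situations smoothly.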
     168  </description>
     169</property>
     170
     171<property>
     172  <name>ftp.timeout</name>
     173  <value>60000</value>
     174  <description>The default timeout for the ftp client socket, in milliseconds. Please also see the ftp.keep.connection property below.</description>
     175</property>
     176
     177<property>
     178  <name>ftp.server.timeout</name>
     179  <value>100000</value>
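     180  <description>An estimate of the ftp server idle time, in milliseconds. 120000 milliseconds is typical for many ftp servers.
     181  It is best to be conservative here. Together with ftp.timeout, it is used to decide whether we need to delete (kill) the current ftp.client instance and force another ftp.client instance to start. This is necessary because a fetcher thread may not obtain its next request in time (it may be idle)
     182  before our ftp client times out or the remote server disconnects. Used only when ftp.keep.connection (see below) is true.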
     183  </description>
     184</property>
     185
     186<property>
     187  <name>ftp.keep.connection</name>
     188  <value>false</value>
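     189  <description>Whether to keep the ftp connection open. Useful when crawling the same host again and again. If set to true, it avoids repeating the connection, login and directory-list parser setup for subsequent urls (the original says "setup", which here does not mean "install"). If set to true, however, you must make sure (roughly) that:
     190  (1) ftp.timeout is less than ftp.server.timeout
     191  (2) ftp.timeout is greater than (fetcher.threads.fetch * fetcher.server.delay)
     192  Otherwise a large number of "delete client because idled too long" messages will appear in the thread logs.</description>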
     193</property>
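<!-- Worked check of the two conditions using the defaults in this file (assumption: the product in condition (2) is in seconds, because fetcher.server.delay is given in seconds, and has to be converted to milliseconds before comparing):
  fetcher.threads.fetch * fetcher.server.delay = 10 * 5.0 s = 50 s = 50000 ms
  (1) ftp.timeout 60000 ms is less than ftp.server.timeout 100000 ms: holds
  (2) ftp.timeout 60000 ms is greater than 50000 ms: holds
So with the shipped defaults, setting ftp.keep.connection to true should satisfy both conditions. -->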
     194
     195<property>
     196  <name>ftp.follow.talk</name>
     197  <value>false</value>
     198  <description>Whether to log the dialogue between our client and the remote server. Useful for debugging.</description>
     199</property>
     200
     201<!-- web db properties -->
     202
     203<property>
     204  <name>db.default.fetch.interval</name>
     205  <value>30</value>
     206  <description>The default number of days between re-fetches of a page.
     207  </description>
     208</property>
     209
     210<property>
     211  <name>db.ignore.internal.links</name>
     212  <value>true</value>
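     213  <description>If true, when adding links to a new page, links from the same host are ignored. This is a very effective way to limit the size of the link database, keeping only the highest quality links.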
     214  </description>
     215</property>
     216<!--end  This property has a large effect on which pages the search engine presents  end-->
     217
     218<property>
     219  <name>db.score.injected</name>
     220  <value>1.0</value>
     221  <description>The score given to new pages added by the injector.
     222  </description>
     223</property>
     224<!--end    end-->
     225
     226<property>
     227  <name>db.score.link.external</name>
     228  <value>1.0</value>
     229  <description>The score factor for new pages added due to a link from
     230  another host, relative to the referencing page's score.
     231  </description>
     232</property>
     233
     234<property>
     235  <name>db.score.link.internal</name>
     236  <value>1.0</value>
     237  <description>The score factor for pages added due to a link from the
     238  same host, relative to the referencing page's score.
     239  </description>
     240</property>
     241
     242<property>
     243  <name>db.max.outlinks.per.page</name>
     244  <value>100</value>
     245  <description>The maximum number of outlinks from a single page that we will process.</description>
     246</property>
     247
     248<property>
     249  <name>db.max.anchor.length</name>
     250  <value>100</value>
     251  <description>The maximum length of an anchor (link text).</description>
     252</property>
     253
     254<property>
     255  <name>db.fetch.retry.max</name>
     256  <value>3</value>
     257  <description>The maximum number of retries when fetching.</description>
     258</property>
     259
     260<!-- fetchlist tool properties -->
     261
     262<property>
     263  <name>fetchlist.score.by.link.count</name>
     264  <value>true</value>
     265  <description>If true, set page scores on fetchlist entries based on
     266  log(number of anchors), instead of using original page scores. This
     267  results in prioritization of pages with many incoming links.
     268  </description>
     269</property>
     270
     271<!-- fetcher properties -->
     272
     273<property>
     274  <name>fetcher.server.delay</name>
     275  <value>5.0</value>
     276  <description>The number of seconds the fetcher will delay between
     277   successive requests to the same server.</description>
     278</property>
     279
     280<property>
     281  <name>fetcher.threads.fetch</name>
     282  <value>10</value>
     283  <description>The number of fetcher threads to use at the same time.
     284    This also determines the maximum number of requests that are
     285    made at once (each FetcherThread handles one connection).</description>
     286</property>
     287
     288<property>
     289  <name>fetcher.threads.per.host</name>
     290  <value>1</value>
     291  <description>The maximum number of threads allowed to fetch from a single host at the same time.</description>
     292</property>
     293
     294<property>
     295  <name>fetcher.verbose</name>
     296  <value>false</value>
     297  <description>If true, the fetcher will log more verbosely.</description>
     298</property>
     299
     300<!-- parser properties -->
     301<property>
     302  <name>parser.threads.parse</name>
     303  <value>10</value>
     304  <description>The number of parser threads ParseSegment should use at the same time.</description>
     305</property>
     306
     307<!-- i/o properties -->
     308
     309<property>
     310  <name>io.sort.factor</name>
     311  <value>100</value>
     312  <description>The number of streams to merge at once while sorting
     313  files.  This determines the number of open file handles.</description>
     314</property>
     315
     316<property>
     317  <name>io.sort.mb</name>
     318  <value>100</value>
     319  <description>The total amount of buffer memory to use while sorting
     320  files, in megabytes.  By default, gives each merge stream 1MB, which
     321  should minimize seeks.</description>
     322</property>
     323
     324<property>
     325  <name>io.file.buffer.size</name>
     326  <value>131072</value>
     327  <description>The size of buffer for use in sequence files.
     328  The size of this buffer should probably be a multiple of hardware
     329  page size (4096 on Intel x86), and it determines how much data is
     330  buffered during read and write operations.</description>
     331</property>
     332 
     333<!-- file system properties -->
     334
     335<property>
     336  <name>fs.default.name</name>
     337  <value>local</value>
     338  <description>The name of the default file system.  Either the
     339  literal string "local" or a host:port for NDFS.</description>
     340</property>
     341
     342<property>
     343  <name>ndfs.name.dir</name>
     344  <value>/tmp/nutch/ndfs/name</value>
     345  <description>Determines where on the local filesystem the NDFS name node
     346      should store the name table.</description>
     347</property>
     348
     349<property>
     350  <name>ndfs.data.dir</name>
     351  <value>/tmp/nutch/ndfs/data</value>
     352  <description>Determines where on the local filesystem an NDFS data node
     353      should store its blocks.</description>
     354</property>
     355
     356<!-- map/reduce properties -->
     357
     358<property>
     359  <name>mapred.job.tracker</name>
     360  <value>localhost:8010</value>
     361  <description>The host and port that the MapReduce job tracker runs at.
     362  </description>
     363</property>
     364
     365<property>
     366  <name>mapred.local.dir</name>
     367  <value>/tmp/nutch/mapred/local</value>
     368  <description>The local directory where MapReduce stores temporary files
     369      related to tasks and jobs.
     370  </description>
     371</property>
     372
     373<!-- indexer properties -->
     374
     375<property>
     376  <name>indexer.score.power</name>
     377  <value>0.5</value>
     378  <description>Determines the power of link analysis scores.  Each
     379  page's boost is set to <i>score<sup>scorePower</sup></i> where
     380  <i>score</i> is its link analysis score and <i>scorePower</i> is the
     381  value of this parameter.  This is compiled into indexes, so, when
     382  this is changed, pages must be re-indexed for it to take
     383  effect.</description>
     384</property>
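<!-- Worked example (the score values are made up for illustration): with the default scorePower of 0.5 the boost is the square root of the link analysis score, so a score of 4.0 gives a boost of 2.0 and a score of 100.0 gives a boost of 10.0. A scorePower of 1.0 would use the score unchanged, and 0.0 would give every page a boost of 1.0. -->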
     385
     386<property>
     387  <name>indexer.boost.by.link.count</name>
     388  <value>true</value>
     389  <description>When true, scores for a page are multiplied by the log of
     390  the number of incoming links to the page.</description>
     391</property>
     392
     393<property>
     394  <name>indexer.max.title.length</name>
     395  <value>100</value>
     396  <description>The maximum number of characters of a title that are indexed.
     397  </description>
     398</property>
     399
     400<property>
     401  <name>indexer.max.tokens</name>
     402  <value>10000</value>
     403  <description>
     404  The maximum number of tokens that will be indexed for a single field
     405  in a document. This limits the amount of memory required for
     406  indexing, so that collections with very large files will not crash
     407  the indexing process by running out of memory.
     408
     409  Note that this effectively truncates large documents, excluding
     410  from the index tokens that occur further in the document. If you
     411  know your source documents are large, be sure to set this value
     412  high enough to accommodate the expected size. If you set it to
     413  Integer.MAX_VALUE, then the only limit is your memory, but you
     414  should anticipate an OutOfMemoryError.
     415  </description>
     416</property>
     417
     418<property>
     419  <name>indexer.mergeFactor</name>
     420  <value>50</value>
     421  <description>The factor that determines the frequency of Lucene segment
     422  merges. This must not be less than 2, higher values increase indexing
     423  speed but lead to increased RAM usage, and increase the number of
     424  open file handles (which may lead to "Too many open files" errors).
     425  NOTE: the "segments" here have nothing to do with Nutch segments, they
     426  are a low-level data unit used by Lucene.
     427  </description>
     428</property>
     429
     430<property>
     431  <name>indexer.minMergeDocs</name>
     432  <value>50</value>
     433  <description>This number determines the minimum number of Lucene
     434  Documents buffered in memory between Lucene segment merges. Larger
     435  values increase indexing speed and increase RAM usage.
     436  </description>
     437</property>
     438
     439<property>
     440  <name>indexer.maxMergeDocs</name>
     441  <value>2147483647</value>
     442  <description>This number determines the maximum number of Lucene
     443  Documents to be merged into a new Lucene segment. Larger values
     444  increase indexing speed and reduce the number of Lucene segments,
     445  which reduces the number of open file handles; however, this also
     446  increases RAM usage during indexing.
     447  </description>
     448</property>
     449
     450<property>
     451  <name>indexer.termIndexInterval</name>
     452  <value>128</value>
     453  <description>Determines the fraction of terms which Lucene keeps in
     454  RAM when searching, to facilitate random-access.  Smaller values use
     455  more memory but make searches somewhat faster.  Larger values use
     456  less memory but make searches somewhat slower.
     457  </description>
     458</property>
     459
     460
     461<!-- analysis properties -->
     462
     463<property>
     464  <name>analysis.common.terms.file</name>
     465  <value>common-terms.utf8</value>
     466  <description>The name of a file containing a list of common terms
     467  that should be indexed in n-grams.</description>
     468</property>
     469
     470<!-- searcher properties -->
     471
     472<property>
     473  <name>searcher.dir</name>
     474  <value>.</value>
     475  <description>
     476  Path to root of index directories.  This directory is searched (in
     477  order) for either the file search-servers.txt, containing a list of
     478  distributed search servers, or the directory "index" containing
     479  merged indexes, or the directory "segments" containing segment
     480  indexes.
     481  </description>
     482</property>
     483
     484<property>
     485  <name>searcher.filter.cache.size</name>
     486  <value>16</value>
     487  <description>
     488  Maximum number of filters to cache.  Filters can accelerate certain
     489  field-based queries, like language, document format, etc.  Each
     490  filter requires one bit of RAM per page.  So, with a 10 million page
     491  index, a cache size of 16 consumes two bytes per page, or 20MB.
     492  </description>
     493</property>
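<!-- Worked example extending the figures above: memory is roughly (cache size * number of pages / 8) bytes. Doubling the cache to 32 filters on the same 10 million page index would need 32 bits, that is four bytes, per page, or about 40MB. -->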
     494
     495<property>
     496  <name>searcher.filter.cache.threshold</name>
     497  <value>0.05</value>
     498  <description>
     499  Filters are cached when their term is matched by more than this
     500  fraction of pages.  For example, with a threshold of 0.05, and 10
     501  million pages, the term must match more than 1/20, or 500,000 pages.
     502  So, if out of 10 million pages, 50% of pages are in English, and 2%
     503  are in Finnish, then, with a threshold of 0.05, searches for
     504  "lang:en" will use a cached filter, while searches for "lang:fi"
     505  will score all 200,000 Finnish documents.
     506  </description>
     507</property>
     508
     509<property>
     510  <name>searcher.hostgrouping.rawhits.factor</name>
     511  <value>2.0</value>
     512  <description>
     513  A factor that is used to determine the number of raw hits
     514  initially fetched, before host grouping is done.
     515  </description>
     516</property>
     517
     518<property>
     519  <name>searcher.summary.context</name>
     520  <value>5</value>
     521  <description>
     522  The number of context terms to display preceding and following
     523  matching terms in a hit summary.
     524  </description>
     525</property>
     526
     527<property>
     528  <name>searcher.summary.length</name>
     529  <value>20</value>
     530  <description>
     531  The total number of terms to display in a hit summary.
     532  </description>
     533</property>
     534
     535<!-- URL normalizer properties -->
     536
     537<property>
     538  <name>urlnormalizer.class</name>
     539  <value>org.apache.nutch.net.BasicUrlNormalizer</value>
     540  <description>Name of the class used to normalize URLs.</description>
     541</property>
     542
     543<property>
     544  <name>urlnormalizer.regex.file</name>
     545  <value>regex-normalize.xml</value>
     546  <description>Name of the config file used by the RegexUrlNormalizer class.</description></property>
     547
     548<!-- mime properties -->
     549
     550<property>
     551  <name>mime.types.file</name>
     552  <value>mime-types.xml</value>
     553  <description>Name of file in CLASSPATH containing filename extension and
     554  magic sequence to mime types mapping information</description>
     555</property>
     556
     557<property>
     558  <name>mime.type.magic</name>
     559  <value>true</value>
     560  <description>Defines if the mime content type detector uses magic resolution.
     561  </description>
     562</property>
     563
     564<!-- ipc properties -->
     565
     566<property>
     567  <name>ipc.client.timeout</name>
     568  <value>10000</value>
     569  <description>Defines the timeout for IPC calls in milliseconds. </description>
     570</property>
     571
     572<!-- plugin properties -->
     573
     574<property>
     575  <name>plugin.folders</name>
     576  <value>plugins</value>
     577  <description>Directories where nutch plugins are located.  Each
     578  element may be a relative or absolute path.  If absolute, it is used
     579  as is.  If relative, it is searched for on the classpath.</description>
     580</property>
     581
     582<property>
     583  <name>plugin.includes</name>
     584  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
     585  <description>Regular expression naming plugin directory names to
     586  include.  Any plugin not matching this expression is excluded.
     587  In any case you need to include at least the nutch-extensionpoints plugin. By
     588  default Nutch includes crawling just HTML and plain text via HTTP,
     589  and basic indexing and search plugins.
     590  </description>
     591</property>
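<!-- Illustrative sketch for nutch-site.xml (assumption: the protocol-ftp plugin is present in the plugins directory of your build). Because the value is a regular expression over plugin directory names, FTP support could be enabled by widening the protocol alternative while leaving the rest of the default untouched:
<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-(http|ftp)|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
-->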
     592
     593<property>
     594  <name>plugin.excludes</name>
     595  <value></value>
     596  <description>Regular expression naming plugin directory names to exclude.
     597  </description>
     598</property>
     599
     600<property>
     601  <name>parser.character.encoding.default</name>
     602  <value>windows-1252</value>
     603  <description>The character encoding to fall back to when no other information
     604  is available</description>
     605</property>
     606
     607<property>
     608  <name>parser.html.impl</name>
     609  <value>neko</value>
     610  <description>HTML Parser implementation. Currently the following keywords
     611  are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
     612  </description>
     613</property>
     614
     615<!-- urlfilter plugin properties -->
     616
     617<property>
     618  <name>urlfilter.regex.file</name>
     619  <value>regex-urlfilter.txt</value>
     620  <description>Name of file on CLASSPATH containing regular expressions
     621  used by urlfilter-regex (RegexURLFilter) plugin.</description>
     622</property>
     623
     624<property>
     625  <name>urlfilter.prefix.file</name>
     626  <value>prefix-urlfilter.txt</value>
     627  <description>Name of file on CLASSPATH containing url prefixes
     628  used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
     629</property>
     630
     631<property>
     632  <name>urlfilter.order</name>
     633  <value></value>
     634  <description>The order by which url filters are applied.
     635  If empty, all available url filters (as dictated by properties
     636  plugin-includes and plugin-excludes above) are loaded and applied in system
     637  defined order. If not empty, only named filters are loaded and applied
     638  in given order. For example, if this property has value:
     639  org.apache.nutch.net.RegexURLFilter org.apache.nutch.net.PrefixURLFilter
     640  then RegexURLFilter is applied first, and PrefixURLFilter second.
     641  Since all filters are AND'ed, filter ordering does not have impact
     642  on end result, but it may have performance implication, depending
     643  on relative expensiveness of filters.
     644  </description>
     645</property>
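<!-- Illustrative sketch for nutch-site.xml, using the two class names from the description above. It assumes both the urlfilter-regex and urlfilter-prefix plugins are enabled through plugin.includes; the prefix filter is listed first here purely as an example of ordering by expected cost.
<property>
  <name>urlfilter.order</name>
  <value>org.apache.nutch.net.PrefixURLFilter org.apache.nutch.net.RegexURLFilter</value>
</property>
-->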
     646
     647<!-- clustering extension properties -->
     648
     649<property>
     650  <name>extension.clustering.hits-to-cluster</name>
     651  <value>100</value>
     652  <description>Number of snippets retrieved for the clustering extension
     653  if clustering extension is available and user requested results
     654  to be clustered.</description>
     655</property>
     656
     657<property>
     658  <name>extension.clustering.extension-name</name>
     659  <value></value>
     660  <description>Use the specified online clustering extension. If empty,
     661  the first available extension will be used. The "name" here refers to an 'id'
     662  attribute of the 'implementation' element in the plugin descriptor XML
     663  file.</description>
     664</property>
     665
     666<!-- ontology extension properties -->
     667
     668<property>
     669  <name>extension.ontology.extension-name</name>
     670  <value></value>
     671  <description>Use the specified online ontology extension. If empty,
     672  the first available extension will be used. The "name" here refers to an 'id'
     673  attribute of the 'implementation' element in the plugin descriptor XML
     674  file.</description>
     675</property>
     676
     677<property>
     678  <name>extension.ontology.urls</name>
     679  <value>
     680  </value>
     681  <description>Urls of owl files, separated by spaces, such as
     682  http://www.example.com/ontology/time.owl
     683  http://www.example.com/ontology/space.owl
     684  http://www.example.com/ontology/wine.owl
     685  Or
     686  file:/ontology/time.owl
     687  file:/ontology/space.owl
     688  file:/ontology/wine.owl
     689  You have to make sure each url is valid.
     690  By default, there is no owl file, so query refinement based on ontology
     691  is silently ignored.
     692  </description>
     693</property>
     694
     695<!-- query-basic plugin properties -->
     696
     697<property>
     698  <name>query.url.boost</name>
     699  <value>4.0</value>
     700  <description> Used as a boost for url field in Lucene query.
     701  </description>
     702</property>
     703
     704<property>
     705  <name>query.anchor.boost</name>
     706  <value>2.0</value>
     707  <description> Used as a boost for anchor field in Lucene query.
     708  </description>
     709</property>
     710
     711
     712<property>
     713  <name>query.title.boost</name>
     714  <value>1.5</value>
     715  <description> Used as a boost for title field in Lucene query.
     716  </description>
     717</property>
     718
     719<property>
     720  <name>query.host.boost</name>
     721  <value>2.0</value>
     722  <description> Used as a boost for host field in Lucene query.
     723  </description>
     724</property>
     725
     726<property>
     727  <name>query.phrase.boost</name>
     728  <value>1.0</value>
     729  <description> Used as a boost for phrase in Lucene query.
     730  Multiplied by boost for field phrase is matched in.
     731  </description>
     732</property>
     733
     734<!-- language-identifier plugin properties -->
     735
     736<property>
     737  <name>lang.ngram.min.length</name>
     738  <value>1</value>
     739  <description> The minimum size of ngrams to use to identify
     740  language (must be between 1 and lang.ngram.max.length).
     741  The larger the range between lang.ngram.min.length and
     742  lang.ngram.max.length, the better the identification, but
     743  the slower it is.
     744  </description>
     745</property>
     746
     747<property>
     748  <name>lang.ngram.max.length</name>
     749  <value>4</value>
     750  <description> The maximum size of ngrams to use to identify
     751  language (must be between lang.ngram.min.length and 4).
     752  The larger the range between lang.ngram.min.length and
     753  lang.ngram.max.length, the better the identification, but
     754  the slower it is.
     755  </description>
     756</property>
     757
     758<property>
     759  <name>lang.analyze.max.length</name>
     760  <value>2048</value>
     761  <description> The maximum number of bytes of data to use to identify
     762  the language (0 means full content analysis).
     763  The larger this value, the better the analysis, but the
     764  slower it is.
     765  </description>
     766</property>
     767
     768</nutch-conf>
     769}}}
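
As the notes at the top of nutch-default.xml say, overrides belong in nutch-site.xml rather than in the default file. A minimal sketch of such a file follows, assuming it uses the same nutch-conf root element and stylesheet reference as the file above; the two properties and their values are arbitrary examples, not recommended settings.

{{{
#!xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<!-- Site-specific overrides; anything not listed here keeps its value from nutch-default.xml -->
<nutch-conf>

<property>
  <name>db.default.fetch.interval</name>
  <value>7</value>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>20</value>
</property>

</nutch-conf>
}}}

Only the name and value need to be repeated for each overridden property; the descriptions can stay in nutch-default.xml.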