0525

Timestamp: May 25, 2010, 4:33:59 PM
Author: waue

waue/2010/0525

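+ * Original article: [http://zolomon.javaeye.com/blog/378871]
+ * This configuration file is from [http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.7/conf/nutch-default.xml Nutch 0.7]
+ * Notes:
+   * This is mainly a translation (chiefly of nutch-default.xml),
+   * plus the author's own experience using Nutch,
+   * plus content from the FAQ on the official Nutch wiki, http://wiki.apache.org/nutch/FAQ ,
+   * combined with earlier explanations of Nutch configuration files by other users.
+ nutch-default.xml: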
+{{{
+#!xml
+<?xml version="1.0"?>
+<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
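+<!--begin  First come some notes describing how this file is meant to be used  begin-->
+<!-- Do not modify this file directly. Instead, copy the entries you need into nutch-site.xml and change their values there. (Here "entry" means the content between <property> and </property>, or more precisely everything except the changeable parts such as <value></value>.) If nutch-site.xml does not exist, create it yourself. -->
+<!--end  nutch-site.xml can be rendered in several styles; pointing it at a different xsl gives a different style, so do not be surprised if Nutch configuration files found online look different. What style each xsl defines is not described here; please consult the xsl files shipped in the Nutch archive.  end-->
+<!--begin  Root element of the Nutch configuration file  begin-->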
+<nutch-conf>
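+<!--begin  The properties in the Nutch configuration file are grouped into blocks, each configuring one set of properties, so the structure is easy to follow; to change something, go straight to the relevant block. For example, the HTTP properties block below holds the http-related settings, and later blocks cover the ftp settings, the searcher settings, and so on.  begin-->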
+<!-- HTTP properties -->
+<property>
+  <name>http.agent.name</name>
+  <value>NutchCVS</value>
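+  <description>Our HTTP 'User-Agent' request header.</description>
+</property>
+<!--end  The author is not entirely sure what this property is used for, but it is one of the three required properties in the Nutch 1.0 configuration file. It may be how Apache gathers information about Nutch users.  end-->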
+<property>
+  <name>http.robots.agents</name>
+  <value>NutchCVS,Nutch,*</value>
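+  <description>The agent strings we look for in robots.txt files,
+  comma-separated, in decreasing order of precedence.</description>
+</property>
+<!--end  Reading robots.txt is part of the protocol that search engines follow, and our crawler honors the requirements stated in robots.txt. For more on robots.txt, see http://www.robotstxt.org/  end-->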
+<property>
+  <name>http.robots.403.allow</name>
+  <value>true</value>
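+  <description>Some servers return HTTP status 403 (Forbidden) when /robots.txt does not exist. This should probably mean that we are still allowed to crawl the site. If this property is set to false, such sites are treated as forbidden and are not crawled.</description>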
+</property>
+<property>
+  <name>http.agent.description</name>
+  <value>Nutch</value>
+  <description>Also used in the User-Agent header: a further description of our bot. This text appears in parentheses after the agent name.
+  </description>
+</property>
+<property>
+  <name>http.agent.url</name>
+  <value>http://lucene.apache.org/nutch/bot.html</value>
+  <description>Also used in the User-Agent header. This text appears after the agent name; it is simply a URL to advertise.
+  </description>
+</property>
+<property>
+  <name>http.agent.email</name>
+  <value>nutch-agent@lucene.apache.org</value>
+  <description>An email address to advertise in the HTTP 'From' request header and the User-Agent header.</description>
+</property>
+<property>
+  <name>http.agent.version</name>
+  <value>0.7.2</value>
+  <description>A version string to advertise in the User-Agent header.</description>
+</property>
+<property>
+  <name>http.timeout</name>
+  <value>10000</value>
+  <description>The default network timeout, in milliseconds.</description>
+</property>
+<property>
+  <name>http.max.delays</name>
+  <value>3</value>
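+  <description>The number of times the fetch of a page may be delayed. Each time a host is found to be busy, Nutch waits for fetcher.server.delay. After http.max.delays such delays, the page is given up for this fetch.</description>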
+</property>
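+<!-- Worked example (a sketch, assuming the defaults in this file and that each delay lasts exactly fetcher.server.delay):
+     http.max.delays = 3 and fetcher.server.delay = 5.0 seconds,
+     so a page on a busy host is given up after roughly 3 x 5.0 = 15 seconds of waiting. -->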
+<property>
+  <name>http.content.limit</name>
+  <value>65536</value>
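+  <description>The maximum size of downloaded content, in bytes.
+  If the value is non-negative (>=0), content beyond this length is truncated; otherwise nothing is truncated.
+  </description>
+</property>
+<!--end  "Download" here does not mean manually clicking to download a piece of software. Some beginners mistake this "download" for a downloadable item on a web page (an attachment, for example). What we mean is that whenever a web page is visited, it is downloaded from the network so that it can be viewed in the browser; every time a page is opened or visited, that page is downloaded once.  end-->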
+<property>
+  <name>http.proxy.host</name>
+  <value></value>
+  <description>The proxy hostname. If empty, no proxy is used.</description>
+</property>
+<property>
+  <name>http.proxy.port</name>
+  <value></value>
+  <description>The proxy port.</description>
+</property>
+<property>
+  <name>http.verbose</name>
+  <value>false</value>
+  <description>If true, HTTP will log more verbosely.</description>
+</property>
+<!--end  The exact effect is unclear and needs further testing. Going by the description, setting this to true makes HTTP activity log in much more detail.  end-->
+<property>
+  <name>http.redirect.max</name>
+  <value>3</value>
+  <description>The maximum number of redirects followed when fetching a page. If a page has more redirects than this, the fetcher gives up on it and tries the next page.</description>
+</property>
+<!-- FILE properties -->
+<property>
+  <name>file.content.limit</name>
+  <value>65536</value>
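+  <description>The length limit for downloaded content, in bytes.
+  If the value is greater than zero, content beyond this length is truncated; otherwise (zero or negative), nothing is truncated.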
+  </description>
+</property>
+<property>
+  <name>file.content.ignored</name>
+  <value>true</value>
+  <description>If true, no file content is stored during fetch.
+  This is usually what we want, because file:// URLs normally point to local files, which we can crawl and index directly. Otherwise (if not true), file contents are stored.
+  !! NOT IMPLEMENTED YET !!
+  </description>
+</property>
+<!-- FTP properties -->
+<property>
+  <name>ftp.username</name>
+  <value>anonymous</value>
+  <description>The ftp login username.</description>
+</property>
+<property>
+  <name>ftp.password</name>
+  <value>anonymous@example.com</value>
+  <description>The ftp login password.</description>
+</property>
+<property>
+  <name>ftp.content.limit</name>
+  <value>65536</value>
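+  <description>The upper limit on file content length, in bytes.
+  If the value is greater than zero, content beyond this length is truncated; otherwise (zero or negative), nothing is truncated. Caution: the classical
+  ftp RFCs never defined partial transfer and, in fact, some ftp servers cannot handle a forced close from the client side.
+  Our implementation tries its best to handle such situations smoothly.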
+  </description>
+</property>
+<property>
+  <name>ftp.timeout</name>
+  <value>60000</value>
+  <description>The default timeout for the ftp client socket, in milliseconds. Please also see the ftp.keep.connection property below.</description>
+</property>
+<property>
+  <name>ftp.server.timeout</name>
+  <value>100000</value>
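+  <description>An estimate of the ftp server idle time, in milliseconds. For many ftp servers, 120000 milliseconds is typical.
+  It is best to be conservative here. Together with ftp.timeout, this value is used to decide whether to delete (kill) the current ftp.client instance and force a new ftp.client instance to start. This is necessary because a fetcher thread may not get to its next request in time (it may be sitting idle) before the ftp client times out or the remote server disconnects.
+  Used only when ftp.keep.connection (see below) is true.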
+  </description>
+</property>
+<property>
+  <name>ftp.keep.connection</name>
+  <value>false</value>
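+  <description>Whether to keep the ftp connection open. Useful when crawling the same host again and again. If set to true, it avoids repeating the connection, login, and directory-list parser setup for subsequent URLs. If set to true, however, you must make sure (roughly) that:
+  (1) ftp.timeout is less than ftp.server.timeout
+  (2) ftp.timeout is larger than (fetcher.threads.fetch * fetcher.server.delay)
+  Otherwise a large number of "delete client because idled too long" messages will appear in the thread logs.</description>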
+</property>
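+<!-- Checking the two rules above against the defaults in this file
+     (a sketch; it assumes fetcher.server.delay, given in seconds, is converted to milliseconds before comparing):
+     (1) ftp.timeout 60000 < ftp.server.timeout 100000 : holds
+     (2) fetcher.threads.fetch * fetcher.server.delay = 10 * 5.0 s = 50 s = 50000 ms < ftp.timeout 60000 : holds
+     So the defaults already satisfy both rules if ftp.keep.connection is switched to true. -->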
+<property>
+  <name>ftp.follow.talk</name>
+  <value>false</value>
+  <description>Whether to log the dialogue between our client and the remote server. Useful for debugging.</description>
+</property>
+<!-- web db properties -->
+<property>
+  <name>db.default.fetch.interval</name>
+  <value>30</value>
+  <description>The default number of days between re-fetches of a page.
+  </description>
+</property>
+<property>
+  <name>db.ignore.internal.links</name>
+  <value>true</value>
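+  <description>If true, links from the same host are ignored when adding links to a new page. This is a very effective way to limit the size of the link database, keeping only the highest-quality links.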
+  </description>
+</property>
+<!--end  This property is very useful for influencing how the search engine presents pages  end-->
+<property>
+  <name>db.score.injected</name>
+  <value>1.0</value>
+  <description>The score assigned to new pages added by the injector.
+  </description>
+</property>
+<property>
+  <name>db.score.link.external</name>
+  <value>1.0</value>
+  <description>The score factor for new pages added due to a link from
+  another host, relative to the referencing page's score.
+  </description>
+</property>
+<property>
+  <name>db.score.link.internal</name>
+  <value>1.0</value>
+  <description>The score factor for pages added due to a link from the
+  same host, relative to the referencing page's score.
+  </description>
+</property>
+<property>
+  <name>db.max.outlinks.per.page</name>
+  <value>100</value>
+  <description>The maximum number of outlinks from a single page that we will process.</description>
+</property>
+<property>
+  <name>db.max.anchor.length</name>
+  <value>100</value>
+  <description>The maximum length of a link's anchor text.</description>
+</property>
+<property>
+  <name>db.fetch.retry.max</name>
+  <value>3</value>
+  <description>The maximum number of retries when fetching.</description>
+</property>
+<!-- fetchlist tool properties -->
+<property>
+  <name>fetchlist.score.by.link.count</name>
+  <value>true</value>
+  <description>If true, set page scores on fetchlist entries based on
+  log(number of anchors), instead of using original page scores. This
+  results in prioritization of pages with many incoming links.
+  </description>
+</property>
+<!-- fetcher properties -->
+<property>
+  <name>fetcher.server.delay</name>
+  <value>5.0</value>
+  <description>The number of seconds the fetcher will delay between
+   successive requests to the same server.</description>
+</property>
+<property>
+  <name>fetcher.threads.fetch</name>
+  <value>10</value>
+  <description>The number of fetcher threads used concurrently.
+    This also determines the maximum number of requests that are
+    made at once (each FetcherThread handles one connection).</description>
+</property>
+<property>
+  <name>fetcher.threads.per.host</name>
+  <value>1</value>
+  <description>The maximum number of threads allowed to fetch from one host at the same time.</description>
+</property>
+<property>
+  <name>fetcher.verbose</name>
+  <value>false</value>
+  <description>If true, the fetcher will log more verbosely.</description>
+</property>
+<!-- parser properties -->
+<property>
+  <name>parser.threads.parse</name>
+  <value>10</value>
+  <description>The number of parser threads ParseSegment should use concurrently.</description>
+</property>
+<!-- i/o properties -->
+<property>
+  <name>io.sort.factor</name>
+  <value>100</value>
+  <description>The number of streams to merge at once while sorting
+  files.  This determines the number of open file handles.</description>
+</property>
+<property>
+  <name>io.sort.mb</name>
+  <value>100</value>
+  <description>The total amount of buffer memory to use while sorting
+  files, in megabytes.  By default, gives each merge stream 1MB, which
+  should minimize seeks.</description>
+</property>
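+<!-- Worked example (a sketch, assuming the sort buffer is split evenly across merge streams):
+     io.sort.mb / io.sort.factor = 100 MB / 100 streams = 1 MB per stream, the default noted above.
+     Raising io.sort.factor to 200 while leaving io.sort.mb at 100 would leave 0.5 MB per stream. -->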
+<property>
+  <name>io.file.buffer.size</name>
+  <value>131072</value>
+  <description>The size of buffer for use in sequence files.
+  The size of this buffer should probably be a multiple of hardware
+  page size (4096 on Intel x86), and it determines how much data is
+  buffered during read and write operations.</description>
+</property>
+<!-- file system properties -->
+<property>
+  <name>fs.default.name</name>
+  <value>local</value>
+  <description>The name of the default file system.  Either the
+  literal string "local" or a host:port for NDFS.</description>
+</property>
+<property>
+  <name>ndfs.name.dir</name>
+  <value>/tmp/nutch/ndfs/name</value>
+  <description>Determines where on the local filesystem the NDFS name node
+      should store the name table.</description>
+</property>
+<property>
+  <name>ndfs.data.dir</name>
+  <value>/tmp/nutch/ndfs/data</value>
+  <description>Determines where on the local filesystem an NDFS data node
+      should store its blocks.</description>
+</property>
+<!-- map/reduce properties -->
+<property>
+  <name>mapred.job.tracker</name>
+  <value>localhost:8010</value>
+  <description>The host and port that the MapReduce job tracker runs at.
+  </description>
+</property>
+<property>
+  <name>mapred.local.dir</name>
+  <value>/tmp/nutch/mapred/local</value>
+  <description>The local directory where MapReduce stores temporary files
+      related to tasks and jobs.
+  </description>
+</property>
+<!-- indexer properties -->
+<property>
+  <name>indexer.score.power</name>
+  <value>0.5</value>
+  <description>Determines the power of link analysis scores.  Each
+  page's boost is set to <i>score<sup>scorePower</sup></i> where
+  <i>score</i> is its link analysis score and <i>scorePower</i> is the
+  value of this parameter.  This is compiled into indexes, so, when
+  this is changed, pages must be re-indexed for it to take
+  effect.</description>
+</property>
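+<!-- Worked example of the formula above, boost = score^scorePower, with the default scorePower of 0.5
+     (the scores are made-up values for illustration):
+     score 1.0   gives boost 1.0
+     score 4.0   gives boost 2.0
+     score 100.0 gives boost 10.0
+     So a square-root power compresses large link scores: a 100x difference in score becomes a 10x difference in boost. -->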
+<property>
+  <name>indexer.boost.by.link.count</name>
+  <value>true</value>
+  <description>When true, scores for a page are multiplied by the log of
+  the number of incoming links to the page.</description>
+</property>
+<property>
+  <name>indexer.max.title.length</name>
+  <value>100</value>
+  <description>The maximum number of characters of a title that are indexed.
+  </description>
+</property>
+<property>
+  <name>indexer.max.tokens</name>
+  <value>10000</value>
+  <description>
+  The maximum number of tokens that will be indexed for a single field
+  in a document. This limits the amount of memory required for
+  indexing, so that collections with very large files will not crash
+  the indexing process by running out of memory.
+  Note that this effectively truncates large documents, excluding
+  from the index tokens that occur further in the document. If you
+  know your source documents are large, be sure to set this value
+  high enough to accommodate the expected size. If you set it to
+  Integer.MAX_VALUE, then the only limit is your memory, but you
+  should anticipate an OutOfMemoryError.
+  </description>
+</property>
+<property>
+  <name>indexer.mergeFactor</name>
+  <value>50</value>
+  <description>The factor that determines the frequency of Lucene segment
+  merges. This must not be less than 2, higher values increase indexing
+  speed but lead to increased RAM usage, and increase the number of
+  open file handles (which may lead to "Too many open files" errors).
+  NOTE: the "segments" here have nothing to do with Nutch segments, they
+  are a low-level data unit used by Lucene.
+  </description>
+</property>
+<property>
+  <name>indexer.minMergeDocs</name>
+  <value>50</value>
+  <description>This number determines the minimum number of Lucene
+  Documents buffered in memory between Lucene segment merges. Larger
+  values increase indexing speed and increase RAM usage.
+  </description>
+</property>
+<property>
+  <name>indexer.maxMergeDocs</name>
+  <value>2147483647</value>
+  <description>This number determines the maximum number of Lucene
+  Documents to be merged into a new Lucene segment. Larger values
+  increase indexing speed and reduce the number of Lucene segments,
+  which reduces the number of open file handles; however, this also
+  increases RAM usage during indexing.
+  </description>
+</property>
+<property>
+  <name>indexer.termIndexInterval</name>
+  <value>128</value>
+  <description>Determines the fraction of terms which Lucene keeps in
+  RAM when searching, to facilitate random-access.  Smaller values use
+  more memory but make searches somewhat faster.  Larger values use
+  less memory but make searches somewhat slower.
+  </description>
+</property>
+<!-- analysis properties -->
+<property>
+  <name>analysis.common.terms.file</name>
+  <value>common-terms.utf8</value>
+  <description>The name of a file containing a list of common terms
+  that should be indexed in n-grams.</description>
+</property>
+<!-- searcher properties -->
+<property>
+  <name>searcher.dir</name>
+  <value>.</value>
+  <description>
+  Path to root of index directories.  This directory is searched (in
+  order) for either the file search-servers.txt, containing a list of
+  distributed search servers, or the directory "index" containing
+  merged indexes, or the directory "segments" containing segment
+  indexes.
+  </description>
+</property>
+<property>
+  <name>searcher.filter.cache.size</name>
+  <value>16</value>
+  <description>
+  Maximum number of filters to cache.  Filters can accelerate certain
+  field-based queries, like language, document format, etc.  Each
+  filter requires one bit of RAM per page.  So, with a 10 million page
+  index, a cache size of 16 consumes two bytes per page, or 20MB.
+  </description>
+</property>
+<property>
+  <name>searcher.filter.cache.threshold</name>
+  <value>0.05</value>
+  <description>
+  Filters are cached when their term is matched by more than this
+  fraction of pages.  For example, with a threshold of 0.05, and 10
+  million pages, the term must match more than 1/20, or 500,000 pages.
+  So, if out of 10 million pages, 50% of pages are in English, and 2%
+  are in Finnish, then, with a threshold of 0.05, searches for
+  "lang:en" will use a cached filter, while searches for "lang:fi"
+  will score all 200,000 Finnish documents.
+  </description>
+</property>
+<property>
+  <name>searcher.hostgrouping.rawhits.factor</name>
+  <value>2.0</value>
+  <description>
+  A factor that is used to determine the number of raw hits
+  initially fetched, before host grouping is done.
+  </description>
+</property>
+<property>
+  <name>searcher.summary.context</name>
+  <value>5</value>
+  <description>
+  The number of context terms to display preceding and following
+  matching terms in a hit summary.
+  </description>
+</property>
+<property>
+  <name>searcher.summary.length</name>
+  <value>20</value>
+  <description>
+  The total number of terms to display in a hit summary.
+  </description>
+</property>
+<!-- URL normalizer properties -->
+<property>
+  <name>urlnormalizer.class</name>
+  <value>org.apache.nutch.net.BasicUrlNormalizer</value>
+  <description>Name of the class used to normalize URLs.</description>
+</property>
+<property>
+  <name>urlnormalizer.regex.file</name>
+  <value>regex-normalize.xml</value>
+  <description>Name of the config file used by the RegexUrlNormalizer class.</description>
+</property>
+<!-- mime properties -->
+<property>
+  <name>mime.types.file</name>
+  <value>mime-types.xml</value>
+  <description>Name of file in CLASSPATH containing filename extension and
+  magic sequence to mime types mapping information</description>
+</property>
+<property>
+  <name>mime.type.magic</name>
+  <value>true</value>
+  <description>Defines if the mime content type detector uses magic resolution.
+  </description>
+</property>
+<!-- ipc properties -->
+<property>
+  <name>ipc.client.timeout</name>
+  <value>10000</value>
+  <description>Defines the timeout for IPC calls in milliseconds. </description>
+</property>
+<!-- plugin properties -->
+<property>
+  <name>plugin.folders</name>
+  <value>plugins</value>
+  <description>Directories where nutch plugins are located.  Each
+  element may be a relative or absolute path.  If absolute, it is used
+  as is.  If relative, it is searched for on the classpath.</description>
+</property>
+<property>
+  <name>plugin.includes</name>
+  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
+  <description>Regular expression naming plugin directory names to
+  include.  Any plugin not matching this expression is excluded.
+  In any case you need to include at least the nutch-extensionpoints plugin. By
+  default Nutch includes crawling just HTML and plain text via HTTP,
+  and basic indexing and search plugins.
+  </description>
+</property>
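+<!-- A sketch of overriding plugin.includes in nutch-site.xml to enable one more plugin.
+     It assumes the plugin directory is named language-identifier, as the section comment
+     near the end of this file suggests; the rest of the value is the default copied unchanged:
+     <property>
+       <name>plugin.includes</name>
+       <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|language-identifier</value>
+     </property>
+     Because the value is one regular expression, each added plugin is joined with the | character. -->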
+<property>
+  <name>plugin.excludes</name>
+  <value></value>
+  <description>Regular expression naming plugin directory names to exclude.
+  </description>
+</property>
+<property>
+  <name>parser.character.encoding.default</name>
+  <value>windows-1252</value>
+  <description>The character encoding to fall back to when no other information
+  is available</description>
+</property>
+<property>
+  <name>parser.html.impl</name>
+  <value>neko</value>
+  <description>HTML Parser implementation. Currently the following keywords
+  are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
+  </description>
+</property>
+<!-- urlfilter plugin properties -->
+<property>
+  <name>urlfilter.regex.file</name>
+  <value>regex-urlfilter.txt</value>
+  <description>Name of file on CLASSPATH containing regular expressions
+  used by urlfilter-regex (RegexURLFilter) plugin.</description>
+</property>
+<property>
+  <name>urlfilter.prefix.file</name>
+  <value>prefix-urlfilter.txt</value>
+  <description>Name of file on CLASSPATH containing url prefixes
+  used by urlfilter-prefix (PrefixURLFilter) plugin.</description>
+</property>
+<property>
+  <name>urlfilter.order</name>
+  <value></value>
+  <description>The order by which url filters are applied.
+  If empty, all available url filters (as dictated by properties
+  plugin-includes and plugin-excludes above) are loaded and applied in system
+  defined order. If not empty, only named filters are loaded and applied
+  in given order. For example, if this property has value:
+  org.apache.nutch.net.RegexURLFilter org.apache.nutch.net.PrefixURLFilter
+  then RegexURLFilter is applied first, and PrefixURLFilter second.
+  Since all filters are AND'ed, filter ordering does not have impact
+  on end result, but it may have performance implication, depending
+  on relative expensiveness of filters.
+  </description>
+</property>
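+<!-- A sketch of setting urlfilter.order in nutch-site.xml, using the class names from the description above.
+     It assumes the urlfilter-prefix plugin has also been added to plugin.includes, since the
+     default value in this file only matches urlfilter-regex:
+     <property>
+       <name>urlfilter.order</name>
+       <value>org.apache.nutch.net.RegexURLFilter org.apache.nutch.net.PrefixURLFilter</value>
+     </property>
+     Putting the cheaper filter first is a plausible choice, presumably because a URL rejected early need not be checked by the later filter. -->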
+<!-- clustering extension properties -->
+<property>
+  <name>extension.clustering.hits-to-cluster</name>
+  <value>100</value>
+  <description>Number of snippets retrieved for the clustering extension
+  if clustering extension is available and user requested results
+  to be clustered.</description>
+</property>
+<property>
+  <name>extension.clustering.extension-name</name>
+  <value></value>
+  <description>Use the specified online clustering extension. If empty,
+  the first available extension will be used. The "name" here refers to an 'id'
+  attribute of the 'implementation' element in the plugin descriptor XML
+  file.</description>
+</property>
+<!-- ontology extension properties -->
+<property>
+  <name>extension.ontology.extension-name</name>
+  <value></value>
+  <description>Use the specified online ontology extension. If empty,
+  the first available extension will be used. The "name" here refers to an 'id'
+  attribute of the 'implementation' element in the plugin descriptor XML
+  file.</description>
+</property>
+<property>
+  <name>extension.ontology.urls</name>
+  <value>
+  </value>
+  <description>Urls of owl files, separated by spaces, such as
+  http://www.example.com/ontology/time.owl
+  http://www.example.com/ontology/space.owl
+  http://www.example.com/ontology/wine.owl
+  Or
+  file:/ontology/time.owl
+  file:/ontology/space.owl
+  file:/ontology/wine.owl
+  You have to make sure each url is valid.
+  By default, there is no owl file, so query refinement based on ontology
+  is silently ignored.
+  </description>
+</property>
+<!-- query-basic plugin properties -->
+<property>
+  <name>query.url.boost</name>
+  <value>4.0</value>
+  <description> Used as a boost for url field in Lucene query.
+  </description>
+</property>
+<property>
+  <name>query.anchor.boost</name>
+  <value>2.0</value>
+  <description> Used as a boost for anchor field in Lucene query.
+  </description>
+</property>
+<property>
+  <name>query.title.boost</name>
+  <value>1.5</value>
+  <description> Used as a boost for title field in Lucene query.
+  </description>
+</property>
+<property>
+  <name>query.host.boost</name>
+  <value>2.0</value>
+  <description> Used as a boost for host field in Lucene query.
+  </description>
+</property>
+<property>
+  <name>query.phrase.boost</name>
+  <value>1.0</value>
+  <description> Used as a boost for phrase in Lucene query.
+  Multiplied by the boost for the field the phrase is matched in.
+  </description>
+</property>
+<!-- language-identifier plugin properties -->
+<property>
+  <name>lang.ngram.min.length</name>
+  <value>1</value>
+  <description> The minimum size of ngrams to use to identify
+  language (must be between 1 and lang.ngram.max.length).
+  The larger the range between lang.ngram.min.length and
+  lang.ngram.max.length, the better the identification, but
+  the slower it is.
+  </description>
+</property>
+<property>
+  <name>lang.ngram.max.length</name>
+  <value>4</value>
+  <description> The maximum size of ngrams to use to identify
+  language (must be between lang.ngram.min.length and 4).
+  The larger the range between lang.ngram.min.length and
+  lang.ngram.max.length, the better the identification, but
+  the slower it is.
+  </description>
+</property>
+<property>
+  <name>lang.analyze.max.length</name>
+  <value>2048</value>
+  <description> The maximum number of bytes of data to use to identify
+  the language (0 means full content analysis).
+  The larger this value, the better the analysis, but the
+  slower it is.
+  </description>
+</property>
+</nutch-conf>
+}}}
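+
+ A minimal nutch-site.xml sketch, following the rule stated at the top of nutch-default.xml: copy only the properties you want to change and give them new values. The agent name, URL, email address, and the 1 MB limit below are hypothetical placeholders chosen for illustration, not values from the original article:
+{{{
+#!xml
+<?xml version="1.0"?>
+<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
+<nutch-conf>
+<!-- Hypothetical crawler identity; replace with your own. -->
+<property>
+  <name>http.agent.name</name>
+  <value>MyCrawler</value>
+</property>
+<property>
+  <name>http.agent.url</name>
+  <value>http://www.example.com/bot.html</value>
+</property>
+<property>
+  <name>http.agent.email</name>
+  <value>crawler-admin@example.com</value>
+</property>
+<!-- Keep up to 1048576 bytes (1 MB) per page instead of the default 65536. -->
+<property>
+  <name>http.content.limit</name>
+  <value>1048576</value>
+</property>
+</nutch-conf>
+}}}
+ Properties that are not listed here keep the defaults from nutch-default.xml, so the site file can stay short.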