Changeset 73


Timestamp:
Jun 2, 2009, 5:39:43 PM
Author:
waue
Message:

the version can make deb

Location:
nutchez-0.1
Files:
4 edited

  • nutchez-0.1/CHANGES.txt

    r66 r73
    1      - Nutch Change Log
         1 + nutchez Change Log
    2    2
    3 Release 1.0 - 2009-03-23
    4 
    5  1. NUTCH-474 - Fetcher2 crawlDelay and blocking fix (Dogacan Guney via ab)
    6 
    7  2. NUTCH-443 - Allow parsers to return multiple Parse objects.
    8     (Dogacan Guney et al, via ab)
    9 
    10  3. NUTCH-393 - Indexer should handle null documents returned by filters.
    11     (Eelco Lempsink via ab)
    12 
    13  4. NUTCH-456 - Parse msexcel plugin speedup (Heiko Dietze via siren)
    14 
    15  5. NUTCH-446 - RobotRulesParser should ignore Crawl-delay values of other
    16     bots in robots.txt (Dogacan Guney via siren)
    17 
    18  6. NUTCH-482 - Remove redundant plugin lib-log4j (siren)
    19  
    20  7. NUTCH-483 - Remove redundant commons-logging jar from ontology plugin
    21     (siren)
    22 
    23  8. NUTCH-161 - Change Plain text parser to
    24     use parser.character.encoding.default property for fall back encoding
    25     (KuroSaka TeruHiko, siren)
    26 
    27  9. NUTCH-61 - Support for adaptive re-fetch interval and detection of
    28     unmodified content. (ab)
    29 
    30 10. NUTCH-392 - OutputFormat implementations should pass on Progressable.
    31     (cutting via ab)
    32 
    33 11. NUTCH-495 - Unnecessary delays in Fetcher2 (dogacan)
    34 
    35 12. NUTCH-443 - allow parsers to return multiple Parse object, this will speed
    36     up the rss parser (dogacan via mattmann). This update is a fix and semantics
    37     change from the original patch for NUTCH-443. The original patch did not tell
    38     the  Indexer to read crawl_parse too so that it can pickup sub-urls' fetch
    39     datums. This patch addresses that issue. Now, if Fetcher gets a null content,
    40     instead of pushing an empty content, it filters the null content.
    41    
    42 13. NUTCH-485 - Change HtmlParseFilter's to return ParseResult object instead of
    43     Parse object. (Gal Nitzan via dogacan)
    44 
    45 14. NUTCH-489 - URLFilter-suffix management of the url path when the url contains
    46     some query parameters. (Emmanuel Joke via dogacan)
    47 
    48 15. NUTCH-502 - Bug in SegmentReader causes infinite loop.
    49     (Ilya Vishnevsky via dogacan)
    50    
    51 16. NUTCH-444 Possibly use a different library to parse RSS feed for improved
    52     performance and compatibility. This patch introduced a new plugin, feed,
    53     that includes an index filter and a parse plugin for feeds that uses ROME.
    54     There was discussion to remove parse-rss, in light of the feed plugin,
    55     however, this patch does not explicitly remove parse-rss. (dogacan, mattmann)
    56 
    57 17. NUTCH-471 - Fix synchronization in NutchBean creation.
    58     (Enis Soztutar via dogacan)
    59 
    60 18. Upgrade to Lucene 2.2.0 and Hadoop 0.12.3. (ab)
    61 
    62 19. NUTCH-468 - Scoring filter should distribute score to all outlinks at
    63     once. (dogacan)
    64 
    65 20. NUTCH-504 - NUTCH-443 broke parsing during fetching. (dogacan)
    66 
    67 21. NUTCH-497 -  Extreme Nested Tags causes StackOverflowException in
    68   DomContentUtils...Spider Trap. (kubes)
    69 
    70 22. NUTCH-434 - Replace usage of ObjectWritable with something based on
    71     GenericWritable. (dogacan)
    72 
    73 23. NUTCH-499 - Refactor LinkDb and LinkDbMerger to reuse code. (dogacan)
    74 
    75 24. NUTCH-498 - Use Combiner in LinkDb to increase speed of linkdb generation.
    76     (Espen Amble Kolstad via dogacan)
    77 
    78 25. NUTCH-507 - lib-lucene-analyzers jar definition is wrong in plugin.xml.
    79     (Emmanuel Joke via dogacan)
    80 
    81 26. NUTCH-503 - Generator exits incorrectly for small fetchlists.
    82     (Vishal Shah via dogacan)
    83 
    84 27. NUTCH-505 - Outlink urls should be validated. (dogacan)
    85 
    86 28. NUTCH-510 - IndexMerger delete working dir. (Enis Soztutar via dogacan)
    87 
    88 29. NUTCH-513 - suffix-urlfilter.txt does not have a template. (dogacan)
    89 
    90 30. NUTCH-515 - Next fetch time is set incorrectly. (dogacan)
    91 
    92 30. NUTCH-506 - Nutch should delegate compression to Hadoop. (dogacan)
    93 
    94 31. NUTCH-517 - build encoding should be UTF-8. (Enis Soztutar via dogacan).
    95 
    96 32. NUTCH-518 - Fix OpicScoringFilter to respect scoring filter chaining.
    97     (Enis Soztutar via dogacan)
    98 
    99 33. NUTCH-516 - Next fetch time is not set when it is a
    100     CrawlDatum.STATUS_FETCH_GONE. (Emmanuel Joke via dogacan)
    101 
    102 34. NUTCH-525 - DeleteDuplicates generates ArrayIndexOutOfBoundsException
    103     when trying to rerun dedup on a segment. (Vishal Shah via dogacan)
    104 
    105 35. NUTCH-514 - Indexer should only index pages with fetch status SUCCESS.
    106     (dogacan) Note: There is a bigger problem, i.e how to deal
    107     with redirected pages, and this issue can be considered as a band-aid
    108     for the time being. See NUTCH-273 and NUTCH-353 for more details.
    109 
    110 36. NUTCH-533 - LinkDbMerger: url normalized is not updated in the key and
    111     inlinks list. (Emmanuel Joke via dogacan)
    112 
    113 37. NUTCH-535 - ParseData's contentMeta accumulates unnecessary values during
    114     parse. (dogacan)
    115 
    116 38. NUTCH-522 - Use URLValidator in the Injector. (Emmanuel Joke, dogacan)
    117 
    118 39. NUTCH-536 - Reduce number of warnings in nutch core. (dogacan)
    119 
    120 40. NUTCH-439 - Top Level Domains Indexing / Scoring. Also adds
    121     domain-related utilities. (Enis Soztutar via dogacan)
    122 
    123 41. NUTCH-544 - Upgrade Carrot2 clustering plugin to the newest stable
    124     release (2.1). (Dawid Weiss via dogacan)
    125 
    126 42. NUTCH-545 - Configuration and OnlineClusterer get initialized in every
    127     request. (Dawid Weiss via dogacan)
    128 
    129 43. NUTCH-532 - CrawlDbMerger: wrong computation of last fetch time.
    130     (Emmanuel Joke via dogacan)
    131 
    132 44. NUTCH-550 - Parse fails if db.max.outlinks.per.page is -1. (dogacan)
    133 
    134 45. NUTCH-546 - file URLs are filtered out by the crawler. (dogacan)
    135 
    136 46. NUTCH-554 - Generator throws IOException on invalid urls.
    137     (Brian Whitman via ab)
    138 
    139 47. NUTCH-529 - NodeWalker.skipChildren doesn't work for more than 1 child.
    140     (Emmanuel Joke via dogacan)
    141 
    142 48. NUTCH-25 - needs 'character encoding' detector.
    143     (Doug Cook, dogacan, Marcin Okraszewski, Renaud Richardet via dogacan)
    144 
    145 49. NUTCH-508 - ${hadoop.log.dir} and ${hadoop.log.file} are not propagated
    146     to the tasktracker. (Mathijs Homminga, Emmanuel Joke via dogacan)
    147    
    148 50. NUTCH-562 - Port mime type framework to use Tika mime detection framework.
    149     (mattmann)
    150    
    151 51. NUTCH-488 - Avoid parsing unnecessary links and get a more relevant outlink
    152     list. (Emmanuel Joke, Marcin Okraszewski via kubes)
    153 
    154 52. NUTCH-501 -  Implement a different caching mechanism for objects cached in
    155     configuration. (dogacan)
    156 
    157 53. NUTCH-552 - Upgrade Nutch to Hadoop 0.15.x. (kubes)
    158 
    159 54. NUTCH-565 - Arc File to Nutch Segments Converter. (kubes)
    160 
    161 55. NUTCH-547 - Redirection handling: YahooSlurp's algorithm.
    162     (dogacan, kubes via dogacan)
    163 
    164 56. NUTCH-548 - Move URLNormalizer from Outlink to ParseOutputFormat.
    165     (Emmanuel Joke via dogacan)
    166 
    167 57. NUTCH-538 - Delete unused classes under o.a.n.util. (dogacan)
    168 
    169 58. NUTCH-494 - FindBugs: CrawlDbReader and DeleteDuplicates. (dogacan)
    170 
    171 59. NUTCH-574 - Including inlink anchor text in index can create irrelevant
    172     search results.  Created index-anchor plugin, removed functionality from
    173     index-basic plugin. For backwards compatibility, add index-anchor plugin to
    174     nutch-site.xml plugin.includes. (kubes)
    175 
    176 60. NUTCH-581 - DistributedSearch does not update search servers added to
    177     search-servers.txt on the fly.  (Rohan Mehta via kubes)
    178 
    179 61. NUTCH-586 - Add option to run compiled classes without job file
    180     (enis via ab)
    181 
    182 62. NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy
    183     server. (Susam Pal via dogacan)
    184 
    185 63. NUTCH-534 - SegmentMerger: add -normalize option (Emmanuel Joke via ab)
    186 
    187 64. NUTCH-528 - CrawlDbReader: add some new stats + dump into a CSV format
    188     (Emmanuel Joke via ab)
    189 
    190 65. NUTCH-597 - NPE in Fetcher2 (Remco Verhoef via ab)
    191 
    192 66. NUTCH-584 - urls missing from fetchlist (Ruslan Ermilov, ab)
    193 
    194 67. NUTCH-580 - Remove deprecated hadoop api calls (FS) (siren)
    195 
    196 68. NUTCH-587 - Upgrade to Hadoop 0.15.3 (kubes)
    197 
    198 69. NUTCH-604 - Upgrade to Lucene 2.3.0 (ab)
    199 
    200 70. NUTCH-602 - Allow configurable number of handlers for search servers
    201     (hartbecke via kubes)
    202 
    203 71. NUTCH-607 - Update build.xml to include tika jar when building war (kubes)
    204 
    205 72. NUTCH-608 - Upgrade nutch to use released apache-tika-0.1-incubating (mattmann)
    206 
    207 73. NUTCH-606 - Refactoring of Generator, run all urls through checks (kubes)
    208 
    209 74. NUTCH-605 - Change deprecated configuration methods for Hadoop (kubes)
    210 
    211 75. NUTCH-603 - Add more default url normalizations (kubes)
    212 
    213 76. NUTCH-611 - Upgrade Nutch to use Hadoop 0.16 (kubes)
    214 
    215 77. NUTCH-44 - Too many search results, limits max results returned from a
    216     single search. (Emilijan Mirceski and Susam Pal via kubes)
    217 
    218 78. NUTCH-567 - Proper (?) handling of URIs in TagSoup. TagSoup library is
    219     updated to 1.2 version. (dogacan)
    220 
    221 79. NUTCH-613 - Empty summaries and cached pages (kubes via ab)
    222 
    223 80. NUTCH-612 - URL filtering was disabled in Generator when invoked
    224     from Crawl (Susam Pal via ab)
    225 
    226 81. NUTCH-601 - Recrawling on existing crawl directory (Susam Pal via ab)
    227 
    228 82. NUTCH-575 - NPE in OpenSearchServlet (John H. Lee via ab)
    229 
    230 83. NUTCH-126 - Fetching https does not work with a proxy (Fritz Elfert via ab)
    231 
    232 84. NUTCH-615 - Redirected URL-s fetched without setting fetchInterval.
    233     Guard against reprUrl being null. (Emmanuel Joke, ab)
    234 
    235 85. NUTCH-616 - Reset Fetch Retry counter when fetch is successful (Emmanuel
    236     Joke, ab)
    237 
    238 86. NUTCH-220 - Upgrade to PDFBox 0.7.3 (ab)
    239 
    240 87. NUTCH-223 - Crawl.java uses Integer.MAX_VALUE (Jeff Ritchie via ab)
    241 
    242 88. NUTCH-598 - Remove deprecated use of ToolBase. Use generics in Hadoop API.
    243     (Emmanuel Joke, dogacan, ab)
    244 
    245 89. NUTCH-620 - BasicURLNormalizer should collapse runs of slashes with a
    246     single slash. (Mark DeSpain via ab)
    247 
    248 90. NUTCH-500 - Add hadoop masters configuration file into conf folder.
    249     (Emmanuel Joke via kubes)
    250 
    251 91. NUTCH-596 - ParseSegments parse content even if it's not
    252     CrawlDatum.STATUS_FETCH_SUCCESS (dogacan)
    253    
    254 92. NUTCH-618 - Tika error "Media type alias already exists" (mattmann,kubes)
    255 
    256 93. NUTCH-634 - Upgrade Nutch to Hadoop 0.17.1 (Michael Gottesman, Lincoln
    257     Ritter, ab)
    258 
    259 94. NUTCH-641 - IndexSorter incorrectly copies stored fields (ab)
    260 
    261 95. NUTCH-645 - Parse-swf unit test failing (ab)
    262 
    263 96. NUTCH-642 - Unit tests fail when run in non-local mode (ab)
    264 
    265 97. NUTCH-639 - Change LuceneDocumentWrapper visibility from
    266     private to _public_ (Guillaume Smet via dogacan)
    267 
    268 98. NUTCH-651 - Remove bin/{start|stop}-balancer.sh from svn
    269     tracking. (dogacan)
    270 
    271 99. NUTCH-375 - Add support for Content-Encoding: deflated
    272     (Pascal Beis, ab)
    273 
    274 100. NUTCH-633 - ParseSegment no longer allows reparsing.
    275      (dogacan)
    276 
    277 101. NUTCH-653 - Upgrade to hadoop 0.18. (dogacan)
    278 
    279 102. NUTCH-621 - Nutch needs to declare its crypto usage (mattmann)
    280 
    281 103. NUTCH-654 - urlfilter-regex's main does not work.
    282      (dogacan)
    283 
    284 104. NUTCH-640 - confusing description "set it to Integer.MAX_VALUE".
    285      (dogacan)
    286      
    287 105. NUTCH-662 - Upgrade Nutch to use Lucene 2.4. (kubes)
    288 
    289 106. NUTCH-663 - Upgrade Nutch to use Hadoop 0.19 (kubes)
    290 
    291 107. NUTCH-647 - Resolve URLs tool (kubes)
    292 
    293 108. NUTCH-665 - Search Load Testing Tool (kubes)
    294 
    295 109. NUTCH-667 - Input Format for working with Content in Hadoop Streaming
    296                  (kubes)
    297 
    298 110. NUTCH-635 -  LinkAnalysis Tool for Nutch. (kubes)
    299 
    300 111. NUTCH-646 -  New Indexing Framework for Nutch. (kubes)
    301 
    302 112. NUTCH-668 -  Domain URL Filter. (kubes)
    303 
    304 113. NUTCH-594 -  Serve Nutch search results in multiple formats including
    305                   XML and JSON. (kubes)
    306 
    307 114. NUTCH-442 - Integrate Solr/Nutch. (dogacan, original version by siren)
    308 
    309 115. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
    310                  fetch interval correctly. (dogacan)
    311 
    312 116. NUTCH-627 - Minimize host address lookup (Otis Gospodnetic)
    313 
    314 117. NUTCH-678 - Hadoop 0.19 requires an update of jets3t.
    315                  (julien nioche via dogacan)
    316 
    317 118. NUTCH-681 - parse-mp3 compilation problem.
    318                  (Wildan Maulana via dogacan)
    319 
    320 119. NUTCH-676 - MapWritable is written inefficiently and confusingly.
    321                  (dogacan)
    322 
    323 120. NUTCH-579 - Feed plugin only indexes one post per feed due to identical
    324                  digest. (dogacan)
    325 
    326 121. NUTCH-571 - parse-mp3 plugin doesn't always index album of mp3.
    327                  (Joseph Chen, dogacan)
    328 
    329 122. NUTCH-682 - SOLR indexer does not set boost on the document.
    330                  (julien nioche via dogacan)
    331 
    332 123. NUTCH-279 - Additions to urlnormalizer-regex (Stefan Neufeind, ab)
    333 
    334 124. NUTCH-671 - JSP errors in Nutch searcher webapp (Edwin Chu via ab)
    335 
    336 125. NUTCH-643 - ClassCastException in PDF parser (Guillaume Smet, ab)
    337 
    338 126. NUTCH-636 - Httpclient plugin https doesn't work on IBM JRE
    339      (Curtis d'Entremont, ab)
    340 
    341 127. NUTCH-683 - NUTCH-676 broke CrawlDbMerger. (dogacan)
    342 
    343 128. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException
    344      (Stefan Will, siren)
    345      
    346 129. NUTCH-691 - Update jakarta poi jars to the most relevant version
    347      (Dmitry Lihachev via siren)
    348 
    349 130. NUTCH-563 - Include custom fields in BasicQueryFilter
    350      (Julien Nioche via siren)
    351      
    352 131. NUTCH-695 - Incorrect mime type detection by MoreIndexingFilter plugin
    353      (Dmitry Lihachev via siren)
    354      
    355 132. NUTCH-694 - Distributed Search Server fails (siren)
    356 
    357 133. NUTCH-626 - Fetcher2 breaks out the domain with db.ignore.external.links
    358      set at cross domain redirects (Remco Verhoef, dogacan via siren)
    359 
    360 134. NUTCH-247 - Robot parser to restrict (kubes, siren)
    361 
    362 135. NUTCH-698 - CrawlDb is corrupted after a few crawl cycles (dogacan
    363      via siren)
    364      
    365 136. NUTCH-699 - Add an "official" solr schema for solr integration (dogacan,
    366      Dmitry Lihachev via siren)
    367 
    368 137. NUTCH-703 - Upgrade to Hadoop 0.19.1 (ab)
    369 
    370 138. NUTCH-419 - Unavailable robots.txt kills fetch (Carsten Lehmann,
    371      Doug Cook via ab)
    372      
    373 139. NUTCH-700 - Neko1.9.11 goes into a loop (Julien Nioche, siren)
    374 
    375 140. NUTCH-669 - Consolidate code for Fetcher and Fetcher2 (siren)
    376 
    377 141. NUTCH-711 - Indexer failing after upgrade to Hadoop 0.19.1 (ab)
    378 
    379 142. NUTCH-684 - Dedup support for Solr. (dogacan)
    380 
    381 143. NUTCH-715 - Subcollection plugin doesn't work with default
    382      subcollections.xml file (Dmitry Lihachev via siren)
    383      
    384 144. NUTCH-722 - Nutch contains JAI jars that we cannot redistribute
    385 
    386 Release 0.9 - 2007-04-02
    387 
    388  1. Changed log4j configuration to log to stdout on commandline
    389     tools (siren)
    390 
    391  2. NUTCH-344 - Fix for thread blocking issue (Greg Kim via siren)
    392  
    393  3. NUTCH-260 - Update hadoop version to 0.5.0 (Renaud Richardet,
    394     siren)
    395 
    396  4. Optionally skip pages with abnormally large values of Crawl-Delay
    397     (Dennis Kubes via ab)
    398 
    399  5. Change readdb -stats to use CombiningCollector (ab)
    400 
    401  6. NUTCH-348 - Fix Generator to select highest scoring pages (Chris
    402     Schneider and Stefan Groschupf via ab)
    403 
    404  7. NUTCH-347 - Adjust plugin build script not to emit warnings when copying
    405     dependent jars (siren)
    406    
    407  8. NUTCH-338 - Remove the text parser as an option for parsing PDF files
    408     in parse-plugins.xml (Chris A. Mattmann via siren)
    409    
    410  9. NUTCH-105 - Network error during robots.txt fetch causes file to
    411     be ignored (Greg Kim via siren)
    412    
    413 10. NUTCH-367 - DistributedSearch throws ClassCastException (siren)
    414 
    415 11. NUTCH-332 - Fix the problem of doubling scores caused by links pointing
    416     to the current page (e.g. anchors). (Stefan Groschupf via ab)
    417 
    418 12. NUTCH-365 - Flexible URL normalization (ab)
    419 
    420 13. NUTCH-336 - Differentiate between newly discovered pages and newly
    421     injected pages (Chris Schneider via ab) NOTE: this changes the
    422     scoring API, filter implementations need to be updated.
    423 
    424 14. NUTCH-337 - Fetcher ignores the fetcher.parse value (Stefan Groschupf
    425     via ab)
    426 
    427 15. NUTCH-350 - Urls blocked by http.max.delays incorrectly marked as GONE
    428     (Stefan Groschupf via ab)
    429 
    430 16. NUTCH-374 - When http.content.limit is set to -1 and
    431     Response.CONTENT_ENCODING is gzip or x-gzip, nothing can be fetched
    432     (King Kong via pkosiorowski)
    433 
    434 17. NUTCH-383 - upgrade to Hadoop 0.7.1 and Lucene 2.0.0. (ab)
    435 
    436   ****************************** WARNING !!! ********************************
    437   * This upgrade breaks data format compatibility. A tool 'convertdb'       *
    438   * was added to migrate existing CrawlDb-s to the new format. Segment data *
    439   * can be partially migrated using 'mergesegs', however segments will      *
    440   * require re-parsing (and consequently re-indexing).                      *
    441   ****************************** WARNING !!! ********************************
    442 
    443 18. NUTCH-371 - DeleteDuplicates now correctly implements both parts of
    444     the algorithm. (ab)
    445 
    446 19. NUTCH-391 - ParseUtil logs file contents to log file when it cannot
    447     find parser (siren)
    448 
    449 20. NUTCH-379 - ParseUtil does not pass through the content's URL to the
    450     ParserFactory (Chris A. Mattmann via siren)
    451 
    452 21. NUTCH-361, NUTCH-136 - When jobtracker is 'local' generate only one
    453     partition. (ab)
    454 
    455 22. NUTCH-399 - Change CommandRunner to use concurrent api from jdk (siren)
    456 
    457 23. NUTCH-395 - Increase fetching speed (siren)
    458 
    459 24. NUTCH-388 - nutch-default.xml has outdated example for urlfilter.order
    460     (reported by Jared Dunne)
    461 
    462 25. NUTCH-404 - Fix LinkDB Usage - implementation mismatch (siren)
    463 
    464 26. NUTCH-403 - Make URL filtering optional in Generator (siren)
    465 
    466 27. NUTCH-405 - Content object is not properly initialized in map method
    467     of ParseSegment (siren)
    468 
    469 28. NUTCH-362 - Remove parse-text from unsupported filetypes in
    470     parse-plugins.xml (siren)
    471    
    472 29. NUTCH-305 - Update crawl and url filter lists to exclude
    473     jpeg|JPEG|bmp|BMP, suffix-urlfilter.txt (contributed by Stefan
    474     Neufeind) is also updated (siren)
    475    
    476 30. NUTCH-406 - Metadata tries to write null values (mattmann)
    477 
    478 31. NUTCH-415 - Generator should mark selected records in CrawlDb.
    479     Due to increased resource consumption this step is optional.
    480     Application-level locking has been added to prevent concurrent
    481     modification of databases. (ab)
    482 
    483 32. NUTCH-416 - CrawlDatum status and CrawlDbReducer refactoring. It is
    484     now possible to correctly update CrawlDb from multiple segments.
    485     Introduce new status codes for temporary and permanent
    486     redirection. (ab)
    487 
    488 33. NUTCH-322 - Fix Fetcher to store redirected pages and to store
    489     protocol-level status. This also should fix NUTCH-273. (ab)
    490 
    491 34. Change default Fetcher behavior not to follow redirects immediately.
    492     Instead Fetcher will record redirects as new pages to be added to CrawlDb.
    493     This also partially addresses NUTCH-273. (ab)
    494 
    495 35. Detect and report when Generator creates 0-sized segments. (ab)
    496 
    497 36. Fix Injector to preserve already existing CrawlDatum if the seed list
    498     being injected also contains such URL. (ab)
    499 
    500 37. NUTCH-425, NUTCH-426 - Fix anchors pollution. Continue after
    501     skipping bad URLs. (Michael Stack via ab)
    502 
    503 38. NUTCH-325 - UrlFilters.java throws NPE in case urlfilter.order contains
    504     Filters that are not in plugin.includes (Stefan Groschupf, siren)
    505    
    506 39. NUTCH-421 - Allow predeterminate running order of indexing filters
    507     (Alan Tanaman, siren)
    508 
    509 40. When indexing pages with redirection, drop all intermediate pages and
    510     index only the final page. (ab)
    511 
    512 41. Upgrade to Hadoop 0.10.1. (ab)
    513 
    514 42. NUTCH-420 - Fix a bug in DeleteDuplicates where results depended on the
    515     order in which IndexDoc-s are processed. (Dogacan Guney via ab)
    516 
    517 43. NUTCH-428 - NullPointerException thrown when agent name is not
    518     configured properly. Changed to throw RuntimeException instead.
    519     (siren)
    520 
    521 44. NUTCH-430 - Integer overflow in HashComparator.compare (siren)
    522 
    523 45. NUTCH-68 - Add a tool to generate arbitrary fetchlists. (ab)
    524 
    525 46. NUTCH-433 - java.io.EOFException in newer nightlies in mergesegs
    526     or indexing from hadoop.io.DataOutputBuffer (siren)
    527 
    528 47. NUTCH-339 - Fetcher2: a queue-based fetcher implementation. (ab)
    529 
    530 48. NUTCH-390 - Javadoc warnings (mattmann)
    531 
    532 49. NUTCH-449 - Make junit output format configurable. (nigel via cutting)
    533 
    534 50. NUTCH-432 - Fix a bug where platform name with spaces would break the
    535     bin/nutch script. (Brian Whitman via ab)
    536 
    537 51. Upgrade to Hadoop 0.11.2 and Lucene 2.1.0 release. (ab)
    538 
    539 52. NUTCH-167 - Observation of robots "noarchive" directive. (ab)
    540 
    541 53. NUTCH-384 - Protocol-file plugin does not allow the parse plugins
    542     framework to operate properly (Heiko Dietze via mattmann)
    543 
    544 54. NUTCH-233 - Wrong regular expression hangs reduce process forever (Stefan
    545     Groschupf via kubes)
    546    
    547 55. NUTCH-436 - Incorrect handling of relative paths when the embedded URL
    548     path is empty (kubes)
    549 
    550 56. Upgrade to Hadoop 0.12.1 release. (ab)
    551 
    552 57. NUTCH-246 - Incorrect segment size being generated due to time
    553     synchronization issue (Stefan Groschupf via ab)
    554 
    555 58. Upgrade to Hadoop 0.12.2 release. (ab)
    556 
    557 59. NUTCH-333 - SegmentMerger and SegmentReader should use NutchJob. (Michael
    558     Stack and Dogacan Guney via kubes)
    559 
    560 Release 0.8 - 2006-07-25
    561 
    562  0. Totally new architecture, based on hadoop
    563     [http://lucene.apache.org/hadoop] (cutting)
    564 
    565  1. NUTCH-107 - Typo in plugin/urlfilter-*/plugin.xml. (Stephen Cross).
    566 
    567  2. NUTCH-108 - Log hosts that exceed generate.max.per.host.
    568     (Rod Taylor via cutting)
    569 
    570  3. NUTCH-88 - Enhance ParserFactory plugin selection policy
    571     (jerome)
    572 
    573  4. NUTCH-124 - Protocol-httpclient does not follow redirects when
    574     fetching robots.txt (cutting)
    575 
    576  5. NUTCH-130 - Be explicit about target JVM when building (1.4.x?)
    577     (stack@archive.org, cutting)
    578 
    579  6. NUTCH-114 - Getting number of urls and links from crawldb
    580     (Stefan Groschupf via ab)
    581 
    582  7. NUTCH-112 - Link in cached.jsp page to cached content is an
    583     absolute link (Chris A. Mattmann via jerome)
    584 
    585  8. NUTCH-135 - Http header meta data are case insensitive in the
    586     real world (Stefan Groschupf via jerome)
    587 
    588  9. NUTCH-145 - Build of war file fails on Chinese (zh) .xml files due
    589     to UTF-8 BOM (KuroSaka TeruHiko via siren)
    590 
    591 10. NUTCH-121 - SegmentReader for mapred (Rod Taylor via ab)
    592 
    593 11. Added support for OpenSearch (cutting)
    594 
    595 12. NUTCH-142 - NutchConf should use the thread context classloader
    596     (Mike Cannon-Brookes via pkosiorowski)
    597 
    598 13. NUTCH-160 - Use standard Java Regex library rather than
    599     org.apache.oro.text.regex (Rod Taylor via cutting)
    600 
    601 14. NUTCH-151 - CommandRunner can hang after the main thread exec is
    602     finished and has inefficient busy loop (Paul Baclace via cutting)
    603 
    604 15. NUTCH-174 - Problem encountered with ant during compilation
    605 
    606 16. NUTCH-190 - ParseUtil drops reason for failed parse
    607     (stack@archive.org via ab)
    608 
    609 17. NUTCH-169 - Remove static NutchConf (Marko Bauhardt via ab)
    610 
    611 18. NUTCH-194 - Nutch-169 introduced two tiny bugs (Marko Bauhardt via ab)
    612 
    613 19. NUTCH-178 - in search.jsp must be session creation "false"
    614     (YourSoft via siren)
    615 
    616 20. NUTCH-200 - OpenSearch Servlet is broken
    617     (Marko Bauhardt via siren)
    618 
    619 21. NUTCH-81 - Webapp only works when deployed in root
    620     (AJ Banck, Michael Nebel via siren)
    621 
    622 22. NUTCH-139 - Standard metadata property names in the ParseData
    623     metadata (Chris A. Mattmann, jerome)
    624 
    625 23. NUTCH-192 - Meta data support for CrawlDatum
    626     (Stefan Groschupf via ab)
    627    
    628 24. NUTCH-52 - Parser plugin for MS Excel files
    629     (Rohit Kulkarni via jerome)
    630 
    631 25. NUTCH-53 -  Parser plugin for Zip files
    632     (Rohit Kulkarni via jerome)
    633 
    634 26. NUTCH-137 - footer is not displayed in search result page
    635     (KuroSaka TeruHiko via siren)
    636 
    637 27. NUTCH-118 - FAQ link points to invalid URL
    638     (Steve Betts via siren)
    639 
    640 28. NUTCH-184 - Serbian (sr, Cyrillic) and Serbo-Croatian (sh, Latin)
    641     translation (Ivan Sekulovic via siren)
    642 
    643 29. NUTCH-211 - FetchedSegments leave readers open (Stefan Groschupf
    644     via cutting)
    645 
    646 30. NUTCH-140 - Add alias capability in parse-plugins.xml file that
    647     allows mimeType->extensionId mapping (Chris A. Mattmann via jerome)
    648 
    649 31. NUTCH-214 - Added Links to web site to search mailing list
    650     (Jake Vanderdray via jerome)
    651 
    652 32. NUTCH-204 - Multiple field values in HitDetails
    653     (Stefan Groschupf via jerome)
    654 
    655 33. NUTCH-219 - file.content.limit & ftp.content.limit should be changed
    656     to -1 to be consistent with http (jerome)
    657    
    658 34. NUTCH-221 - Prepare nutch for upcoming lucene 2.0 (siren)
    659 
    660 35. NUTCH-91 - Empty encoding causes exception (Michael Nebel via
    661     pkosiorowski)
    662 
    663 36. NUTCH-228 - Clustering plugin descriptor broken (Dawid Weiss via
    664     jerome)
    665 
    666 37. NUTCH-229 - Improved handling of plugin folder configuration
    667     (Stefan Groschupf via ab)
    668 
    669 38. NUTCH-206 - Search server throws InstantiationException (ab)
    670    
    671 39. NUTCH-203 - ParseSegment throws InstantiationException (Marko Bauhardt
    672     via ab)
    673 
    674 40. NUTCH-3 - Multi values of header discarded (Stefan Groschupf via ab)
    675 
    676 41. Update to lucene 1.9.1 (cutting)
    677 
    678 42. NUTCH-235 - Duplicate Inlink values (ab)
    679 
    680 43. NUTCH-234 - Clustering extension code cleanups and a real
    681     JUnit test case for the current implementation (Dawid Weiss via ab)
    682    
    683 44. NUTCH-210 - Context.xml file for Nutch web application
    684     (Chris A. Mattmann via jerome)
    685 
    686 45. NUTCH-231 - Invalid CSS entries (AJ Banck via jerome)
    687 
    688 46. NUTCH-232 - Search.jsp has multiple search forms creating
    689     invalid html / incorrect focus function (jerome)
    690    
    691 47. NUTCH-196 - lib-xml and lib-log4j plugins (ab, jerome)
    692 
    693 48. NUTCH-244 - Inconsistent handling of property values
    694     boundaries / unable to set db.max.outlinks.per.page to
    695     infinite (jerome)
    696    
    697 49. NUTCH-245 - DTD for plugin.xml configuration files
    698     (Chris A. Mattmann via jerome)
    699 
    700 50. NUTCH-250 - Generate to log truncation caused by
    701     generate.max.per.host (Rod Taylor via cutting)
    702    
    703 51. NUTCH-125 - OpenOffice Parser plugin (ab)
    704 
    705 52. Switch from using java.io.File to org.apache.hadoop.fs.Path.
    706     (cutting)
    707 
    708 53. NUTCH-240 - Scoring API: extension point, scoring filters and
    709     an OPIC plugin (ab)
    710    
    711 54. NUTCH-134 - Summarizer doesn't select the best snippets (jerome)
    712 
    713 55. NUTCH-268 - Generator and lib-http use different definitions of
    714     "unique host" (ab)
    715    
    716 56. NUTCH-280 - Url query causes NullPointerException (Grant Glouser
    717     via siren)
    718 
    719 57. NUTCH-285 - LinkDb Fails rename doesn't create parent directories
    720     (Dennis Kubes via ab)
    721 
    722 58. NUTCH-201 - Add support for subcollections
    723     (siren)
    724 
    725 59. NUTCH-298 - If a 404 for a robots.txt is returned a NPE is thrown
    726     (Stefan Groschupf via jerome)
    727 
    728 60. NUTCH-275 - Fetcher not parsing XHTML-pages at all (jerome)
    729 
    730 61. NUTCH-301 - CommonGrams loads analysis.common.terms.file for each query
    731     (Stefan Groschupf via jerome)
    732 
    733 62. NUTCH-110 - OpenSearchServlet outputs illegal xml characters
    734     (stack@archive.org via siren)
    735 
    736 63. NUTCH-292 - OpenSearchServlet: OutOfMemoryError: Java heap space
    737     (Stefan Neufeind via siren)
    738 
    739 64. NUTCH-307 - Wrong configured log4j.properties (jerome)
    740 
    741 65. NUTCH-303 - Logging improvements (jerome)
    742 
    743 66. NUTCH-308 - Maximum search time limit (ab)
    744 
    745 67. NUTCH-306 - DistributedSearch.Client liveAddresses concurrency
    746     problem (Grant Glouser via siren)
    747 
    748 68. Update to hadoop-0.4 (Milind Bhandarkar, cutting)
    749 
    750 69. NUTCH-317 - Clarify what the queryLanguage argument of
    751     Query.parse(...) means (jerome)
    752 
    753 70. Added alternative experimental web gui in contrib containing
    754     extensions like subcollection, keymatch, user preferences,
    755     caching, implemented mainly using tiles and jstl (siren)
    756 
    757 71. NUTCH-320 DmozParser does not output list of urls to stdout
    758     but to a log file instead. Original functionality restored.
    759 
    760 72. NUTCH-271 - Add ability to limit crawling to the set of initially
    761     injected hosts (db.ignore.external.links) (Philippe Eugene,
    762     Stefan Neufeind via ab)
    763 
    764 73. NUTCH-293 - Support for Crawl-Delay (Stefan Groschupf via ab)
    765 
    766 74. NUTCH-327 - Fixed logging directory on cygwin (siren)
    767 
    768 Release 0.7 - 2005-08-17
    769 
    770  1. Added support for "type:" in queries. Search results are limited/qualified
    771     by mimetype or its primary type or sub type. For example,
    772     (1) searching with "type:application/pdf" restricts results
    773     to pages which were identified to be of mimetype "application/pdf".
    774     (2) with "type:application", nutch will return pages of
    775     primary type "application".
    776     (3) with "type:pdf", only pages of sub type "pdf" will be listed.
    777     (John Xing, 20050120)
    778 
    779  2. Added support for "date:" in queries. Last-Modified is indexed.
    780     Search results are restricted by lower and upper date (inclusive)
    781     as date:yyyymmdd-yyyymmdd. For example, date:20040101-20041231
    782     only returns pages with Last-Modified in year 2004.
    783     (John Xing, 20050122)
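
A minimal sketch of how a `date:yyyymmdd-yyyymmdd` term could be interpreted, assuming inclusive bounds as described above. The function names `parse_date_range` and `in_range` are hypothetical and are not taken from the Nutch source.

```python
from datetime import date, datetime


def parse_date_range(term):
    """Parse a term such as 'date:20040101-20041231' into (lower, upper) dates."""
    prefix, _, value = term.partition(":")
    if prefix != "date":
        raise ValueError("not a date: term: %r" % term)
    lower_text, sep, upper_text = value.partition("-")
    if not sep:
        raise ValueError("expected yyyymmdd-yyyymmdd, got %r" % value)
    lower = datetime.strptime(lower_text, "%Y%m%d").date()
    upper = datetime.strptime(upper_text, "%Y%m%d").date()
    if lower > upper:
        raise ValueError("lower bound is after upper bound")
    return lower, upper


def in_range(last_modified, lower, upper):
    """Both bounds are inclusive, matching the change log description."""
    return lower <= last_modified <= upper


lower, upper = parse_date_range("date:20040101-20041231")
print(in_range(date(2004, 12, 31), lower, upper))
print(in_range(date(2005, 1, 1), lower, upper))
```
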
    784 
    785  3. Add URLFilter plugin interface and convert existing url filters into
    786     plugins. (John Xing, 20050206)
    787 
    788  4. Add UpdateSegmentsFromDb tool, which updates the scores and
    789     anchors of existing segments with the current values in the web
    790     db.  This is used by CrawlTool, so that pages are now only fetched
    791     once per crawl.  (Doug Cutting, 20050221)
    792 
    793  5. Moved code into org.apache.nutch sub-packages.  Changed license to
    794     Apache 2.0.  Removed jar files whose licenses do not permit
    795     redistribution by Apache.  Disabled compilation of plugins which
    796     require these libraries.  (Doug Cutting 20050301)
    797 
    798  6. Index host and title in separate fields.  Host was indexed
    799     previously only as a part of the URL.  Title was indexed as an
    800     anchor.  Now boosts for matching these fields may be adjusted
    801     separately from boosts for matching anchors and url.  Also: move
    802     site indexing to index-basic plugin to minimize the number of
    803     times the URL needs to be parsed; and, stop using anchor analyzer
    804     for anything but anchors.  (Piotr Kosiorowski via Doug Cutting
    805     20050323)
    806 
    807  7. Add servlet Cached.java that serves cached Content of any mime type.
    808     Slightly modified are web.xml and cached.jsp.
    809     (John Xing, 20050401)
    810 
    811  8. Add skipCompressedByteArray() to WritableUtils.java.
    812     (John Xing, 20050402)
    813 
    814  9. Fixes to jsp and static web pages.  These now use relative links,
    815     so that the Nutch webapp file can be used in places other than at
    816     the root.  Also fixed links to the about and help pages.  Bug #32.
    817     (Jerome Charron via cutting, 20050404)
    818 
    819 10. Added some features to DistributedSearch: new segments can be added
    820     to searchservers without restarting the frontend, defective search
    821     servers are not queried until they come back online, watchdog keeps
    822     an eye for your searchservers and writes simple statistics.
    823     (Sami Siren, 20050407)
    824    
    825 11. Fix for bug #4 - Unbalanced quote in query eats all resources.
    826   (Piotr Kosiorowski, Sami Siren, 20050407)
    827 
    828 12. Close Issue #33 - MIME content type detector (using magic char sequences).
    829     (Jerome Charron and Hari Kodungallur via John Xing, 20050416)
    830 
    831 13. Add a servlet that implements A9's OpenSearch RSS web service.
    832     (cutting, 20050418)
    833 
    834 14. Remove references to link analysis from tutorial, and enable
    835     scoring by link count when generating fetchlists and searching.
    836     (cutting, 20040419)
    837 
    838 15. Make query boosts for host, title, anchor and phrase matches
    839     configurable.  (Piotr Kosiorowski via cutting, 20050419)
    840 
    841 16. Add support for sorting search results and search-time deduping by
    842     fields other than site.
    843 
    844 17. Automatically convert range queries into cached range filters.
    845     This improves the performance and scalability of, e.g., date range
    846     searching.
    847 
    848 18. Several methods have been renamed due to misspellings.  The old
    849     methods have been deprecated and will be removed before the 1.0
    850     release.
    851 
    852 
    853 Release 0.6
    854 
    855  1. Added clustering-carrot2 plugin, together with introduction of clustering
    856     api and modification to search jsp. (Dawid Weiss via John Xing, 20040809)
    857 
    858  2. Make a number of changes to NDFS (Nutch Distributed File System)
    859     to fix bugs, add admin tools, etc.
    860 
    861     Also, modify all command line tools so you can indicate whether to
    862     use NDFS or the local filesystem.  If you indicate nothing, then
    863     it defaults to the local fs.
    864 
    865     I've used this to do a 35m page crawl via NDFS, distributed over a
    866     dozen machines.  (Mike Cafarella)
    867 
    868  3. Add support for BASE tags in HTML.  Outlinks are now correctly
    869     extracted when a BASE tag is present.  (cutting)
    870 
    871  4. Fix two bugs in result pagination.  When the last hit on a page
    872     was the last hit overall, the "next" button was sometimes shown
    873     when the "show all" button should be shown instead.  Also, in
    874     certain cases, the "show all" button would be shown when the
    875     "next" button should have been shown.  (cutting)
    876 
    877  5. Add config parameter "indexer.max.tokens" that determines the
    878     maximum number of tokens indexed per field.  (Andy Hedges via cutting)
    879 
    880  6. Add parser for mp3 files.  (Andy Hedges via cutting)
    881 
    882  7. Add RegexUrlNormalizer.  This is useful for things like stripping
    883     out session IDs from URLs.  To use it, add values for
    884     urlnormalizer.class and urlnormalizer.regex.file to your
    885     nutch-site.xml.  The RegexUrlNormalizer class extends the
    886     BasicUrlNormalizer, and does basic normalization as well.
    887     (Luke Baker via cutting)
    888 
    889  8. Added Swedish translation (Stefan Verzel via Sami Siren, 20040910)
    890 
    891  9. Added Polish translation (Andrzej Bialecki, 20040911)
    892  
    893 10. Added 3 more language profiles to language identifier (ru,hu,pl).
    894   Other changes to language identifier: Profiles converted to utf8,
    895   added some test cases, changed the similarity calculation.
    896   (Sami Siren, 20040925)
    897 
    898 11. Added plugin parse-rtf (Andy Hedges via John Xing, 20040929)
    899 
    900 12. Added plugin index-more and more.jsp (John Xing, 20041003)
    901 
    902 13. Added "View as Plain Text" feature. A new op OP_PARSETEXT is introduced
    903     in DistributedSearch.java. text.jsp is added. (John Xing, 20041006)
    904 
    905 14. Fixed a bug that fails cached.jsp, explain.jsp, anchors.jsp and text.jsp
    906     (but not search.jsp) with NullPointerException in distributed search.
    907     It seems that this bug appears after "hits per site" stuff is added.
    908     The fix is done in Hit.java, making sure String site is never null.
    909     Hopefully this fix has no bad effect on the "hits per site" code.
    910     (John Xing, 20041006)
    911 
    912 15. Fixed a bug that fails fullyDelete() in FileUtil.java for
    913     LocalFileSystem.java. This bug also exposes possible incompleteness
    914     of NDFSFile.java, where a few methods are not supported, including
    915     delete(). Nothing changed in NDFSFile.java though. Leave it for future
    916     improvement (John Xing, 20041022).
    917 
    918 16. Introduced option -noParsing to Fetcher.java and added ParseSegment.java.
    919     A new status code CANT_PARSE is added to FetcherOutput.java.
    920     Without option -noParsing , no change in fetcher behavior. With
    921     option -noParsing, fetcher only crawls; no parsing is carried out.
    922     Then, ParseSegment.java should be used to parse in separate pass.
    923     (John Xing, 20041025)
    924 
    925 17. Added ontology plugin. Currently it is used for query refinement, as
    926     exemplified in refine-query-init.jsp and refine-query.jsp. By default,
    927     query refinement is disabled in search.jsp. Please check
    928     ./src/plugin/ontology/README.txt for further description.
    929     Ontology plugin certainly can be used for many other things.
    930     (Michael J. Pan via John Xing, 20041129)
    931  
    932 18. Changed fetcher.server.delay to be a float, so that sub-second
    933     delays can be specified.  (cutting)
    934 
    935 19. Added plugin.includes config parameter that determines which
    936     plugins are included.  By default now only http, html and basic
    937     indexing and search plugins are enabled, rather than all plugins.
    938     This should make default performance more predictable and reliable
    939     going forward. (cutting)
    940 
    941 20. Cleaned up some filesystem code, including:
    942 
    943     - Replaced BufferedRandomAccessFile with two simpler utilities,
    944       NFSDataInputStream and NFSDataOutputStream.
    945 
    946     - Fixed the bug where SequenceFiles were no longer flushed when
    947       created, so that, when fetches crashed, segments were
    948       unreadable.  Now segments are always readable after crashes.
    949       Only the contents of the last buffer is lost.
    950 
    951     - Simplified the FSOutputStream API to not include seek().  We
    952       should never need that functionality.
    953 
    954     - Simplified LocalFileSystem's implementations of FSInputStream
    955       and FSOutputStream and optimized FSInputStream.seek().
    956 
    957     (cutting)
    958 
    959 21. Fixed BasicUrlNormalizer to better handle relative urls.  The file
    960     part of a URL is normalized in the following manner:
    961 
    962       1. "/aa/../" will be replaced by "/" This is done step by step until
    963    the url doesn't change anymore. This ensures that
    964    "/aa/bb/../../" will be replaced by "/", too
    965 
    966       2. leading "/../" will be replaced by "/"
    967 
    968     (Sven Wende via cutting)
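
The two rules above can be sketched as follows. This is an illustration only, not the BasicUrlNormalizer code: the helper name `normalize_path` is hypothetical, and it assumes that rule 2 is applied repeatedly until no leading "/../" remains.

```python
import re

# Matches "/segment/../" where the segment is not itself "..".
_PARENT_REF = re.compile(r"/(?!\.\./)[^/]+/\.\./")


def normalize_path(path):
    # Rule 1: replace "/aa/../" by "/", one step at a time,
    # until the path no longer changes.
    while True:
        collapsed = _PARENT_REF.sub("/", path, count=1)
        if collapsed == path:
            break
        path = collapsed
    # Rule 2: replace a leading "/../" by "/" (assumed to repeat).
    while path.startswith("/../"):
        path = path[3:]
    return path


for example in ("/aa/../", "/aa/bb/../../", "/../index.html"):
    print(example, "->", normalize_path(example))
```
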
    969 
    970 22. Fix Page constructors so that next fetch date is less likely to be
    971     misconstrued as a float.  This patches a problem in WebDBInjector,
    972     where new pages were added to the db with nextScore set to the
    973     intended nextFetch date.  This, in turn, confused link analysis.
    974 
    975 23. In ndfs code, replace addLocalFile(), putToLocalFile() with
    976     copyFromLocalFile(), moveFromLocalFile(), copyToLocalFile() and
    977     moveToLocalFile(). (John Xing, 20041217)
    978 
    979 24. Added new config parameter fetcher.threads.per.host.  This is used
    980     by the Http protocol.  When this is one, behavior is as before.
    981     When this is greater than one then multiple threads are permitted
    982     to access a host at once.  Note that fetcher.server.delay is no
    983     longer consistently observed when this is greater than one.
    984     (Luke Baker via Doug Cutting)
    985 
    986 Release 0.5
    987 
    988  1. Changed plugin directory to be a list of directories.
    989 
    990  2. Permit Plugin to be the default plugin implementation.
    991 
    992  3. Added pluggable interface for network protocols in new package
    993     net.nutch.protocol.  Moved http code from core into a plugin.
    994 
    995  4. Added pluggable interface for content parsing in new package
    996     net.nutch.parse.  Moved html parsing code from core into a
    997     plugin.
    998 
    999  5. Fixed a bug in NutchAnalysis where 16-bit characters were not
    1000     processed correctly.
    1001 
    1002  6. Fixed bug #971731: random summaries on result page.
    1003     (Daniel Naber via cutting)
    1004 
    1005  7. Made Nutch logo transparent. (Daniel Naber via cutting)
    1006 
    1007  8. Added file protocol plugin.  (John Xing via cutting)
    1008 
    1009  9. Added ftp protocol plugin.  (John Xing via cutting)
    1010 
    1011 10. Added pdf and msword parser plugins.  (John Xing via cutting)
    1012 
    1013 11. Added pluggable indexing interface.  By default, url, content,
    1014     anchors and title are indexed, as before, but now one can easily
    1015     alter this to, e.g., index metadata.  A demonstration is provided
    1016     which extracts and indexes Creative Commons license urls. (cutting)
    1017 
    1018 12. Add language identification plugin.
    1019 
    1020     The process of identification is as follows:
    1021 
    1022     1. html (html only, HTML 4.0 "lang" attribute)
    1023     2. meta tags (html only, http-equiv, dc.language)
    1024     3. http header (Content-Language)
    1025     4. if all above fail "statistical analysis"
    1026 
    1027     1 & 2 are run during the fetching phase and 3 & 4 are run on
    1028     indexing phase.
    1029 
    1030     Currently supported languages (in "statistical analysis") are
    1031     da,de,el,en,es,fi,fr,it,nl,sv and pt. The corpus used was grabbed
    1032     from http://www.isi.edu/~koehn/europarl/ and the profiles were
    1033     built with tool supplied in patch.
    1034 
    1035     After indexing the language can be found from field named "lang"
    1036 
    1037     It's not 100% accurate but it's a start.
    1038     (Sami Siren)
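
The fallback order listed above can be sketched as a simple chain. This is a simplified illustration: it ignores the split between the fetching and indexing phases, and the name `identify_language` and the `guess_statistically` callback are hypothetical stand-ins, not part of the plugin.

```python
def identify_language(html_lang, meta_lang, header_lang, text, guess_statistically):
    """Return the first available language hint, else fall back to statistics."""
    # Order follows the change log: html "lang" attribute, meta tags,
    # then the Content-Language http header.
    for candidate in (html_lang, meta_lang, header_lang):
        if candidate and candidate.strip():
            return candidate.strip().lower()
    # Step 4: all explicit hints failed, so use statistical analysis.
    return guess_statistically(text)


# Usage with a dummy statistical guesser that always answers "fi".
print(identify_language(None, "", "EN", "some text", lambda t: "fi"))
print(identify_language(None, None, None, "some text", lambda t: "fi"))
```
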
    1039 
    1040 13. Added SegmentMergeTool and "mergesegs" command, to remove
    1041     duplicated or otherwise not used content from several segments and
    1042     joining them together into a single new segment.  The tool also
    1043     optionally performs several other steps required for proper
    1044     operation of Nutch - such as indexing segments, deleting
    1045     duplicates, merging indices, and indexing the new single segment.
    1046     (Andrzej Bialecki)
    1047 
    1048 14. Add the ability to retrieve ParseData of a search hit. ParseData
    1049     contains many valuable properties of a search hit.
    1050 
    1051     This is required (among others) to properly display the cached
    1052     content because it's not possible to determine the character
    1053     encoding from the output of the getContent() method (which returns
    1054     byte[]). The symptoms are that for HTML pages using non-latin1 or
    1055     non-UTF8 encodings the cached preview will almost certainly look
    1056     broken. Using the attached patch it is possible to determine the
    1057     character encoding from the ParseData (for HTTP: Content-Type
    1058     metadata), and encode the content accordingly. (Andrzej Bialecki)
    1059 
    1060 15. Add a pluggable query interface.  By default, the content, anchor
    1061     and url fields are searched as before.  A sample plugin indexes
    1062     the host name and adds a "site:" keyword to query parsing.
    1063 
    1064 16. Added support for "lang:" in queries.  For example, searching with
    1065     "lang:en" restricts results to pages which were identified to
    1066     be in English.
    1067 
    1068 17. Automatically optimize field queries to use cached Lucene filters.
    1069     This makes, for example, searches restricted by languages or sites
    1070     that are very common much faster.
    1071 
    1072 18. Improved charset handling in jsp pages.  (jshin by cutting)
    1073 
    1074 19. Permit topic filtering when injecting DMOZ pages.  (jshin by cutting)
    1075 
    1076 20. When parsing crawled pages, interpret charset specifications in
    1077     html meta tags.  (jshin by cutting)
    1078 
    1079 21. Added support for "cc:licensed" in queries, which searches for documents
    1080     released under Creative Commons licenses.  Attributes of the
    1081     license may also be queried, with, e.g., "cc:by" for
    1082     attribution-required licenses, "cc:nc" for non-commercial
    1083     licenses, etc.
    1084 
    1085 22. Relative paths named in plugin.folders are now searched for on the
    1086     classpath.  This makes, e.g., deployment in a war file much simpler.
    1087 
    1088 23. Modifications to Fetcher.java.
    1089 
    1090     1. Make sure it works properly with regard to creation and initialization
    1091     of plugin instances. The problem was that multiple threads race to
    1092     startUp() or shutDown() plugin instances. It was solved by synchronizing
    1093     certain codes in PluginRepository.java and Extension.java.
    1094     (Stefan Groschupf via John Xing)
    1095 
    1096     2. Added code to explicitly shutDown() plugins. Otherwise FetcherThreads
    1097     may never return (quit) if there are still data or other structures
    1098     (e.g., persistent socket connections) associated with plugins. (John Xing)
    1099    
    1100     3. Fixed one type of Fetcher "hang" problems by monitoring named
    1101     FetcherThreads. If all FetcherThreads are gone (finished),
    1102     Fetcher.java is considered done. The problem was: there could be
    1103     runaway threads started by external libs via FetcherThreads.
    1104     Those threads never return, thus keep Fetcher from exiting normally.
    1105     (John Xing)
    1106 
    1107 24. Eliminate excessive hits from sites.  This is done efficiently by
    1108     adding the site name to Hit instances, and, when needed,
    1109     re-querying with too-frequent sites prohibited in the query.
    1110 
    1111 
    1112 Release 0.4
    1113 
    1114  1. Http class refactored.  (Kevin Smith via Tom Pierce)
    1115 
    1116  2. Add Finnish translation. (Sampo Syreeni via Doug Cutting)
    1117 
    1118  3. Added Japanese translation. (Yukio Andoh via Doug Cutting)
    1119 
    1120  4. Updated Dutch translation. (Ype Kingma via Doug Cutting)
    1121 
    1122  5. Initial version of Distributed DB code.  (Mike Cafarella)
    1123 
    1124  6. Make things more tolerant of crashed fetcher output files.
    1125     (Doug Cutting)
    1126 
    1127  7. New skin for website. (Frank Henze via Doug Cutting)
    1128 
    1129  8. Added Spanish translation. (Diego Basch via Doug Cutting)
    1130 
    1131  9. Add FTP support to fetcher.  (John Xing via Doug Cutting)
    1132 
    1133 10. Added Thai translation. (Pichai Ongvasith via Doug Cutting)
    1134 
    1135 11. Added Robots.txt & throttling support to Fetcher.java.  (Mike
    1136     Cafarella)
    1137 
    1138 12. Added nightly build. (Doug Cutting)
    1139 
    1140 13. Default all link scores to 1.0. (Doug Cutting)
    1141 
    1142 14. Permit one to keep internal links. (Doug Cutting)
    1143 
    1144 15. Fixed dedup to select shortest URL. (Doug Cutting)
    1145 
    1146 16. Changed index merger so that merged index is written to named
    1147     directory, rather than to a generated name in that directory.
    1148     (Doug Cutting)
    1149 
    1150 17. Disable coordination weighting of query clauses and other minor
    1151     scoring improvements. (Doug Cutting)
    1152 
    1153 18. Added a new command, crawl, that constructs a database, injects a
    1154     url file and performs a few rounds of generate/fetch/updatedb.
    1155     This simplifies use for intranet sites.  Changed some defaults to
    1156     be more intranet friendly.  (Doug Cutting)
    1157 
    1158 19. Fixed a bug where Fetcher.java didn't construct correct relative
    1159     links when a page was redirected.  (Doug Cutting)
    1160 
    1161 20. Fixed a query parser problem with lookahead over plusses and minuses.
    1162     (Doug Cutting)
    1163 
    1164 21. Add support for HTTP proxy servers.  (Sami Siren via Doug Cutting)
    1165 
    1166 22. Permit searching while fetching and/or indexing.
    1167     (Sami Siren via Doug Cutting)
    1168 
    1169 23. Fix a bug when throttling is disabled.  (Sami Siren via Doug Cutting)
    1170 
    1171 24. Updated Bahasa Malaysia translation.  (Michael Lim via Doug Cutting)
    1172 
    1173 25. Added Catalan translation.  (Xavier Guardiola via Doug Cutting)
    1174 
    1175 26. Added Brazilian Portuguese translation.
    1176     (A. Moreir via Doug Cutting)
    1177 
    1178 27. Added a French translation.  (Julien Nioche via Doug Cutting)
    1179 
    1180 28. Updated to Lucene 1.4RC3.  (Doug Cutting)
    1181 
    1182 29. Add capability to boost by link count & use it in crawl tool.
    1183     (Doug Cutting)
    1184 
    1185 30. Added plugin system.  (Stefan Groschupf via Doug Cutting)
    1186 
    1187 31. Add this change log file, for recording significant changes to
    1188     Nutch.  Populate it with changes from the last few months.
  • nutchez-0.1/debian/control

    r66 r73  
    1   Source: nutch
      1 Source: nutchez
    2 2 Section:devel
    3 3 Priority: extra
  • nutchez-0.1/debian/files

    r66 r73  
    1 nutch_1.0-1_i386.deb devel extra
     1nutchez_0.1-1_i386.deb devel extra
  • nutchez-0.1/debian/nutchez.install

    r67 r73  
    6  6 tomcat    opt/nutch
    7  7 plugins   opt/nutch
    8    urls    opt/nutch
    9  8 *.jar   opt/nutch
    10 9 *.job   opt/nutch