Changeset 73
- Timestamp: Jun 2, 2009, 5:39:43 PM
- Location: nutchez-0.1
- Files: 4 edited
nutchez-0.1/CHANGES.txt
r66, line 1 (removed): Nutch Change Log
r73, line 1 (added):   nutchez Change Log

Release 1.0 - 2009-03-23

1. NUTCH-474 - Fetcher2 crawlDelay and blocking fix (Dogacan Guney via ab)

2. NUTCH-443 - Allow parsers to return multiple Parse objects.
    (Dogacan Guney et al, via ab)

3. NUTCH-393 - Indexer should handle null documents returned by filters.
    (Eelco Lempsink via ab)

4. NUTCH-456 - Parse msexcel plugin speedup (Heiko Dietze via siren)

5. NUTCH-446 - RobotRulesParser should ignore Crawl-delay values of other
    bots in robots.txt (Dogacan Guney via siren)

6. NUTCH-482 - Remove redundant plugin lib-log4j (siren)

7. NUTCH-483 - Remove redundant commons-logging jar from ontology plugin
    (siren)

8. NUTCH-161 - Change Plain text parser to
    use parser.character.encoding.default property for fall back encoding
    (KuroSaka TeruHiko, siren)

9. NUTCH-61 - Support for adaptive re-fetch interval and detection of
    unmodified content. (ab)

10. NUTCH-392 - OutputFormat implementations should pass on Progressable.
    (cutting via ab)

11. NUTCH-495 - Unnecessary delays in Fetcher2 (dogacan)

12. NUTCH-443 - Allow parsers to return multiple Parse objects; this will speed
    up the rss parser (dogacan via mattmann). This update is a fix and semantics
    change from the original patch for NUTCH-443. The original patch did not tell
    the Indexer to read crawl_parse too so that it can pick up sub-urls' fetch
    datums. This patch addresses that issue. Now, if Fetcher gets a null content,
    instead of pushing an empty content, it filters the null content.

13. NUTCH-485 - Change HtmlParseFilters to return ParseResult object instead of
    Parse object. (Gal Nitzan via dogacan)

14. NUTCH-489 - URLFilter-suffix management of the url path when the url contains
    some query parameters. (Emmanuel Joke via dogacan)

15. NUTCH-502 - Bug in SegmentReader causes infinite loop.
    (Ilya Vishnevsky via dogacan)

16. NUTCH-444 - Possibly use a different library to parse RSS feed for improved
    performance and compatibility. This patch introduced a new plugin, feed,
    that includes an index filter and a parse plugin for feeds that uses ROME.
    There was discussion to remove parse-rss, in light of the feed plugin,
    however, this patch does not explicitly remove parse-rss. (dogacan, mattmann)

17. NUTCH-471 - Fix synchronization in NutchBean creation.
    (Enis Soztutar via dogacan)

18. Upgrade to Lucene 2.2.0 and Hadoop 0.12.3. (ab)

19. NUTCH-468 - Scoring filter should distribute score to all outlinks at
    once. (dogacan)

20. NUTCH-504 - NUTCH-443 broke parsing during fetching. (dogacan)

21. NUTCH-497 - Extreme Nested Tags causes StackOverflowException in
    DomContentUtils...Spider Trap. (kubes)

22. NUTCH-434 - Replace usage of ObjectWritable with something based on
    GenericWritable. (dogacan)

23. NUTCH-499 - Refactor LinkDb and LinkDbMerger to reuse code. (dogacan)

24. NUTCH-498 - Use Combiner in LinkDb to increase speed of linkdb generation.
    (Espen Amble Kolstad via dogacan)

25. NUTCH-507 - lib-lucene-analyzers jar definition is wrong in plugin.xml.
    (Emmanuel Joke via dogacan)

26. NUTCH-503 - Generator exits incorrectly for small fetchlists.
    (Vishal Shah via dogacan)

27. NUTCH-505 - Outlink urls should be validated. (dogacan)

28. NUTCH-510 - IndexMerger delete working dir. (Enis Soztutar via dogacan)

29. NUTCH-513 - suffix-urlfilter.txt does not have a template. (dogacan)

30. NUTCH-515 - Next fetch time is set incorrectly. (dogacan)

30. NUTCH-506 - Nutch should delegate compression to Hadoop. (dogacan)

31. NUTCH-517 - build encoding should be UTF-8. (Enis Soztutar via dogacan).

32. NUTCH-518 - Fix OpicScoringFilter to respect scoring filter chaining.
    (Enis Soztutar via dogacan)

33. NUTCH-516 - Next fetch time is not set when it is a
    CrawlDatum.STATUS_FETCH_GONE. (Emmanuel Joke via dogacan)

34. NUTCH-525 - DeleteDuplicates generates ArrayIndexOutOfBoundsException
    when trying to rerun dedup on a segment. (Vishal Shah via dogacan)

35. NUTCH-514 - Indexer should only index pages with fetch status SUCCESS.
    (dogacan) Note: There is a bigger problem, i.e. how to deal
    with redirected pages, and this issue can be considered as a band-aid
    for the time being. See NUTCH-273 and NUTCH-353 for more details.

36. NUTCH-533 - LinkDbMerger: url normalized is not updated in the key and
    inlinks list. (Emmanuel Joke via dogacan)

37. NUTCH-535 - ParseData's contentMeta accumulates unnecessary values during
    parse. (dogacan)

38. NUTCH-522 - Use URLValidator in the Injector. (Emmanuel Joke, dogacan)

39. NUTCH-536 - Reduce number of warnings in nutch core. (dogacan)

40. NUTCH-439 - Top Level Domains Indexing / Scoring. Also adds
    domain-related utilities. (Enis Soztutar via dogacan)

41. NUTCH-544 - Upgrade Carrot2 clustering plugin to the newest stable
    release (2.1). (Dawid Weiss via dogacan)

42. NUTCH-545 - Configuration and OnlineClusterer get initialized in every
    request. (Dawid Weiss via dogacan)

43. NUTCH-532 - CrawlDbMerger: wrong computation of last fetch time.
    (Emmanuel Joke via dogacan)

44. NUTCH-550 - Parse fails if db.max.outlinks.per.page is -1. (dogacan)

45. NUTCH-546 - file URL are filtered out by the crawler. (dogacan)

46. NUTCH-554 - Generator throws IOException on invalid urls.
    (Brian Whitman via ab)

47. NUTCH-529 - NodeWalker.skipChildren doesn't work for more than 1 child.
    (Emmanuel Joke via dogacan)

48. NUTCH-25 - needs 'character encoding' detector.
    (Doug Cook, dogacan, Marcin Okraszewski, Renaud Richardet via dogacan)

49. NUTCH-508 - ${hadoop.log.dir} and ${hadoop.log.file} are not propagated
    to the tasktracker. (Mathijs Homminga, Emmanuel Joke via dogacan)

50. NUTCH-562 - Port mime type framework to use Tika mime detection framework.
    (mattmann)

51. NUTCH-488 - Avoid parsing unnecessary links and get a more relevant outlink
    list. (Emmanuel Joke, Marcin Okraszewski via kubes)

52. NUTCH-501 - Implement a different caching mechanism for objects cached in
    configuration. (dogacan)

53. NUTCH-552 - Upgrade Nutch to Hadoop 0.15.x. (kubes)

54. NUTCH-565 - Arc File to Nutch Segments Converter. (kubes)

55. NUTCH-547 - Redirection handling: YahooSlurp's algorithm.
    (dogacan, kubes via dogacan)

56. NUTCH-548 - Move URLNormalizer from Outlink to ParseOutputFormat.
    (Emmanuel Joke via dogacan)

57. NUTCH-538 - Delete unused classes under o.a.n.util. (dogacan)

58. NUTCH-494 - FindBugs: CrawlDbReader and DeleteDuplicates. (dogacan)

59. NUTCH-574 - Including inlink anchor text in index can create irrelevant
    search results. Created index-anchor plugin, removed functionality from
    index-basic plugin. For backwards compatibility, add index-anchor plugin to
    nutch-site.xml plugin.includes. (kubes)

60. NUTCH-581 - DistributedSearch does not update search servers added to
    search-servers.txt on the fly. (Rohan Mehta via kubes)

61. NUTCH-586 - Add option to run compiled classes without job file
    (enis via ab)

62. NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy
    server. (Susam Pal via dogacan)

63. NUTCH-534 - SegmentMerger: add -normalize option (Emmanuel Joke via ab)

64. NUTCH-528 - CrawlDbReader: add some new stats + dump into a CSV format
    (Emmanuel Joke via ab)

65. NUTCH-597 - NPE in Fetcher2 (Remco Verhoef via ab)

66. NUTCH-584 - urls missing from fetchlist (Ruslan Ermilov, ab)

67. NUTCH-580 - Remove deprecated hadoop api calls (FS) (siren)

68. NUTCH-587 - Upgrade to Hadoop 0.15.3 (kubes)

69. NUTCH-604 - Upgrade to Lucene 2.3.0 (ab)

70. NUTCH-602 - Allow configurable number of handlers for search servers
    (hartbecke via kubes)

71. NUTCH-607 - Update build.xml to include tika jar when building war (kubes)

72. NUTCH-608 - Upgrade nutch to use released apache-tika-0.1-incubating (mattmann)

73. NUTCH-606 - Refactoring of Generator, run all urls through checks (kubes)

74. NUTCH-605 - Change deprecated configuration methods for Hadoop (kubes)

75. NUTCH-603 - Add more default url normalizations (kubes)

76. NUTCH-611 - Upgrade Nutch to use Hadoop 0.16 (kubes)

77. NUTCH-44 - Too many search results, limits max results returned from a
    single search. (Emilijan Mirceski and Susam Pal via kubes)

78. NUTCH-567 - Proper (?) handling of URIs in TagSoup. TagSoup library is
    updated to 1.2 version. (dogacan)

79. NUTCH-613 - Empty summaries and cached pages (kubes via ab)

80. NUTCH-612 - URL filtering was disabled in Generator when invoked
    from Crawl (Susam Pal via ab)

81. NUTCH-601 - Recrawling on existing crawl directory (Susam Pal via ab)

82. NUTCH-575 - NPE in OpenSearchServlet (John H. Lee via ab)

83. NUTCH-126 - Fetching https does not work with a proxy (Fritz Elfert via ab)

84. NUTCH-615 - Redirected URL-s fetched without setting fetchInterval.
    Guard against reprUrl being null. (Emmanuel Joke, ab)

85. NUTCH-616 - Reset Fetch Retry counter when fetch is successful (Emmanuel
    Joke, ab)

86. NUTCH-220 - Upgrade to PDFBox 0.7.3 (ab)

87. NUTCH-223 - Crawl.java uses Integer.MAX_VALUE (Jeff Ritchie via ab)

88. NUTCH-598 - Remove deprecated use of ToolBase. Use generics in Hadoop API.
    (Emmanuel Joke, dogacan, ab)

89. NUTCH-620 - BasicURLNormalizer should collapse runs of slashes with a
    single slash. (Mark DeSpain via ab)

90. NUTCH-500 - Add hadoop masters configuration file into conf folder.
    (Emmanuel Joke via kubes)

91. NUTCH-596 - ParseSegments parse content even if it's not
    CrawlDatum.STATUS_FETCH_SUCCESS (dogacan)

92. NUTCH-618 - Tika error "Media type alias already exists" (mattmann, kubes)

93. NUTCH-634 - Upgrade Nutch to Hadoop 0.17.1 (Michael Gottesman, Lincoln
    Ritter, ab)

94. NUTCH-641 - IndexSorter incorrectly copies stored fields (ab)

95. NUTCH-645 - Parse-swf unit test failing (ab)

96. NUTCH-642 - Unit tests fail when run in non-local mode (ab)

97. NUTCH-639 - Change LuceneDocumentWrapper visibility from
    private to _public_ (Guillaume Smet via dogacan)

98. NUTCH-651 - Remove bin/{start|stop}-balancer.sh from svn
    tracking. (dogacan)

99. NUTCH-375 - Add support for Content-Encoding: deflated
    (Pascal Beis, ab)

100. NUTCH-633 - ParseSegment no longer allows reparsing.
    (dogacan)

101. NUTCH-653 - Upgrade to hadoop 0.18. (dogacan)

102. NUTCH-621 - Nutch needs to declare its crypto usage (mattmann)

103. NUTCH-654 - urlfilter-regex's main does not work.
    (dogacan)

104. NUTCH-640 - confusing description "set it to Integer.MAX_VALUE".
    (dogacan)

105. NUTCH-662 - Upgrade Nutch to use Lucene 2.4. (kubes)

106. NUTCH-663 - Upgrade Nutch to use Hadoop 0.19 (kubes)

107. NUTCH-647 - Resolve URLs tool (kubes)

108. NUTCH-665 - Search Load Testing Tool (kubes)

109. NUTCH-667 - Input Format for working with Content in Hadoop Streaming
    (kubes)

110. NUTCH-635 - LinkAnalysis Tool for Nutch. (kubes)

111. NUTCH-646 - New Indexing Framework for Nutch. (kubes)

112. NUTCH-668 - Domain URL Filter. (kubes)

113. NUTCH-594 - Serve Nutch search results in multiple formats including
    XML and JSON. (kubes)

114. NUTCH-442 - Integrate Solr/Nutch. (dogacan, original version by siren)

115. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
    fetch interval correctly. (dogacan)

116. NUTCH-627 - Minimize host address lookup (Otis Gospodnetic)

117. NUTCH-678 - Hadoop 0.19 requires an update of jets3t.
    (julien nioche via dogacan)

118. NUTCH-681 - parse-mp3 compilation problem.
    (Wildan Maulana via dogacan)

119. NUTCH-676 - MapWritable is written inefficiently and confusingly.
    (dogacan)

120. NUTCH-579 - Feed plugin only indexes one post per feed due to identical
    digest. (dogacan)

121. NUTCH-571 - parse-mp3 plugin doesn't always index album of mp3.
    (Joseph Chen, dogacan)

122. NUTCH-682 - SOLR indexer does not set boost on the document.
    (julien nioche via dogacan)

123. NUTCH-279 - Additions to urlnormalizer-regex (Stefan Neufeind, ab)

124. NUTCH-671 - JSP errors in Nutch searcher webapp (Edwin Chu via ab)

125. NUTCH-643 - ClassCastException in PDF parser (Guillaume Smet, ab)

126. NUTCH-636 - Httpclient plugin https doesn't work on IBM JRE
    (Curtis d'Entremont, ab)

127. NUTCH-683 - NUTCH-676 broke CrawlDbMerger. (dogacan)

128. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException
    (Stefan Will, siren)

129. NUTCH-691 - Update jakarta poi jars to the most relevant version
    (Dmitry Lihachev via siren)

130. NUTCH-563 - Include custom fields in BasicQueryFilter
    (Julien Nioche via siren)

131. NUTCH-695 - Incorrect mime type detection by MoreIndexingFilter plugin
    (Dmitry Lihachev via siren)

132. NUTCH-694 - Distributed Search Server fails (siren)

133. NUTCH-626 - Fetcher2 breaks out the domain with db.ignore.external.links
    set at cross domain redirects (Remco Verhoef, dogacan via siren)

134. NUTCH-247 - Robot parser to restrict (kubes, siren)

135. NUTCH-698 - CrawlDb is corrupted after a few crawl cycles (dogacan
    via siren)

136. NUTCH-699 - Add an "official" solr schema for solr integration (dogacan,
    Dmitry Lihachev via siren)

137. NUTCH-703 - Upgrade to Hadoop 0.19.1 (ab)

138. NUTCH-419 - Unavailable robots.txt kills fetch (Carsten Lehmann,
    Doug Cook via ab)

139. NUTCH-700 - Neko1.9.11 goes into a loop (Julien Nioche, siren)

140. NUTCH-669 - Consolidate code for Fetcher and Fetcher2 (siren)

141. NUTCH-711 - Indexer failing after upgrade to Hadoop 0.19.1 (ab)

142. NUTCH-684 - Dedup support for Solr. (dogacan)

143. NUTCH-715 - Subcollection plugin doesn't work with default
    subcollections.xml file (Dmitry Lihachev via siren)

144. NUTCH-722 - Nutch contains JAI jars that we cannot redistribute

Release 0.9 - 2007-04-02

1. Changed log4j configuration to log to stdout on commandline
    tools (siren)

2. NUTCH-344 - Fix for thread blocking issue (Greg Kim via siren)

3. NUTCH-260 - Update hadoop version to 0.5.0 (Renaud Richardet,
    siren)

4. Optionally skip pages with abnormally large values of Crawl-Delay
    (Dennis Kubes via ab)

5. Change readdb -stats to use CombiningCollector (ab)

6. NUTCH-348 - Fix Generator to select highest scoring pages (Chris
    Schneider and Stefan Groschupf via ab)

7. NUTCH-347 - Adjust plugin build script not to emit warnings when copying
    dependent jars (siren)

8. NUTCH-338 - Remove the text parser as an option for parsing PDF files
    in parse-plugins.xml (Chris A. Mattmann via siren)

9. NUTCH-105 - Network error during robots.txt fetch causes file to
    be ignored (Greg Kim via siren)

10. NUTCH-367 - DistributedSearch thrown ClassCastException (siren)

11. NUTCH-332 - Fix the problem of doubling scores caused by links pointing
    to the current page (e.g. anchors). (Stefan Groschupf via ab)

12. NUTCH-365 - Flexible URL normalization (ab)

13. NUTCH-336 - Differentiate between newly discovered pages and newly
    injected pages (Chris Schneider via ab) NOTE: this changes the
    scoring API, filter implementations need to be updated.

14. NUTCH-337 - Fetcher ignores the fetcher.parse value (Stefan Groschupf
    via ab)

15. NUTCH-350 - Urls blocked by http.max.delays incorrectly marked as GONE
    (Stefan Groschupf via ab)

16. NUTCH-374 - When http.content.limit is set to -1 and
    Response.CONTENT_ENCODING is gzip or x-gzip, it cannot fetch anything
    (King Kong via pkosiorowski)

17. NUTCH-383 - upgrade to Hadoop 0.7.1 and Lucene 2.0.0. (ab)

****************************** WARNING !!! ********************************
* This upgrade breaks data format compatibility. A tool 'convertdb'       *
* was added to migrate existing CrawlDb-s to the new format. Segment data *
* can be partially migrated using 'mergesegs', however segments will      *
* require re-parsing (and consequently re-indexing).                      *
****************************** WARNING !!! ********************************

18. NUTCH-371 - DeleteDuplicates now correctly implements both parts of
    the algorithm. (ab)

19. NUTCH-391 - ParseUtil logs file contents to log file when it cannot
    find parser (siren)

20. NUTCH-379 - ParseUtil does not pass through the content's URL to the
    ParserFactory (Chris A. Mattmann via siren)

21. NUTCH-361, NUTCH-136 - When jobtracker is 'local' generate only one
    partition. (ab)

22. NUTCH-399 - Change CommandRunner to use concurrent api from jdk (siren)

23. NUTCH-395 - Increase fetching speed (siren)

24. NUTCH-388 - nutch-default.xml has outdated example for urlfilter.order
    (reported by Jared Dunne)

25. NUTCH-404 - Fix LinkDB Usage - implementation mismatch (siren)

26. NUTCH-403 - Make URL filtering optional in Generator (siren)

27. NUTCH-405 - Content object is not properly initialized in map method
    of ParseSegment (siren)

28. NUTCH-362 - Remove parse-text from unsupported filetypes in
    parse-plugins.xml (siren)

29. NUTCH-305 - Update crawl and url filter lists to exclude
    jpeg|JPEG|bmp|BMP, suffix-urlfilter.txt (contributed by Stefan
    Neufeind) is also updated (siren)

30. NUTCH-406 - Metadata tries to write null values (mattmann)

31. NUTCH-415 - Generator should mark selected records in CrawlDb.
    Due to increased resource consumption this step is optional.
    Application-level locking has been added to prevent concurrent
    modification of databases. (ab)

32. NUTCH-416 - CrawlDatum status and CrawlDbReducer refactoring. It is
    now possible to correctly update CrawlDb from multiple segments.
    Introduce new status codes for temporary and permanent
    redirection. (ab)

33. NUTCH-322 - Fix Fetcher to store redirected pages and to store
    protocol-level status. This also should fix NUTCH-273. (ab)

34. Change default Fetcher behavior not to follow redirects immediately.
    Instead Fetcher will record redirects as new pages to be added to CrawlDb.
    This also partially addresses NUTCH-273. (ab)

35. Detect and report when Generator creates 0-sized segments. (ab)

36. Fix Injector to preserve already existing CrawlDatum if the seed list
    being injected also contains such URL. (ab)

37. NUTCH-425, NUTCH-426 - Fix anchors pollution. Continue after
    skipping bad URLs. (Michael Stack via ab)

38. NUTCH-325 - UrlFilters.java throws NPE in case urlfilter.order contains
    Filters that are not in plugin.includes (Stefan Groschupf, siren)

39. NUTCH-421 - Allow predeterminate running order of indexing filters
    (Alan Tanaman, siren)

40. When indexing pages with redirection, drop all intermediate pages and
    index only the final page. (ab)

41. Upgrade to Hadoop 0.10.1. (ab)

42. NUTCH-420 - Fix a bug in DeleteDuplicates where results depended on the
    order in which IndexDoc-s are processed. (Dogacan Guney via ab)

43. NUTCH-428 - NullPointerException thrown when agent name is not
    configured properly. Changed to throw RuntimeException instead.
    (siren)

44. NUTCH-430 - Integer overflow in HashComparator.compare (siren)

45. NUTCH-68 - Add a tool to generate arbitrary fetchlists. (ab)

46. NUTCH-433 - java.io.EOFException in newer nightlies in mergesegs
    or indexing from hadoop.io.DataOutputBuffer (siren)

47. NUTCH-339 - Fetcher2: a queue-based fetcher implementation. (ab)

48. NUTCH-390 - Javadoc warnings (mattmann)

49. NUTCH-449 - Make junit output format configurable. (nigel via cutting)

50. NUTCH-432 - Fix a bug where platform name with spaces would break the
    bin/nutch script. (Brian Whitman via ab)

51. Upgrade to Hadoop 0.11.2 and Lucene 2.1.0 release. (ab)

52. NUTCH-167 - Observation of robots "noarchive" directive. (ab)

53. NUTCH-384 - Protocol-file plugin does not allow the parse plugins
    framework to operate properly (Heiko Dietze via mattmann)

54. NUTCH-233 - Wrong regular expression hangs reduce process forever (Stefan
    Groschupf via kubes)

55. NUTCH-436 - Incorrect handling of relative paths when the embedded URL
    path is empty (kubes)

56. Upgrade to Hadoop 0.12.1 release. (ab)

57. NUTCH-246 - Incorrect segment size being generated due to time
    synchronization issue (Stefan Groschupf via ab)

58. Upgrade to Hadoop 0.12.2 release. (ab)

59. NUTCH-333 - SegmentMerger and SegmentReader should use NutchJob. (Michael
    Stack and Dogacan Guney via kubes)

Release 0.8 - 2006-07-25

0. Totally new architecture, based on hadoop
    [http://lucene.apache.org/hadoop] (cutting)

1. NUTCH-107 - Typo in plugin/urlfilter-*/plugin.xml. (Stephen Cross).

2. NUTCH-108 - Log hosts that exceed generate.max.per.host.
    (Rod Taylor via cutting)

3. NUTCH-88 - Enhance ParserFactory plugin selection policy
    (jerome)

4. NUTCH-124 - Protocol-httpclient does not follow redirects when
    fetching robots.txt (cutting)

5. NUTCH-130 - Be explicit about target JVM when building (1.4.x?)
    (stack@archive.org, cutting)

6. NUTCH-114 - Getting number of urls and links from crawldb
    (Stefan Groschupf via ab)

7. NUTCH-112 - Link in cached.jsp page to cached content is an
    absolute link (Chris A. Mattmann via jerome)

8. NUTCH-135 - Http header meta data are case insensitive in the
    real world (Stefan Groschupf via jerome)

9. NUTCH-145 - Build of war file fails on Chinese (zh) .xml files due
    to UTF-8 BOM (KuroSaka TeruHiko via siren)

10. NUTCH-121 - SegmentReader for mapred (Rod Taylor via ab)

11. Added support for OpenSearch (cutting)

12. NUTCH-142 - NutchConf should use the thread context classloader
    (Mike Cannon-Brookes via pkosiorowski)

13. NUTCH-160 - Use standard Java Regex library rather than
    org.apache.oro.text.regex (Rod Taylor via cutting)

14. NUTCH-151 - CommandRunner can hang after the main thread exec is
    finished and has inefficient busy loop (Paul Baclace via cutting)

15. NUTCH-174 - Problem encountered with ant during compilation

16. NUTCH-190 - ParseUtil drops reason for failed parse
    (stack@archive.org via ab)

17. NUTCH-169 - Remove static NutchConf (Marko Bauhardt via ab)

18. NUTCH-194 - Nutch-169 introduced two tiny bugs (Marko Bauhardt via ab)

19. NUTCH-178 - in search.jsp must be session creation "false"
    (YourSoft via siren)

20. NUTCH-200 - OpenSearch Servlet is broken
    (Marko Bauhardt via siren)

21. NUTCH-81 - Webapp only works when deployed in root
    (AJ Banck, Michael Nebel via siren)

22. NUTCH-139 - Standard metadata property names in the ParseData
    metadata (Chris A. Mattmann, jerome)

23. NUTCH-192 - Meta data support for CrawlDatum
    (Stefan Groschupf via ab)

24. NUTCH-52 - Parser plugin for MS Excel files
    (Rohit Kulkarni via jerome)

25. NUTCH-53 - Parser plugin for Zip files
    (Rohit Kulkarni via jerome)

26. NUTCH-137 - footer is not displayed in search result page
    (KuroSaka TeruHiko via siren)

27. NUTCH-118 - FAQ link points to invalid URL
    (Steve Betts via siren)

28. NUTCH-184 - Serbian (sr, Cyrillic) and Serbo-Croatian (sh, Latin)
    translation (Ivan Sekulovic via siren)

29. NUTCH-211 - FetchedSegments leave readers open (Stefan Groschupf
    via cutting)

30. NUTCH-140 - Add alias capability in parse-plugins.xml file that
    allows mimeType->extensionId mapping (Chris A. Mattmann via jerome)

31. NUTCH-214 - Added Links to web site to search mailing list
    (Jake Vanderdray via jerome)

32. NUTCH-204 - Multiple field values in HitDetails
    (Stefan Groschupf via jerome)

33. NUTCH-219 - file.content.limit & ftp.content.limit should be changed
    to -1 to be consistent with http (jerome)

34. NUTCH-221 - Prepare nutch for upcoming lucene 2.0 (siren)

35. NUTCH-91 - Empty encoding causes exception (Michael Nebel via
    pkosiorowski)

36. NUTCH-228 - Clustering plugin descriptor broken (Dawid Weiss via
    jerome)

37. NUTCH-229 - Improved handling of plugin folder configuration
    (Stefan Groschupf via ab)

38. NUTCH-206 - Search server throws InstantiationException (ab)

39. NUTCH-203 - ParseSegment throws InstantiationException (Marko Bauhardt
    via ab)

40. NUTCH-3 - Multi values of header discarded (Stefan Groschupf via ab)

41. Update to lucene 1.9.1 (cutting)

42. NUTCH-235 - Duplicate Inlink values (ab)

43. NUTCH-234 - Clustering extension code cleanups and a real
    JUnit test case for the current implementation (Dawid Weiss via ab)

44. NUTCH-210 - Context.xml file for Nutch web application
    (Chris A. Mattmann via jerome)

45. NUTCH-231 - Invalid CSS entries (AJ Banck via jerome)

46. NUTCH-232 - Search.jsp has multiple search forms creating
    invalid html / incorrect focus function (jerome)

47. NUTCH-196 - lib-xml and lib-log4j plugins (ab, jerome)

48. NUTCH-244 - Inconsistent handling of property values
    boundaries / unable to set db.max.outlinks.per.page to
    infinite (jerome)

49. NUTCH-245 - DTD for plugin.xml configuration files
    (Chris A. Mattmann via jerome)

50. NUTCH-250 - Generate to log truncation caused by
    generate.max.per.host (Rod Taylor via cutting)

51. NUTCH-125 - OpenOffice Parser plugin (ab)

52. Switch from using java.io.File to org.apache.hadoop.fs.Path.
    (cutting)

53. NUTCH-240 - Scoring API: extension point, scoring filters and
    an OPIC plugin (ab)

54. NUTCH-134 - Summarizer doesn't select the best snippets (jerome)

55. NUTCH-268 - Generator and lib-http use different definitions of
    "unique host" (ab)

56. NUTCH-280 - Url query causes NullPointerException (Grant Glouser
    via siren)

57. NUTCH-285 - LinkDb Fails rename doesn't create parent directories
    (Dennis Kubes via ab)

58. NUTCH-201 - Add support for subcollections
    (siren)

59. NUTCH-298 - If a 404 for a robots.txt is returned a NPE is thrown
    (Stefan Groschupf via jerome)

60. NUTCH-275 - Fetcher not parsing XHTML-pages at all (jerome)

61. NUTCH-301 - CommonGrams loads analysis.common.terms.file for each query
    (Stefan Groschupf via jerome)

62. NUTCH-110 - OpenSearchServlet outputs illegal xml characters
    (stack@archive.org via siren)

63. NUTCH-292 - OpenSearchServlet: OutOfMemoryError: Java heap space
    (Stefan Neufeind via siren)

64. NUTCH-307 - Wrong configured log4j.properties (jerome)

65. NUTCH-303 - Logging improvements (jerome)

66. NUTCH-308 - Maximum search time limit (ab)

67. NUTCH-306 - DistributedSearch.Client liveAddresses concurrency
    problem (Grant Glouser via siren)

68. Update to hadoop-0.4 (Milind Bhandarkar, cutting)

69. NUTCH-317 - Clarify what the queryLanguage argument of
    Query.parse(...) means (jerome)

70. Added alternative experimental web gui in contrib containing
    extensions like subcollection, keymatch, user preferences,
    caching, implemented mainly using tiles and jstl (siren)

71. NUTCH-320 - DmozParser does not output list of urls to stdout
    but to a log file instead. Original functionality restored.

72. NUTCH-271 - Add ability to limit crawling to the set of initially
    injected hosts (db.ignore.external.links) (Philippe Eugene,
    Stefan Neufeind via ab)

73. NUTCH-293 - Support for Crawl-Delay (Stefan Groschupf via ab)

74. NUTCH-327 - Fixed logging directory on cygwin (siren)

Release 0.7 - 2005-08-17

1. Added support for "type:" in queries. Search results are limited/qualified
    by mimetype or its primary type or sub type. For example,
    (1) searching with "type:application/pdf" restricts results
    to pages which were identified to be of mimetype "application/pdf".
    (2) with "type:application", nutch will return pages of
    primary type "application".
    (3) with "type:pdf", only pages of sub type "pdf" will be listed.
    (John Xing, 20050120)

2. Added support for "date:" in queries. Last-Modified is indexed.
    Search results are restricted by lower and upper date (inclusive)
    as date:yyyymmdd-yyyymmdd. For example, date:20040101-20041231
    only returns pages with Last-Modified in year 2004.
    (John Xing, 20050122)

3. Add URLFilter plugin interface and convert existing url filters into
    plugins. (John Xing, 20050206)

4. Add UpdateSegmentsFromDb tool, which updates the scores and
    anchors of existing segments with the current values in the web
    db. This is used by CrawlTool, so that pages are now only fetched
    once per crawl. (Doug Cutting, 20050221)

5. Moved code into org.apache.nutch sub-packages. Changed license to
    Apache 2.0. Removed jar files whose licenses do not permit
    redistribution by Apache. Disabled compilation of plugins which
    require these libraries. (Doug Cutting 20050301)

6. Index host and title in separate fields. Host was indexed
    previously only as a part of the URL. Title was indexed as an
    anchor. Now boosts for matching these fields may be adjusted
    separately from boosts for matching anchors and url. Also: move
    site indexing to index-basic plugin to minimize the number of
    times the URL needs to be parsed; and, stop using anchor analyzer
    for anything but anchors. (Piotr Kosiorowski via Doug Cutting
    20050323)

7. Add servlet Cached.java that serves cached Content of any mime type.
    Slightly modified are web.xml and cached.jsp.
    (John Xing, 20050401)

8. Add skipCompressedByteArray() to WritableUtils.java.
    (John Xing, 20050402)

9. Fixes to jsp and static web pages. These now use relative links,
    so that the Nutch webapp file can be used in places other than at
    the root. Also fixed links to the about and help pages. Bug #32.
    (Jerome Charron via cutting, 20050404)

10. Added some features to DistributedSearch: new segments can be added
    to searchservers without restarting the frontend, defective search
    servers are not queried until they come back online, watchdog keeps
    an eye on your searchservers and writes simple statistics.
    (Sami Siren, 20050407)

11. Fix for bug #4 - Unbalanced quote in query eats all resources.
    (Piotr Kosiorowski, Sami Siren, 20050407)

12. Close Issue #33 - MIME content type detector (using magic char sequences).
    (Jerome Charron and Hari Kodungallur via John Xing, 20050416)

13. Add a servlet that implements A9's OpenSearch RSS web service.
    (cutting, 20050418)

14. Remove references to link analysis from tutorial, and enable
    scoring by link count when generating fetchlists and searching.
    (cutting, 20040419)

15. Make query boosts for host, title, anchor and phrase matches
    configurable. (Piotr Kosiorowski via cutting, 20050419)

16. Add support for sorting search results and search-time deduping by
    fields other than site.

17. Automatically convert range queries into cached range filters.
    This improves the performance and scalability of, e.g., date range
    searching.

18. Several methods have been renamed due to misspellings. The old
    methods have been deprecated and will be removed before the 1.0
    release.


Release 0.6

1. Added clustering-carrot2 plugin, together with introduction of clustering
    api and modification to search jsp. (Dawid Weiss via John Xing, 20040809)

2. Make a number of changes to NDFS (Nutch Distributed File System)
    to fix bugs, add admin tools, etc.

    Also, modify all command line tools so you can indicate whether to
    use NDFS or the local filesystem. If you indicate nothing, then
    it defaults to the local fs.

    I've used this to do a 35m page crawl via NDFS, distributed over a
    dozen machines. (Mike Cafarella)

3. Add support for BASE tags in HTML. Outlinks are now correctly
    extracted when a BASE tag is present. (cutting)

4. Fix two bugs in result pagination. When the last hit on a page
    was the last hit overall, the "next" button was sometimes shown
    when the "show all" button should be shown instead. Also, in
    certain cases, the "show all" button would be shown when the
    "next" button should have been shown. (cutting)

5. Add config parameter "indexer.max.tokens" that determines the
    maximum number of tokens indexed per field. (Andy Hedges via cutting)

6. Add parser for mp3 files. (Andy Hedges via cutting)

7. Add RegexUrlNormalizer. This is useful for things like stripping
    out session IDs from URLs. To use it, add values for
    urlnormalizer.class and urlnormalizer.regex.file to your
    nutch-site.xml. The RegexUrlNormalizer class extends the
    BasicUrlNormalizer, and does basic normalization as well.
    (Luke Baker via cutting)

8. Added Swedish translation (Stefan Verzel via Sami Siren, 20040910)

9. Added Polish translation (Andrzej Bialecki, 20040911)

10. Added 3 more language profiles to language identifier (ru,hu,pl).
    Other changes to language identifier: Profiles converted to utf8,
    added some test cases, changed the similarity calculation.
    (Sami Siren, 20040925)

11. Added plugin parse-rtf (Andy Hedges via John Xing, 20040929)

12. Added plugin index-more and more.jsp (John Xing, 20041003)

13. Added "View as Plain Text" feature. A new op OP_PARSETEXT is introduced
    in DistributedSearch.java. text.jsp is added. (John Xing, 20041006)

14. Fixed a bug that fails cached.jsp, explain.jsp, anchors.jsp and text.jsp
    (but not search.jsp) with NullPointerException in distributed search.
    It seems that this bug appears after "hits per site" stuff is added.
    The fix is done in Hit.java, making sure String site is never null.
    Hopefully this fix has no bad effect on "hits per site" code.
    (John Xing, 20041006)

15. Fixed a bug that fails fullyDelete() in FileUtil.java for
    LocalFileSystem.java. This bug also exposes possible incompleteness
    of NDFSFile.java, where a few methods are not supported, including
    delete(). Nothing changed in NDFSFile.java though. Leave it for future
    improvement (John Xing, 20041022).

16. Introduced option -noParsing to Fetcher.java and added ParseSegment.java.
    A new status code CANT_PARSE is added to FetcherOutput.java.
    Without option -noParsing, no change in fetcher behavior. With
    option -noParsing, fetcher only crawls; no parsing is carried out.
    Then, ParseSegment.java should be used to parse in a separate pass.
    (John Xing, 20041025)

17. Added ontology plugin. Currently it is used for query refinement, as
    exemplified in refine-query-init.jsp and refine-query.jsp. By default,
    query refinement is disabled in search.jsp. Please check
    ./src/plugin/ontology/README.txt for further description.
    Ontology plugin certainly can be used for many other things.
    (Michael J. Pan via John Xing, 20041129)

18. Changed fetcher.server.delay to be a float, so that sub-second
    delays can be specified. (cutting)

19. Added plugin.includes config parameter that determines which
    plugins are included. By default now only http, html and basic
    indexing and search plugins are enabled, rather than all plugins.
    This should make default performance more predictable and reliable
    going forward. (cutting)

20. Cleaned up some filesystem code, including:

    - Replaced BufferedRandomAccessFile with two simpler utilities,
      NFSDataInputStream and NFSDataOutputStream.

    - Fixed the bug where SequenceFiles were no longer flushed when
      created, so that, when fetches crashed, segments were
      unreadable. Now segments are always readable after crashes.
      Only the contents of the last buffer is lost.

    - Simplified the FSOutputStream API to not include seek(). We
      should never need that functionality.

    - Simplified LocalFileSystem's implementations of FSInputStream
      and FSOutputStream and optimized FSInputStream.seek().

    (cutting)

21. Fixed BasicUrlNormalizer to better handle relative urls.
The file960 part of a URL is normalized in the following manner:961 962 1. "/aa/../" will be replaced by "/" This is done step by step until963 the url doesn´t change anymore. So we ensure, that964 "/aa/bb/../../" will be replaced by "/", too965 966 2. leading "/../" will be replaced by "/"967 968 (Sven Wende via cutting)969 970 22. Fix Page constructors so that next fetch date is less likely to be971 misconstrued as a float. This patches a problem in WebDBInjector,972 where new pages were added to the db with nextScore set to the973 intended nextFetch date. This, in turn, confused link analysis.974 975 23. In ndfs code, replace addLocalFile(), putToLocalFile() with976 copyFromLocalFile(), moveFromLocalFile(), copyToLocalFile() and977 moveToLocalFile(). (John Xing, 20041217)978 979 24. Added new config parameter fetcher.threads.per.host. This is used980 by the Http protocol. When this is one behavior is as before.981 When this is greater than one then multiple threads are permitted982 to access a host at once. Note that fetcher.server.delay is no983 longer consistently observed when this is greater than one.984 (Luke Baker via Doug Cutting)985 986 Release 0.5987 988 1. Changed plugin directory to be a list of directories.989 990 2. Permit Plugin to be the default plugin implementation.991 992 3. Added pluggable interface for network protocols in new package993 net.nutch.protocol. Moved http code from core into a plugin.994 995 4. Added pluggable interface for content parsing in new package996 net.nutch.parse. Moved html parsing code from core into a997 plugin.998 999 5. Fixed a bug in NutchAnalysis where 16-bit characters were not1000 processed correctly.1001 1002 6. Fixed bug #971731: random summaries on result page.1003 (Daniel Naber via cutting)1004 1005 7. Made Nutch logo transparent. (Daniel Naber via cutting)1006 1007 8. Added file protocol plugin. (John Xing via cutting)1008 1009 9. Added ftp protocol plugin. (John Xing via cutting)1010 1011 10. 
Added pdf and msword parser plugins. (John Xing via cutting)1012 1013 11. Added pluggable indexing interface. By default, url, content,1014 anchors and title are indexed, as before, but now one can easily1015 alter this to, e.g., index metadata. A demonstration is provided1016 which extracts and indexes Creative Commons license urls. (cutting)1017 1018 12. Add language identification plugin.1019 1020 The process of identification is as follows:1021 1022 1. html (html only, HTML 4.0 "lang" attribute)1023 2. meta tags (html only, http-equiv, dc.language)1024 3. http header (Content-Language)1025 4. if all above fail "statistical analysis"1026 1027 1 & 2 are run during the fetching phase and 3 & 4 are run on1028 indexing phase.1029 1030 Currently supported languages (in "statistical analysis") are1031 da,de,el,en,es,fi,fr,it,nl,sv and pt. The corpus used was grabbed1032 from http://www.isi.edu/~koehn/europarl/ and the profiles were1033 build with tool supplied in patch.1034 1035 After indexing the language can be found from field named "lang"1036 1037 It's not 100% accurate but it's a start.1038 (Sami Siren)1039 1040 13. Added SegmentMergeTool and "mergesegs" command, to remove1041 duplicated or otherwise not used content from several segments and1042 joining them together into a single new segment. The tool also1043 optionally performs several other steps required for proper1044 operation of Nutch - such as indexing segments, deleting1045 duplicates, merging indices, and indexing the new single segment.1046 (Andrzej Bialecki)1047 1048 14. Add the ability to retrieve ParseData of a search hit. ParseData1049 contains many valuable properties of a search hit.1050 1051 This is required (among others) to properly display the cached1052 content because it's not possible to determine the character1053 encoding from the output of the getContent() method (which returns1054 byte[]). 
The symptoms are that for HTML pages using non-latin1 or1055 non-UTF8 encodings the cached preview will almost certainly look1056 broken. Using the attached patch it is possible to determine the1057 character encoding from the ParseData (for HTTP: Content-Type1058 metadata), and encode the content accordingly. (Andrzej Bialecki)1059 1060 15. Add a pluggable query interface. By default, the content, anchor1061 and url fields are searched as before. A sample plugin indexes1062 the host name and adds a "site:" keyword to query parsing.1063 1064 16. Added support for "lang:" in queries. For example, searching with1065 "lang:en" restricts results to pages which were identified to1066 be in English.1067 1068 17. Automatically optimize field queries to use cached Lucene filters.1069 This makes, for example, searches restricted by languages or sites1070 that are very common much faster.1071 1072 18. Improved charset handling in jsp pages. (jshin by cutting)1073 1074 19. Permit topic filtering when injecting DMOZ pages. (jshin by cutting)1075 1076 20. When parsing crawled pages, interpret charset specifications in1077 html meta tags. (jshin by cutting)1078 1079 21. Added support for "cc:licensed" in queries, which searches for documents1080 released under Creative Commons licenses. Attributes of the1081 license may also be queried, with, e.g., "cc:by" for1082 attribution-required licenses, "cc:nc" for non-commercial1083 licenses, etc.1084 1085 22. Relative paths named in plugin.folders are now searched for on the1086 classpath. This makes, e.g., deployment in a war file much simpler.1087 1088 23. Modifications to Fetcher.java.1089 1090 1. Make sure it works properly with regard to creation and initialization1091 of plugin instances. The problem was that multiple threads race to1092 startUp() or shutDown() plugin instances. It was solved by synchronizing1093 certain codes in PluginRepository.java and Extension.java.1094 (Stefan Groschupf via John Xing)1095 1096 2. 
Added code to explictly shutDown() plugins. Otherwise FetcherThreads1097 may never return (quit) if there are still data or other structures1098 (e.g., persistent socket connections) associated with plugins. (John Xing)1099 1100 3. Fixed one type of Fetcher "hang" problems by monitoring named1101 FetcherThreads. If all FetcherThreads are gone (finished),1102 Fetcher.java is considered done. The problem was: there could be1103 runaway threads started by external libs via FetcherThreads.1104 Those threads never return, thus keep Fetcher from exiting normally.1105 (John Xing)1106 1107 24. Eliminate excessive hits from sites. This is done efficiently by1108 adding the site name to Hit instances, and, when needed,1109 re-querying with too-frequent sites prohibited in the query.1110 1111 1112 Release 0.41113 1114 1. Http class refactored. (Kevin Smith via Tom Pierce)1115 1116 2. Add Finnish translation. (Sampo Syreeni via Doug Cutting)1117 1118 3. Added Japanese translation. (Yukio Andoh via Doug Cutting)1119 1120 4. Updated Dutch translation. (Ype Kingma via Doug Cutting)1121 1122 5. Initial version of Distributed DB code. (Mike Cafarella)1123 1124 6. Make things more tolerant of crashed fetcher output files.1125 (Doug Cutting)1126 1127 7. New skin for website. (Frank Henze via Doug Cutting)1128 1129 8. Added Spanish translation. (Diego Basch via Doug Cutting)1130 1131 9. Add FTP support to fetcher. (John Xing via Doug Cutting)1132 1133 10. Added Thai translation. (Pichai Ongvasith via Doug Cutting)1134 1135 11. Added Robots.txt & throttling support to Fetcher.java. (Mike1136 Cafarella)1137 1138 12. Added nightly build. (Doug Cutting)1139 1140 13. Default all link scores to 1.0. (Doug Cutting)1141 1142 14. Permit one to keep internal links. (Doug Cutting)1143 1144 15. Fixed dedup to select shortest URL. (Doug Cutting)1145 1146 16. 
Changed index merger so that merged index is written to named1147 directory, rather than to a generated name in that directory.1148 (Doug Cutting)1149 1150 17. Disable coordination weighting of query clauses and other minor1151 scoring improvements. (Doug Cutting)1152 1153 18. Added a new command, crawl, that constructs a database, injects a1154 url file and performs a few rounds of generate/fetch/updatedb.1155 This simplifies use for intranet sites. Changed some defaults to1156 be more intranet friendly. (Doug Cutting)1157 1158 19. Fixed a bug where Fetcher.java didn't construct correct relative1159 links when a page was redirected. (Doug Cutting)1160 1161 20. Fixed a query parser problem with lookahead over plusses and minuses.1162 (Doug Cutting)1163 1164 21. Add support for HTTP proxy servers. (Sami Siren via Doug Cutting)1165 1166 22. Permit searching while fetching and/or indexing.1167 (Sami Siren via Doug Cutting)1168 1169 23. Fix a bug when throttling is disabled. (Sami Siren via Doug Cutting)1170 1171 24. Updated Bahasa Malaysia translation. (Michael Lim via Doug Cutting)1172 1173 25. Added Catalan translation. (Xavier Guardiola via Doug Cutting)1174 1175 26. Added brazilian portuguese translation.1176 (A. Moreir via Doug Cutting)1177 1178 27. Added a french translation. (Julien Nioche via Doug Cutting)1179 1180 28. Updated to Lucene 1.4RC3. (Doug Cutting)1181 1182 29. Add capability to boost by link count & use it in crawl tool.1183 (Doug Cutting)1184 1185 30. Added plugin system. (Stefan Groschupf via Doug Cutting)1186 1187 31. Add this change log file, for recording significant changes to1188 Nutch. Populate it with changes from the last few months. -
nutchez-0.1/debian/control
r66  r73
  1       Source: nutch
       1  Source: nutchez
  2    2  Section:devel
  3    3  Priority: extra
-
nutchez-0.1/debian/files
r66  r73
  1       nutch_1.0-1_i386.deb devel extra
       1  nutchez_0.1-1_i386.deb devel extra
-
nutchez-0.1/debian/nutchez.install
r67  r73
  6    6  tomcat opt/nutch
  7    7  plugins opt/nutch
  8       urls opt/nutch
  9    8  *.jar opt/nutch
 10    9  *.job opt/nutch