source: nutchez-0.1/CHANGES.txt @ 66

Last change on this file since 66 was 66, checked in by waue, 15 years ago

NutchEz - an easy way to nutch

Nutch Change Log

Release 1.0 - 2009-03-23

 1. NUTCH-474 - Fetcher2 crawlDelay and blocking fix (Dogacan Guney via ab)

 2. NUTCH-443 - Allow parsers to return multiple Parse objects.
    (Dogacan Guney et al, via ab)

 3. NUTCH-393 - Indexer should handle null documents returned by filters.
    (Eelco Lempsink via ab)

 4. NUTCH-456 - Parse msexcel plugin speedup (Heiko Dietze via siren)

 5. NUTCH-446 - RobotRulesParser should ignore Crawl-delay values of other
    bots in robots.txt (Dogacan Guney via siren)

 6. NUTCH-482 - Remove redundant plugin lib-log4j (siren)

 7. NUTCH-483 - Remove redundant commons-logging jar from ontology plugin
    (siren)

 8. NUTCH-161 - Change Plain text parser to
    use parser.character.encoding.default property for fall back encoding
    (KuroSaka TeruHiko, siren)

 9. NUTCH-61 - Support for adaptive re-fetch interval and detection of
    unmodified content. (ab)

10. NUTCH-392 - OutputFormat implementations should pass on Progressable.
    (cutting via ab)

11. NUTCH-495 - Unnecessary delays in Fetcher2 (dogacan)

12. NUTCH-443 - Allow parsers to return multiple Parse objects; this will speed
    up the rss parser (dogacan via mattmann). This update is a fix and semantics
    change from the original patch for NUTCH-443. The original patch did not tell
    the Indexer to read crawl_parse too so that it can pick up sub-urls' fetch
    datums. This patch addresses that issue. Now, if Fetcher gets a null content,
    instead of pushing an empty content, it filters the null content.

13. NUTCH-485 - Change HtmlParseFilters to return ParseResult object instead of
    Parse object. (Gal Nitzan via dogacan)

14. NUTCH-489 - URLFilter-suffix management of the url path when the url contains
    some query parameters. (Emmanuel Joke via dogacan)

15. NUTCH-502 - Bug in SegmentReader causes infinite loop.
    (Ilya Vishnevsky via dogacan)

16. NUTCH-444 - Possibly use a different library to parse RSS feed for improved
    performance and compatibility. This patch introduced a new plugin, feed,
    that includes an index filter and a parse plugin for feeds that uses ROME.
    There was discussion to remove parse-rss, in light of the feed plugin,
    however, this patch does not explicitly remove parse-rss. (dogacan, mattmann)

17. NUTCH-471 - Fix synchronization in NutchBean creation.
    (Enis Soztutar via dogacan)

18. Upgrade to Lucene 2.2.0 and Hadoop 0.12.3. (ab)

19. NUTCH-468 - Scoring filter should distribute score to all outlinks at
    once. (dogacan)

20. NUTCH-504 - NUTCH-443 broke parsing during fetching. (dogacan)

21. NUTCH-497 - Extreme Nested Tags causes StackOverflowException in
    DomContentUtils...Spider Trap. (kubes)

22. NUTCH-434 - Replace usage of ObjectWritable with something based on
    GenericWritable. (dogacan)

23. NUTCH-499 - Refactor LinkDb and LinkDbMerger to reuse code. (dogacan)

24. NUTCH-498 - Use Combiner in LinkDb to increase speed of linkdb generation.
    (Espen Amble Kolstad via dogacan)

25. NUTCH-507 - lib-lucene-analyzers jar definition is wrong in plugin.xml.
    (Emmanuel Joke via dogacan)

26. NUTCH-503 - Generator exits incorrectly for small fetchlists.
    (Vishal Shah via dogacan)

27. NUTCH-505 - Outlink urls should be validated. (dogacan)

28. NUTCH-510 - IndexMerger delete working dir. (Enis Soztutar via dogacan)

29. NUTCH-513 - suffix-urlfilter.txt does not have a template. (dogacan)

30. NUTCH-515 - Next fetch time is set incorrectly. (dogacan)

30. NUTCH-506 - Nutch should delegate compression to Hadoop. (dogacan)

31. NUTCH-517 - build encoding should be UTF-8. (Enis Soztutar via dogacan).

32. NUTCH-518 - Fix OpicScoringFilter to respect scoring filter chaining.
    (Enis Soztutar via dogacan)

33. NUTCH-516 - Next fetch time is not set when it is a
    CrawlDatum.STATUS_FETCH_GONE. (Emmanuel Joke via dogacan)

34. NUTCH-525 - DeleteDuplicates generates ArrayIndexOutOfBoundsException
    when trying to rerun dedup on a segment. (Vishal Shah via dogacan)

35. NUTCH-514 - Indexer should only index pages with fetch status SUCCESS.
    (dogacan) Note: There is a bigger problem, i.e. how to deal
    with redirected pages, and this issue can be considered as a band-aid
    for the time being. See NUTCH-273 and NUTCH-353 for more details.

36. NUTCH-533 - LinkDbMerger: url normalized is not updated in the key and
    inlinks list. (Emmanuel Joke via dogacan)

37. NUTCH-535 - ParseData's contentMeta accumulates unnecessary values during
    parse. (dogacan)

38. NUTCH-522 - Use URLValidator in the Injector. (Emmanuel Joke, dogacan)

39. NUTCH-536 - Reduce number of warnings in nutch core. (dogacan)

40. NUTCH-439 - Top Level Domains Indexing / Scoring. Also adds
    domain-related utilities. (Enis Soztutar via dogacan)

41. NUTCH-544 - Upgrade Carrot2 clustering plugin to the newest stable
    release (2.1). (Dawid Weiss via dogacan)

42. NUTCH-545 - Configuration and OnlineClusterer get initialized in every
    request. (Dawid Weiss via dogacan)

43. NUTCH-532 - CrawlDbMerger: wrong computation of last fetch time.
    (Emmanuel Joke via dogacan)

44. NUTCH-550 - Parse fails if db.max.outlinks.per.page is -1. (dogacan)

45. NUTCH-546 - file URLs are filtered out by the crawler. (dogacan)

46. NUTCH-554 - Generator throws IOException on invalid urls.
    (Brian Whitman via ab)

47. NUTCH-529 - NodeWalker.skipChildren doesn't work for more than 1 child.
    (Emmanuel Joke via dogacan)

48. NUTCH-25 - needs 'character encoding' detector.
    (Doug Cook, dogacan, Marcin Okraszewski, Renaud Richardet via dogacan)

49. NUTCH-508 - ${hadoop.log.dir} and ${hadoop.log.file} are not propagated
    to the tasktracker. (Mathijs Homminga, Emmanuel Joke via dogacan)

50. NUTCH-562 - Port mime type framework to use Tika mime detection framework.
    (mattmann)

51. NUTCH-488 - Avoid parsing unnecessary links and get a more relevant outlink
    list. (Emmanuel Joke, Marcin Okraszewski via kubes)

52. NUTCH-501 - Implement a different caching mechanism for objects cached in
    configuration. (dogacan)

53. NUTCH-552 - Upgrade Nutch to Hadoop 0.15.x. (kubes)

54. NUTCH-565 - Arc File to Nutch Segments Converter. (kubes)

55. NUTCH-547 - Redirection handling: YahooSlurp's algorithm.
    (dogacan, kubes via dogacan)

56. NUTCH-548 - Move URLNormalizer from Outlink to ParseOutputFormat.
    (Emmanuel Joke via dogacan)

57. NUTCH-538 - Delete unused classes under o.a.n.util. (dogacan)

58. NUTCH-494 - FindBugs: CrawlDbReader and DeleteDuplicates. (dogacan)

59. NUTCH-574 - Including inlink anchor text in index can create irrelevant
    search results.  Created index-anchor plugin, removed functionality from
    index-basic plugin. For backwards compatibility, add index-anchor plugin to
    nutch-site.xml plugin.includes. (kubes)

60. NUTCH-581 - DistributedSearch does not update search servers added to
    search-servers.txt on the fly.  (Rohan Mehta via kubes)

61. NUTCH-586 - Add option to run compiled classes without job file
    (enis via ab)

62. NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy
    server. (Susam Pal via dogacan)

63. NUTCH-534 - SegmentMerger: add -normalize option (Emmanuel Joke via ab)

64. NUTCH-528 - CrawlDbReader: add some new stats + dump into a CSV format
    (Emmanuel Joke via ab)

65. NUTCH-597 - NPE in Fetcher2 (Remco Verhoef via ab)

66. NUTCH-584 - urls missing from fetchlist (Ruslan Ermilov, ab)

67. NUTCH-580 - Remove deprecated hadoop api calls (FS) (siren)

68. NUTCH-587 - Upgrade to Hadoop 0.15.3 (kubes)

69. NUTCH-604 - Upgrade to Lucene 2.3.0 (ab)

70. NUTCH-602 - Allow configurable number of handlers for search servers
    (hartbecke via kubes)

71. NUTCH-607 - Update build.xml to include tika jar when building war (kubes)

72. NUTCH-608 - Upgrade nutch to use released apache-tika-0.1-incubating (mattmann)

73. NUTCH-606 - Refactoring of Generator, run all urls through checks (kubes)

74. NUTCH-605 - Change deprecated configuration methods for Hadoop (kubes)

75. NUTCH-603 - Add more default url normalizations (kubes)

76. NUTCH-611 - Upgrade Nutch to use Hadoop 0.16 (kubes)

77. NUTCH-44 - Too many search results, limits max results returned from a
    single search. (Emilijan Mirceski and Susam Pal via kubes)

78. NUTCH-567 - Proper (?) handling of URIs in TagSoup. TagSoup library is
    updated to 1.2 version. (dogacan)

79. NUTCH-613 - Empty summaries and cached pages (kubes via ab)

80. NUTCH-612 - URL filtering was disabled in Generator when invoked
    from Crawl (Susam Pal via ab)

81. NUTCH-601 - Recrawling on existing crawl directory (Susam Pal via ab)

82. NUTCH-575 - NPE in OpenSearchServlet (John H. Lee via ab)

83. NUTCH-126 - Fetching https does not work with a proxy (Fritz Elfert via ab)

84. NUTCH-615 - Redirected URL-s fetched without setting fetchInterval.
    Guard against reprUrl being null. (Emmanuel Joke, ab)

85. NUTCH-616 - Reset Fetch Retry counter when fetch is successful (Emmanuel
    Joke, ab)

86. NUTCH-220 - Upgrade to PDFBox 0.7.3 (ab)

87. NUTCH-223 - Crawl.java uses Integer.MAX_VALUE (Jeff Ritchie via ab)

88. NUTCH-598 - Remove deprecated use of ToolBase. Use generics in Hadoop API.
    (Emmanuel Joke, dogacan, ab)

89. NUTCH-620 - BasicURLNormalizer should collapse runs of slashes into a
    single slash. (Mark DeSpain via ab)

90. NUTCH-500 - Add hadoop masters configuration file into conf folder.
    (Emmanuel Joke via kubes)

91. NUTCH-596 - ParseSegments parse content even if it's not
    CrawlDatum.STATUS_FETCH_SUCCESS (dogacan)

92. NUTCH-618 - Tika error "Media type alias already exists" (mattmann,kubes)

93. NUTCH-634 - Upgrade Nutch to Hadoop 0.17.1 (Michael Gottesman, Lincoln
    Ritter, ab)

94. NUTCH-641 - IndexSorter incorrectly copies stored fields (ab)

95. NUTCH-645 - Parse-swf unit test failing (ab)

96. NUTCH-642 - Unit tests fail when run in non-local mode (ab)

97. NUTCH-639 - Change LuceneDocumentWrapper visibility from
    private to _public_ (Guillaume Smet via dogacan)

98. NUTCH-651 - Remove bin/{start|stop}-balancer.sh from svn
    tracking. (dogacan)

99. NUTCH-375 - Add support for Content-Encoding: deflated
    (Pascal Beis, ab)

100. NUTCH-633 - ParseSegment no longer allows reparsing.
     (dogacan)

101. NUTCH-653 - Upgrade to hadoop 0.18. (dogacan)

102. NUTCH-621 - Nutch needs to declare its crypto usage (mattmann)

103. NUTCH-654 - urlfilter-regex's main does not work.
     (dogacan)

104. NUTCH-640 - confusing description "set it to Integer.MAX_VALUE".
     (dogacan)

105. NUTCH-662 - Upgrade Nutch to use Lucene 2.4. (kubes)

106. NUTCH-663 - Upgrade Nutch to use Hadoop 0.19 (kubes)

107. NUTCH-647 - Resolve URLs tool (kubes)

108. NUTCH-665 - Search Load Testing Tool (kubes)

109. NUTCH-667 - Input Format for working with Content in Hadoop Streaming
     (kubes)

110. NUTCH-635 - LinkAnalysis Tool for Nutch. (kubes)

111. NUTCH-646 - New Indexing Framework for Nutch. (kubes)

112. NUTCH-668 - Domain URL Filter. (kubes)

113. NUTCH-594 - Serve Nutch search results in multiple formats including
     XML and JSON. (kubes)

114. NUTCH-442 - Integrate Solr/Nutch. (dogacan, original version by siren)

115. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
     fetch interval correctly. (dogacan)

116. NUTCH-627 - Minimize host address lookup (Otis Gospodnetic)

117. NUTCH-678 - Hadoop 0.19 requires an update of jets3t.
     (julien nioche via dogacan)

118. NUTCH-681 - parse-mp3 compilation problem.
     (Wildan Maulana via dogacan)

119. NUTCH-676 - MapWritable is written inefficiently and confusingly.
     (dogacan)

120. NUTCH-579 - Feed plugin only indexes one post per feed due to identical
     digest. (dogacan)

121. NUTCH-571 - parse-mp3 plugin doesn't always index album of mp3.
     (Joseph Chen, dogacan)

122. NUTCH-682 - SOLR indexer does not set boost on the document.
     (julien nioche via dogacan)

123. NUTCH-279 - Additions to urlnormalizer-regex (Stefan Neufeind, ab)

124. NUTCH-671 - JSP errors in Nutch searcher webapp (Edwin Chu via ab)

125. NUTCH-643 - ClassCastException in PDF parser (Guillaume Smet, ab)

126. NUTCH-636 - Httpclient plugin https doesn't work on IBM JRE
     (Curtis d'Entremont, ab)

127. NUTCH-683 - NUTCH-676 broke CrawlDbMerger. (dogacan)

128. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException
     (Stefan Will, siren)

129. NUTCH-691 - Update jakarta poi jars to the most relevant version
     (Dmitry Lihachev via siren)

130. NUTCH-563 - Include custom fields in BasicQueryFilter
     (Julien Nioche via siren)

131. NUTCH-695 - Incorrect mime type detection by MoreIndexingFilter plugin
     (Dmitry Lihachev via siren)

132. NUTCH-694 - Distributed Search Server fails (siren)

133. NUTCH-626 - Fetcher2 breaks out the domain with db.ignore.external.links
     set at cross domain redirects (Remco Verhoef, dogacan via siren)

134. NUTCH-247 - Robot parser to restrict (kubes, siren)

135. NUTCH-698 - CrawlDb is corrupted after a few crawl cycles (dogacan
     via siren)

136. NUTCH-699 - Add an "official" solr schema for solr integration (dogacan,
     Dmitry Lihachev via siren)

137. NUTCH-703 - Upgrade to Hadoop 0.19.1 (ab)

138. NUTCH-419 - Unavailable robots.txt kills fetch (Carsten Lehmann,
     Doug Cook via ab)

139. NUTCH-700 - Neko1.9.11 goes into a loop (Julien Nioche, siren)

140. NUTCH-669 - Consolidate code for Fetcher and Fetcher2 (siren)

141. NUTCH-711 - Indexer failing after upgrade to Hadoop 0.19.1 (ab)

142. NUTCH-684 - Dedup support for Solr. (dogacan)

143. NUTCH-715 - Subcollection plugin doesn't work with default
     subcollections.xml file (Dmitry Lihachev via siren)

144. NUTCH-722 - Nutch contains JAI jars that we cannot redistribute

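Illustration for NUTCH-620 (Release 1.0): the entry asks the URL normalizer to
collapse runs of slashes into a single slash, but this log does not show the
patch. The Python sketch below only illustrates the idea and is not Nutch's
BasicURLNormalizer code. The function name collapse_slashes is hypothetical,
and restricting the rewrite to the path component (so that "http://" and the
query string are left alone) is an assumption.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def collapse_slashes(url: str) -> str:
    """Collapse runs of '/' in the path; scheme, host and query stay untouched."""
    parts = urlsplit(url)
    # Replace two or more consecutive slashes in the path with one slash.
    path = re.sub(r"/{2,}", "/", parts.path)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(collapse_slashes("http://example.com//a///b/c?x=//y"))
```

Working on the parsed path rather than on the raw string avoids damaging the
"//" that follows the scheme.
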
Release 0.9 - 2007-04-02

 1. Changed log4j configuration to log to stdout on commandline
    tools (siren)

 2. NUTCH-344 - Fix for thread blocking issue (Greg Kim via siren)

 3. NUTCH-260 - Update hadoop version to 0.5.0 (Renaud Richardet,
    siren)

 4. Optionally skip pages with abnormally large values of Crawl-Delay
    (Dennis Kubes via ab)

 5. Change readdb -stats to use CombiningCollector (ab)

 6. NUTCH-348 - Fix Generator to select highest scoring pages (Chris
    Schneider and Stefan Groschupf via ab)

 7. NUTCH-347 - Adjust plugin build script not to emit warnings when copying
    dependent jars (siren)

 8. NUTCH-338 - Remove the text parser as an option for parsing PDF files
    in parse-plugins.xml (Chris A. Mattmann via siren)

 9. NUTCH-105 - Network error during robots.txt fetch causes file to
    be ignored (Greg Kim via siren)

10. NUTCH-367 - DistributedSearch throws ClassCastException (siren)

11. NUTCH-332 - Fix the problem of doubling scores caused by links pointing
    to the current page (e.g. anchors). (Stefan Groschupf via ab)

12. NUTCH-365 - Flexible URL normalization (ab)

13. NUTCH-336 - Differentiate between newly discovered pages and newly
    injected pages (Chris Schneider via ab) NOTE: this changes the
    scoring API, filter implementations need to be updated.

14. NUTCH-337 - Fetcher ignores the fetcher.parse value (Stefan Groschupf
    via ab)

15. NUTCH-350 - Urls blocked by http.max.delays incorrectly marked as GONE
    (Stefan Groschupf via ab)

16. NUTCH-374 - When http.content.limit is set to -1 and
    Response.CONTENT_ENCODING is gzip or x-gzip, nothing can be fetched
    (King Kong via pkosiorowski)

17. NUTCH-383 - upgrade to Hadoop 0.7.1 and Lucene 2.0.0. (ab)

  ****************************** WARNING !!! ********************************
  * This upgrade breaks data format compatibility. A tool 'convertdb'       *
  * was added to migrate existing CrawlDb-s to the new format. Segment data *
  * can be partially migrated using 'mergesegs', however segments will      *
  * require re-parsing (and consequently re-indexing).                      *
  ****************************** WARNING !!! ********************************

18. NUTCH-371 - DeleteDuplicates now correctly implements both parts of
    the algorithm. (ab)

19. NUTCH-391 - ParseUtil logs file contents to log file when it cannot
    find parser (siren)

20. NUTCH-379 - ParseUtil does not pass through the content's URL to the
    ParserFactory (Chris A. Mattmann via siren)

21. NUTCH-361, NUTCH-136 - When jobtracker is 'local' generate only one
    partition. (ab)

22. NUTCH-399 - Change CommandRunner to use concurrent api from jdk (siren)

23. NUTCH-395 - Increase fetching speed (siren)

24. NUTCH-388 - nutch-default.xml has outdated example for urlfilter.order
    (reported by Jared Dunne)

25. NUTCH-404 - Fix LinkDB Usage - implementation mismatch (siren)

26. NUTCH-403 - Make URL filtering optional in Generator (siren)

27. NUTCH-405 - Content object is not properly initialized in map method
    of ParseSegment (siren)

28. NUTCH-362 - Remove parse-text from unsupported filetypes in
    parse-plugins.xml (siren)

29. NUTCH-305 - Update crawl and url filter lists to exclude
    jpeg|JPEG|bmp|BMP, suffix-urlfilter.txt (contributed by Stefan
    Neufeind) is also updated (siren)

30. NUTCH-406 - Metadata tries to write null values (mattmann)

31. NUTCH-415 - Generator should mark selected records in CrawlDb.
    Due to increased resource consumption this step is optional.
    Application-level locking has been added to prevent concurrent
    modification of databases. (ab)

32. NUTCH-416 - CrawlDatum status and CrawlDbReducer refactoring. It is
    now possible to correctly update CrawlDb from multiple segments.
    Introduce new status codes for temporary and permanent
    redirection. (ab)

33. NUTCH-322 - Fix Fetcher to store redirected pages and to store
    protocol-level status. This also should fix NUTCH-273. (ab)

34. Change default Fetcher behavior not to follow redirects immediately.
    Instead Fetcher will record redirects as new pages to be added to CrawlDb.
    This also partially addresses NUTCH-273. (ab)

35. Detect and report when Generator creates 0-sized segments. (ab)

36. Fix Injector to preserve already existing CrawlDatum if the seed list
    being injected also contains such URL. (ab)

37. NUTCH-425, NUTCH-426 - Fix anchors pollution. Continue after
    skipping bad URLs. (Michael Stack via ab)

38. NUTCH-325 - UrlFilters.java throws NPE in case urlfilter.order contains
    Filters that are not in plugin.includes (Stefan Groschupf, siren)

39. NUTCH-421 - Allow a predetermined running order of indexing filters
    (Alan Tanaman, siren)

40. When indexing pages with redirection, drop all intermediate pages and
    index only the final page. (ab)

41. Upgrade to Hadoop 0.10.1. (ab)

42. NUTCH-420 - Fix a bug in DeleteDuplicates where results depended on the
    order in which IndexDoc-s are processed. (Dogacan Guney via ab)

43. NUTCH-428 - NullPointerException thrown when agent name is not
    configured properly. Changed to throw RuntimeException instead.
    (siren)

44. NUTCH-430 - Integer overflow in HashComparator.compare (siren)

45. NUTCH-68 - Add a tool to generate arbitrary fetchlists. (ab)

46. NUTCH-433 - java.io.EOFException in newer nightlies in mergesegs
    or indexing from hadoop.io.DataOutputBuffer (siren)

47. NUTCH-339 - Fetcher2: a queue-based fetcher implementation. (ab)

48. NUTCH-390 - Javadoc warnings (mattmann)

49. NUTCH-449 - Make junit output format configurable. (nigel via cutting)

50. NUTCH-432 - Fix a bug where platform name with spaces would break the
    bin/nutch script. (Brian Whitman via ab)

51. Upgrade to Hadoop 0.11.2 and Lucene 2.1.0 release. (ab)

52. NUTCH-167 - Observation of robots "noarchive" directive. (ab)

53. NUTCH-384 - Protocol-file plugin does not allow the parse plugins
    framework to operate properly (Heiko Dietze via mattmann)

54. NUTCH-233 - Wrong regular expression hangs reduce process forever (Stefan
    Groschupf via kubes)

55. NUTCH-436 - Incorrect handling of relative paths when the embedded URL
    path is empty (kubes)

56. Upgrade to Hadoop 0.12.1 release. (ab)

57. NUTCH-246 - Incorrect segment size being generated due to time
    synchronization issue (Stefan Groschupf via ab)

58. Upgrade to Hadoop 0.12.2 release. (ab)

59. NUTCH-333 - SegmentMerger and SegmentReader should use NutchJob. (Michael
    Stack and Dogacan Guney via kubes)

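Illustration for NUTCH-430 (Release 0.9): the entry reports an integer overflow
in HashComparator.compare but does not say how it arose. A common cause of this
class of bug is comparing two 32-bit integers by subtraction, and that cause is
assumed here. The Python sketch below is not Nutch code. It simulates 32-bit
wraparound with the hypothetical helper to_int32, and the comparator names are
hypothetical as well.

```python
def to_int32(value: int) -> int:
    """Wrap a Python int to a signed 32-bit value, as fixed-width int math would."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value >= 0x8000_0000 else value


def compare_by_subtraction(a: int, b: int) -> int:
    """Overflow-prone comparator: the difference may not fit in 32 bits."""
    return to_int32(a - b)


def compare_safely(a: int, b: int) -> int:
    """Overflow-free comparator: returns -1, 0 or 1 without subtracting."""
    return (a > b) - (a < b)


if __name__ == "__main__":
    a, b = 2_000_000_000, -2_000_000_000
    # a is greater than b, yet the wrapped difference comes out negative.
    print(compare_by_subtraction(a, b), compare_safely(a, b))
```

With these inputs the subtraction-based result has the wrong sign, which would
make a sort order inconsistent. The explicit comparison does not have that
problem.
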
Release 0.8 - 2006-07-25

 0. Totally new architecture, based on hadoop
    [http://lucene.apache.org/hadoop] (cutting)

 1. NUTCH-107 - Typo in plugin/urlfilter-*/plugin.xml. (Stephen Cross).

 2. NUTCH-108 - Log hosts that exceed generate.max.per.host.
    (Rod Taylor via cutting)

 3. NUTCH-88 - Enhance ParserFactory plugin selection policy
    (jerome)

 4. NUTCH-124 - Protocol-httpclient does not follow redirects when
    fetching robots.txt (cutting)

 5. NUTCH-130 - Be explicit about target JVM when building (1.4.x?)
    (stack@archive.org, cutting)

 6. NUTCH-114 - Getting number of urls and links from crawldb
    (Stefan Groschupf via ab)

 7. NUTCH-112 - Link in cached.jsp page to cached content is an
    absolute link (Chris A. Mattmann via jerome)

 8. NUTCH-135 - Http header meta data are case insensitive in the
    real world (Stefan Groschupf via jerome)

 9. NUTCH-145 - Build of war file fails on Chinese (zh) .xml files due
    to UTF-8 BOM (KuroSaka TeruHiko via siren)

10. NUTCH-121 - SegmentReader for mapred (Rod Taylor via ab)

11. Added support for OpenSearch (cutting)

12. NUTCH-142 - NutchConf should use the thread context classloader
    (Mike Cannon-Brookes via pkosiorowski)

13. NUTCH-160 - Use standard Java Regex library rather than
    org.apache.oro.text.regex (Rod Taylor via cutting)

14. NUTCH-151 - CommandRunner can hang after the main thread exec is
    finished and has inefficient busy loop (Paul Baclace via cutting)

15. NUTCH-174 - Problem encountered with ant during compilation

16. NUTCH-190 - ParseUtil drops reason for failed parse
    (stack@archive.org via ab)

17. NUTCH-169 - Remove static NutchConf (Marko Bauhardt via ab)

18. NUTCH-194 - NUTCH-169 introduced two tiny bugs (Marko Bauhardt via ab)

19. NUTCH-178 - Session creation in search.jsp must be "false"
    (YourSoft via siren)

20. NUTCH-200 - OpenSearch Servlet is broken
    (Marko Bauhardt via siren)

21. NUTCH-81 - Webapp only works when deployed in root
    (AJ Banck, Michael Nebel via siren)

22. NUTCH-139 - Standard metadata property names in the ParseData
    metadata (Chris A. Mattmann, jerome)

23. NUTCH-192 - Meta data support for CrawlDatum
    (Stefan Groschupf via ab)

24. NUTCH-52 - Parser plugin for MS Excel files
    (Rohit Kulkarni via jerome)

25. NUTCH-53 - Parser plugin for Zip files
    (Rohit Kulkarni via jerome)

26. NUTCH-137 - footer is not displayed in search result page
    (KuroSaka TeruHiko via siren)

27. NUTCH-118 - FAQ link points to invalid URL
    (Steve Betts via siren)

28. NUTCH-184 - Serbian (sr, Cyrillic) and Serbo-Croatian (sh, Latin)
    translation (Ivan Sekulovic via siren)

29. NUTCH-211 - FetchedSegments leave readers open (Stefan Groschupf
    via cutting)

30. NUTCH-140 - Add alias capability in parse-plugins.xml file that
    allows mimeType->extensionId mapping (Chris A. Mattmann via jerome)

31. NUTCH-214 - Added Links to web site to search mailing list
    (Jake Vanderdray via jerome)

32. NUTCH-204 - Multiple field values in HitDetails
    (Stefan Groschupf via jerome)

33. NUTCH-219 - file.content.limit & ftp.content.limit should be changed
    to -1 to be consistent with http (jerome)

34. NUTCH-221 - Prepare nutch for upcoming lucene 2.0 (siren)

35. NUTCH-91 - Empty encoding causes exception (Michael Nebel via
    pkosiorowski)

36. NUTCH-228 - Clustering plugin descriptor broken (Dawid Weiss via
    jerome)

37. NUTCH-229 - Improved handling of plugin folder configuration
    (Stefan Groschupf via ab)

38. NUTCH-206 - Search server throws InstantiationException (ab)

39. NUTCH-203 - ParseSegment throws InstantiationException (Marko Bauhardt
    via ab)

40. NUTCH-3 - Multi values of header discarded (Stefan Groschupf via ab)

41. Update to lucene 1.9.1 (cutting)

42. NUTCH-235 - Duplicate Inlink values (ab)

43. NUTCH-234 - Clustering extension code cleanups and a real
    JUnit test case for the current implementation (Dawid Weiss via ab)

44. NUTCH-210 - Context.xml file for Nutch web application
    (Chris A. Mattmann via jerome)

45. NUTCH-231 - Invalid CSS entries (AJ Banck via jerome)

46. NUTCH-232 - Search.jsp has multiple search forms creating
    invalid html / incorrect focus function (jerome)

47. NUTCH-196 - lib-xml and lib-log4j plugins (ab, jerome)

48. NUTCH-244 - Inconsistent handling of property values
    boundaries / unable to set db.max.outlinks.per.page to
    infinite (jerome)

49. NUTCH-245 - DTD for plugin.xml configuration files
    (Chris A. Mattmann via jerome)

50. NUTCH-250 - Generate to log truncation caused by
    generate.max.per.host (Rod Taylor via cutting)

51. NUTCH-125 - OpenOffice Parser plugin (ab)

52. Switch from using java.io.File to org.apache.hadoop.fs.Path.
    (cutting)

53. NUTCH-240 - Scoring API: extension point, scoring filters and
    an OPIC plugin (ab)

54. NUTCH-134 - Summarizer doesn't select the best snippets (jerome)

55. NUTCH-268 - Generator and lib-http use different definitions of
    "unique host" (ab)

56. NUTCH-280 - Url query causes NullPointerException (Grant Glouser
    via siren)

57. NUTCH-285 - LinkDb Fails rename doesn't create parent directories
    (Dennis Kubes via ab)

58. NUTCH-201 - Add support for subcollections
    (siren)

59. NUTCH-298 - If a 404 for a robots.txt is returned a NPE is thrown
    (Stefan Groschupf via jerome)

60. NUTCH-275 - Fetcher not parsing XHTML-pages at all (jerome)

61. NUTCH-301 - CommonGrams loads analysis.common.terms.file for each query
    (Stefan Groschupf via jerome)

62. NUTCH-110 - OpenSearchServlet outputs illegal xml characters
    (stack@archive.org via siren)

63. NUTCH-292 - OpenSearchServlet: OutOfMemoryError: Java heap space
    (Stefan Neufeind via siren)

64. NUTCH-307 - Wrong configured log4j.properties (jerome)

65. NUTCH-303 - Logging improvements (jerome)

66. NUTCH-308 - Maximum search time limit (ab)

67. NUTCH-306 - DistributedSearch.Client liveAddresses concurrency
    problem (Grant Glouser via siren)

68. Update to hadoop-0.4 (Milind Bhandarkar, cutting)

69. NUTCH-317 - Clarify what the queryLanguage argument of
    Query.parse(...) means (jerome)

70. Added alternative experimental web gui in contrib containing
    extensions like subcollection, keymatch, user preferences,
    caching, implemented mainly using tiles and jstl (siren)

71. NUTCH-320 - DmozParser does not output list of urls to stdout
    but to a log file instead. Original functionality restored.

72. NUTCH-271 - Add ability to limit crawling to the set of initially
    injected hosts (db.ignore.external.links) (Philippe Eugene,
    Stefan Neufeind via ab)

73. NUTCH-293 - Support for Crawl-Delay (Stefan Groschupf via ab)

74. NUTCH-327 - Fixed logging directory on cygwin (siren)

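Illustration for NUTCH-271 (Release 0.8): the entry adds the
db.ignore.external.links option, which limits crawling to the initially
injected hosts. The Python sketch below only illustrates the general idea of
dropping outlinks that leave the source page's host. It is not Nutch code, the
function name keep_internal_outlinks is hypothetical, and comparing exact
(lowercased) host names is an assumption about what counts as "external".

```python
from typing import List
from urllib.parse import urlsplit


def keep_internal_outlinks(page_url: str, outlinks: List[str]) -> List[str]:
    """Return only the outlinks whose host equals the host of the source page."""
    page_host = urlsplit(page_url).hostname  # hostname is lowercased by urlsplit
    return [link for link in outlinks if urlsplit(link).hostname == page_host]


if __name__ == "__main__":
    links = [
        "http://example.com/b",
        "http://other.org/c",
        "http://EXAMPLE.com/d",
    ]
    print(keep_internal_outlinks("http://example.com/a", links))
```
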
768Release 0.7 - 2005-08-17
769
770 1. Added support for "type:" in queries. Search results are limited/qualified
771    by mimetype or its primary type or sub type. For example,
772    (1) searching with "type:application/pdf" restricts results
773    to pages which were identified to be of mimetype "application/pdf".
774    (2) with "type:application", nutch will return pages of
775    primary type "application".
776    (3) with "type:pdf", only pages of sub type "pdf" will be listed.
777    (John Xing, 20050120)
778
779 2. Added support for "date:" in queries. Last-Modified is indexed.
780    Search results are restricted by lower and upper date (inclusive)
781    as date:yyyymmdd-yyyymmdd. For example, date:20040101-20041231
782    only returns pages with Last-Modified in year 2004.
783    (John Xing, 20050122)
784
785 3. Add URLFilter plugin interface and convert existing url filters into
786    plugins. (John Xing, 20050206)
787
788 4. Add UpdateSegmentsFromDb tool, which updates the scores and
789    anchors of existing segments with the current values in the web
790    db.  This is used by CrawlTool, so that pages are now only fetched
791    once per crawl.  (Doug Cutting, 20050221)
792
793 5. Moved code into org.apache.nutch sub-packages.  Changed license to
794    Apache 2.0.  Removed jar files whose licenses do not permit
795    redistribution by Apache.  Disabled compilation of plugins which
796    require these libraries.  (Doug Cutting 20050301)
797
798 6. Index host and title in separate fields.  Host was indexed
799    previously only as a part of the URL.  Title was indexed as an
800    anchor.  Now boosts for matching these fields may be adjusted
801    separately from boosts for matching anchors and url.  Also: move
802    site indexing to index-basic plugin to minimize the number of
803    times the URL needs to be parsed; and, stop using anchor analyzer
804    for anything but anchors.  (Piotr Kosiorowski via Doug Cutting
805    20050323)
806
807 7. Add servlet Cached.java that serves cached Content of any mime type.
808    Slightly modified are web.xml and cached.jsp.
809    (John Xing, 20050401)
810
811 8. Add skipCompressedByteArray() to WritableUtils.java.
812    (John Xing, 20050402)
813
814 9. Fixes to jsp and static web pages.  These now use relative links,
815    so that the Nutch webapp file can be used in places other than at
816    the root.  Also fixed links to the about and help pages.  Bug #32.
817    (Jerome Charron via cutting, 20050404)
818
81910. Added some features to DistributedSearch: new segments can be added
820    to search servers without restarting the frontend; defective search
821    servers are not queried until they come back online; and a watchdog
822    keeps an eye on your search servers and writes simple statistics.
823    (Sami Siren, 20050407)
824   
82511. Fix for bug #4 - Unbalanced quote in query eats all resources.
826    (Piotr Kosiorowski, Sami Siren, 20050407)
827
82812. Close Issue #33 - MIME content type detector (using magic char sequences).
829    (Jerome Charron and Hari Kodungallur via John Xing, 20050416)
830
83113. Add a servlet that implements A9's OpenSearch RSS web service.
832    (cutting, 20050418)
833
83414. Remove references to link analysis from tutorial, and enable
835    scoring by link count when generating fetchlists and searching.
836    (cutting, 20040419)
837
83815. Make query boosts for host, title, anchor and phrase matches
839    configurable.  (Piotr Kosiorowski via cutting, 20050419)
840
84116. Add support for sorting search results and search-time deduping by
842    fields other than site.
843
84417. Automatically convert range queries into cached range filters.
845    This improves the performance and scalability of, e.g., date range
846    searching.
847
84818. Several methods have been renamed due to misspellings.  The old
849    methods have been deprecated and will be removed before the 1.0
850    release.
851
852
853Release 0.6
854
855 1. Added clustering-carrot2 plugin, together with introduction of clustering
856    api and modification to search jsp. (Dawid Weiss via John Xing, 20040809)
857
858 2. Make a number of changes to NDFS (Nutch Distributed File System)
859    to fix bugs, add admin tools, etc.
860
861    Also, modify all command line tools so you can indicate whether to
862    use NDFS or the local filesystem.  If you indicate nothing, then
863    it defaults to the local fs.
864
865    I've used this to do a 35m page crawl via NDFS, distributed over a
866    dozen machines.  (Mike Cafarella)
867
868 3. Add support for BASE tags in HTML.  Outlinks are now correctly
869    extracted when a BASE tag is present.  (cutting)
870
871 4. Fix two bugs in result pagination.  When the last hit on a page
872    was the last hit overall, the "next" button was sometimes shown
873    when the "show all" button should be shown instead.  Also, in
874    certain cases, the "show all" button would be shown when the
875    "next" button should have been shown.  (cutting)
876
877 5. Add config parameter "indexer.max.tokens" that determines the
878    maximum number of tokens indexed per field.  (Andy Hedges via cutting)
879
880 6. Add parser for mp3 files.  (Andy Hedges via cutting)
881
882 7. Add RegexUrlNormalizer.  This is useful for things like stripping
883    out session IDs from URLs.  To use it, add values for
884    urlnormalizer.class and urlnormalizer.regex.file to your
885    nutch-site.xml.  The RegexUrlNormalizer class extends the
886    BasicUrlNormalizer, and does basic normalization as well.
887    (Luke Baker via cutting)
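
    The idea of regex-driven normalization, such as stripping session
    IDs, can be sketched as an ordered list of pattern/replacement
    rules. This is an illustrative Java sketch only: the class
    RegexRuleNormalizer and the jsessionid pattern are assumptions made
    for the example, and it does not reproduce the format of the file
    named by urlnormalizer.regex.file or the basic normalization that
    the real class inherits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch (not Nutch code): applies regex rules to a URL in order. */
final class RegexRuleNormalizer {
    private static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(Pattern pattern, String replacement) {
            this.pattern = pattern;
            this.replacement = replacement;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    RegexRuleNormalizer addRule(String regex, String replacement) {
        rules.add(new Rule(Pattern.compile(regex), replacement));
        return this;
    }

    /** Applies every rule, in insertion order, to the whole URL string. */
    String normalize(String url) {
        String result = url;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.replacement);
        }
        return result;
    }

    public static void main(String[] args) {
        RegexRuleNormalizer normalizer = new RegexRuleNormalizer()
            // Assumed example rule: drop a ";jsessionid=..." path parameter.
            .addRule("(?i);jsessionid=[^?#]*", "");
        System.out.println(
            normalizer.normalize("http://example.com/page;jsessionid=ABC123?x=1"));
    }
}
```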
888
889 8. Added Swedish translation (Stefan Verzel via Sami Siren, 20040910)
890
891 9. Added Polish translation (Andrzej Bialecki, 20040911)
892 
89310. Added 3 more language profiles to language identifier (ru,hu,pl).
894    Other changes to language identifier: profiles converted to UTF-8,
895    added some test cases, changed the similarity calculation.
896    (Sami Siren, 20040925)
897
89811. Added plugin parse-rtf (Andy Hedges via John Xing, 20040929)
899
90012. Added plugin index-more and more.jsp (John Xing, 20041003)
901
90213. Added "View as Plain Text" feature. A new op OP_PARSETEXT is introduced
903    in DistributedSearch.java. text.jsp is added. (John Xing, 20041006)
904
90514. Fixed a bug that caused cached.jsp, explain.jsp, anchors.jsp and text.jsp
906    (but not search.jsp) to fail with NullPointerException in distributed
907    search. This bug seems to have appeared after the "hits per site" code
908    was added. The fix is in Hit.java, making sure String site is never null.
909    Hopefully this fix has no bad effect on the "hits per site" code.
910    (John Xing, 20041006)
911
91215. Fixed a bug that fails fullyDelete() in FileUtil.java for
913    LocalFileSystem.java. This bug also exposes possible incompleteness
914    of NDFSFile.java, where a few methods are not supported, including
915    delete(). Nothing changed in NDFSFile.java though. Leave it for future
916    improvement (John Xing, 20041022).
917
91816. Introduced option -noParsing to Fetcher.java and added ParseSegment.java.
919    A new status code CANT_PARSE is added to FetcherOutput.java.
920    Without option -noParsing, fetcher behavior is unchanged. With
921    option -noParsing, the fetcher only crawls; no parsing is carried out.
922    ParseSegment.java should then be used to parse in a separate pass.
923    (John Xing, 20041025)
924
92517. Added ontology plugin. Currently it is used for query refinement, as
926    exemplified in refine-query-init.jsp and refine-query.jsp. By default,
927    query refinement is disabled in search.jsp. Please check
928    ./src/plugin/ontology/README.txt for further description.
929    Ontology plugin certainly can be used for many other things.
930    (Michael J. Pan via John Xing, 20041129)
931 
93218. Changed fetcher.server.delay to be a float, so that sub-second
933    delays can be specified.  (cutting)
934
93519. Added plugin.includes config parameter that determines which
936    plugins are included.  By default now only http, html and basic
937    indexing and search plugins are enabled, rather than all plugins.
938    This should make default performance more predictable and reliable
939    going forward. (cutting)
940
94120. Cleaned up some filesystem code, including:
942
943    - Replaced BufferedRandomAccessFile with two simpler utilities,
944      NFSDataInputStream and NFSDataOutputStream.
945
946    - Fixed the bug where SequenceFiles were no longer flushed when
947      created, so that, when fetches crashed, segments were
948      unreadable.  Now segments are always readable after crashes.
949      Only the contents of the last buffer are lost.
950
951    - Simplified the FSOutputStream API to not include seek().  We
952      should never need that functionality.
953
954    - Simplified LocalFileSystem's implementations of FSInputStream
955      and FSOutputStream and optimized FSInputStream.seek().
956
957    (cutting)
958
95921. Fixed BasicUrlNormalizer to better handle relative urls.  The file
960    part of a URL is normalized in the following manner:
961
962      1. "/aa/../" will be replaced by "/". This is done step by step until
963         the url doesn't change anymore. This ensures that
964         "/aa/bb/../../" will be replaced by "/", too.
965
966      2. leading "/../" will be replaced by "/"
967
968    (Sven Wende via cutting)
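
    The two rules above can be sketched as repeated regex replacement.
    This is an illustrative Java sketch, not the BasicUrlNormalizer
    source; the class name RelativePathNormalizer is hypothetical, and
    repeating rule 2 for several leading "/../" segments is an
    assumption of the sketch.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch (not Nutch code) of the two path rules described above. */
final class RelativePathNormalizer {
    // A path segment followed by "/../"; the lookahead keeps ".." itself
    // from being treated as the segment to remove.
    private static final Pattern SEGMENT_THEN_PARENT =
        Pattern.compile("/(?!\\.\\./)[^/]+/\\.\\./");
    private static final Pattern LEADING_PARENT = Pattern.compile("^/\\.\\./");

    static String normalize(String path) {
        String current = path;
        String previous;
        // Rule 1: collapse "/aa/../" to "/" one step at a time until stable.
        do {
            previous = current;
            current = SEGMENT_THEN_PARENT.matcher(current).replaceFirst("/");
        } while (!current.equals(previous));
        // Rule 2: replace a leading "/../" with "/" (repeated here by assumption).
        do {
            previous = current;
            current = LEADING_PARENT.matcher(current).replaceFirst("/");
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        System.out.println(normalize("/aa/bb/../../"));
        System.out.println(normalize("/../index.html"));
    }
}
```

    Replacing one match per pass mirrors the "step by step" wording:
    "/aa/bb/../../" first becomes "/aa/../" and then "/".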
969
97022. Fix Page constructors so that next fetch date is less likely to be
971    misconstrued as a float.  This patches a problem in WebDBInjector,
972    where new pages were added to the db with nextScore set to the
973    intended nextFetch date.  This, in turn, confused link analysis.
974
97523. In ndfs code, replace addLocalFile(), putToLocalFile() with
976    copyFromLocalFile(), moveFromLocalFile(), copyToLocalFile() and
977    moveToLocalFile(). (John Xing, 20041217)
978
97924. Added new config parameter fetcher.threads.per.host.  This is used
980    by the Http protocol.  When this is one, behavior is as before.
981    When this is greater than one, multiple threads are permitted
982    to access a host at once.  Note that fetcher.server.delay is no
983    longer consistently observed when this is greater than one.
984    (Luke Baker via Doug Cutting)
985
986Release 0.5
987
988 1. Changed plugin directory to be a list of directories.
989
990 2. Permit Plugin to be the default plugin implementation.
991
992 3. Added pluggable interface for network protocols in new package
993    net.nutch.protocol.  Moved http code from core into a plugin.
994
995 4. Added pluggable interface for content parsing in new package
996    net.nutch.parse.  Moved html parsing code from core into a
997    plugin.
998
999 5. Fixed a bug in NutchAnalysis where 16-bit characters were not
1000    processed correctly.
1001
1002 6. Fixed bug #971731: random summaries on result page.
1003    (Daniel Naber via cutting)
1004
1005 7. Made Nutch logo transparent. (Daniel Naber via cutting)
1006
1007 8. Added file protocol plugin.  (John Xing via cutting)
1008
1009 9. Added ftp protocol plugin.  (John Xing via cutting)
1010
101110. Added pdf and msword parser plugins.  (John Xing via cutting)
1012
101311. Added pluggable indexing interface.  By default, url, content,
1014    anchors and title are indexed, as before, but now one can easily
1015    alter this to, e.g., index metadata.  A demonstration is provided
1016    which extracts and indexes Creative Commons license urls. (cutting)
1017
101812. Add language identification plugin.
1019
1020    The process of identification is as follows:
1021
1022    1. html (html only, HTML 4.0 "lang" attribute)
1023    2. meta tags (html only, http-equiv, dc.language)
1024    3. http header (Content-Language)
1025    4. if all above fail "statistical analysis"
1026
1027    1 & 2 are run during the fetching phase and 3 & 4 are run on
1028    indexing phase.
1029
1030    Currently supported languages (in "statistical analysis") are
1031    da,de,el,en,es,fi,fr,it,nl,sv and pt. The corpus used was grabbed
1032    from http://www.isi.edu/~koehn/europarl/ and the profiles were
1033    built with the tool supplied in the patch.
1034
1035    After indexing, the language can be found in the field named "lang".
1036
1037    It's not 100% accurate but it's a start.
1038    (Sami Siren)
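
    The fallback order listed above can be sketched as a chain that
    returns the first source that yields a language. This is an
    illustrative Java sketch, not the plugin's code: the class
    LanguageFallback and its parameters are hypothetical, and it
    collapses into one method what the plugin splits between the
    fetching and indexing phases.

```java
import java.util.Locale;
import java.util.function.Supplier;

/** Hypothetical sketch (not Nutch code) of the identification order above. */
final class LanguageFallback {
    /**
     * Returns the first non-blank value among the html "lang" attribute,
     * the meta tags and the Content-Language header; otherwise falls back
     * to the statistical guess, which is only computed when needed.
     */
    static String identify(String htmlLangAttribute,
                           String metaLanguage,
                           String contentLanguageHeader,
                           Supplier<String> statisticalGuess) {
        String[] candidates = {htmlLangAttribute, metaLanguage, contentLanguageHeader};
        for (String candidate : candidates) {
            if (candidate != null && !candidate.trim().isEmpty()) {
                return candidate.trim().toLowerCase(Locale.ROOT);
            }
        }
        return statisticalGuess.get();
    }

    public static void main(String[] args) {
        // No html or meta information: the header value is used.
        System.out.println(identify(null, "", "FI", () -> "en"));
        // Nothing declared anywhere: the statistical guess is used.
        System.out.println(identify(null, null, null, () -> "sv"));
    }
}
```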
1039
104013. Added SegmentMergeTool and "mergesegs" command, which remove
1041    duplicated or otherwise unused content from several segments and
1042    join them together into a single new segment.  The tool also
1043    optionally performs several other steps required for proper
1044    operation of Nutch - such as indexing segments, deleting
1045    duplicates, merging indices, and indexing the new single segment.
1046    (Andrzej Bialecki)
1047
104814. Add the ability to retrieve ParseData of a search hit. ParseData
1049    contains many valuable properties of a search hit.
1050
1051    This is required (among others) to properly display the cached
1052    content because it's not possible to determine the character
1053    encoding from the output of the getContent() method (which returns
1054    byte[]). The symptoms are that for HTML pages using non-latin1 or
1055    non-UTF8 encodings the cached preview will almost certainly look
1056    broken. Using the attached patch it is possible to determine the
1057    character encoding from the ParseData (for HTTP: Content-Type
1058    metadata), and encode the content accordingly. (Andrzej Bialecki)
1059
106015. Add a pluggable query interface.  By default, the content, anchor
1061    and url fields are searched as before.  A sample plugin indexes
1062    the host name and adds a "site:" keyword to query parsing.
1063
106416. Added support for "lang:" in queries.  For example, searching with
1065    "lang:en" restricts results to pages which were identified to
1066    be in English.
1067
106817. Automatically optimize field queries to use cached Lucene filters.
1069    This makes, for example, searches restricted by languages or sites
1070    that are very common much faster.
1071
107218. Improved charset handling in jsp pages.  (jshin via cutting)
1073
107419. Permit topic filtering when injecting DMOZ pages.  (jshin via cutting)
1075
107620. When parsing crawled pages, interpret charset specifications in
1077    html meta tags.  (jshin via cutting)
1078
107921. Added support for "cc:licensed" in queries, which searches for documents
1080    released under Creative Commons licenses.  Attributes of the
1081    license may also be queried, with, e.g., "cc:by" for
1082    attribution-required licenses, "cc:nc" for non-commercial
1083    licenses, etc.
1084
108522. Relative paths named in plugin.folders are now searched for on the
1086    classpath.  This makes, e.g., deployment in a war file much simpler.
1087
108823. Modifications to Fetcher.java.
1089
1090    1. Make sure it works properly with regard to creation and initialization
1091    of plugin instances. The problem was that multiple threads race to
1092    startUp() or shutDown() plugin instances. It was solved by synchronizing
1093    certain code in PluginRepository.java and Extension.java.
1094    (Stefan Groschupf via John Xing)
1095
1096    2. Added code to explicitly shutDown() plugins. Otherwise FetcherThreads
1097    may never return (quit) if there are still data or other structures
1098    (e.g., persistent socket connections) associated with plugins. (John Xing)
1099   
1100    3. Fixed one type of Fetcher "hang" problem by monitoring named
1101    FetcherThreads. If all FetcherThreads are gone (finished),
1102    Fetcher.java is considered done. The problem was: there could be
1103    runaway threads started by external libs via FetcherThreads.
1104    Those threads never return, thus keeping Fetcher from exiting normally.
1105    (John Xing)
1106
110724. Eliminate excessive hits from sites.  This is done efficiently by
1108    adding the site name to Hit instances, and, when needed,
1109    re-querying with too-frequent sites prohibited in the query.
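
    The approach described above can be sketched as a per-site cap over
    ranked hits, plus a rewritten query that prohibits the sites that
    exceeded the cap. This is an illustrative Java sketch, not the Nutch
    implementation: the class SiteLimiter, its Hit type and the
    "-site:" clause built by prohibit() are assumptions made for the
    example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch (not Nutch code) of limiting hits per site. */
final class SiteLimiter {
    static final class Hit {
        final String url;
        final String site; // the site name carried on each hit

        Hit(String url, String site) {
            this.url = url;
            this.site = site;
        }
    }

    /** Keeps at most maxPerSite hits per site; records sites that exceeded it. */
    static List<Hit> limit(List<Hit> ranked, int maxPerSite, Set<String> tooFrequentSites) {
        Map<String, Integer> seen = new HashMap<>();
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : ranked) {
            int count = seen.merge(hit.site, 1, Integer::sum);
            if (count <= maxPerSite) {
                kept.add(hit);
            } else {
                tooFrequentSites.add(hit.site);
            }
        }
        return kept;
    }

    /** Builds a follow-up query that excludes the too-frequent sites. */
    static String prohibit(String query, Set<String> sites) {
        StringBuilder rewritten = new StringBuilder(query);
        for (String site : sites) {
            rewritten.append(" -site:").append(site);
        }
        return rewritten.toString();
    }

    public static void main(String[] args) {
        List<Hit> ranked = new ArrayList<>();
        ranked.add(new Hit("http://a.example/1", "a.example"));
        ranked.add(new Hit("http://a.example/2", "a.example"));
        ranked.add(new Hit("http://a.example/3", "a.example"));
        ranked.add(new Hit("http://b.example/1", "b.example"));
        Set<String> tooFrequent = new TreeSet<>();
        System.out.println(limit(ranked, 2, tooFrequent).size());
        System.out.println(prohibit("nutch", tooFrequent));
    }
}
```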
1110
1111
1112Release 0.4
1113
1114 1. Http class refactored.  (Kevin Smith via Tom Pierce)
1115
1116 2. Add Finnish translation. (Sampo Syreeni via Doug Cutting)
1117
1118 3. Added Japanese translation. (Yukio Andoh via Doug Cutting)
1119
1120 4. Updated Dutch translation. (Ype Kingma via Doug Cutting)
1121
1122 5. Initial version of Distributed DB code.  (Mike Cafarella)
1123
1124 6. Make things more tolerant of crashed fetcher output files.
1125    (Doug Cutting)
1126
1127 7. New skin for website. (Frank Henze via Doug Cutting)
1128
1129 8. Added Spanish translation. (Diego Basch via Doug Cutting)
1130
1131 9. Add FTP support to fetcher.  (John Xing via Doug Cutting)
1132
113310. Added Thai translation. (Pichai Ongvasith via Doug Cutting)
1134
113511. Added Robots.txt & throttling support to Fetcher.java.  (Mike
1136    Cafarella)
1137
113812. Added nightly build. (Doug Cutting)
1139
114013. Default all link scores to 1.0. (Doug Cutting)
1141
114214. Permit one to keep internal links. (Doug Cutting)
1143
114415. Fixed dedup to select shortest URL. (Doug Cutting)
1145
114616. Changed index merger so that merged index is written to named
1147    directory, rather than to a generated name in that directory.
1148    (Doug Cutting)
1149
115017. Disable coordination weighting of query clauses and other minor
1151    scoring improvements. (Doug Cutting)
1152
115318. Added a new command, crawl, that constructs a database, injects a
1154    url file and performs a few rounds of generate/fetch/updatedb.
1155    This simplifies use for intranet sites.  Changed some defaults to
1156    be more intranet friendly.  (Doug Cutting)
1157
115819. Fixed a bug where Fetcher.java didn't construct correct relative
1159    links when a page was redirected.  (Doug Cutting)
1160
116120. Fixed a query parser problem with lookahead over plusses and minuses.
1162    (Doug Cutting)
1163
116421. Add support for HTTP proxy servers.  (Sami Siren via Doug Cutting)
1165
116622. Permit searching while fetching and/or indexing.
1167    (Sami Siren via Doug Cutting)
1168
116923. Fix a bug when throttling is disabled.  (Sami Siren via Doug Cutting)
1170
117124. Updated Bahasa Malaysia translation.  (Michael Lim via Doug Cutting)
1172
117325. Added Catalan translation.  (Xavier Guardiola via Doug Cutting)
1174
117526. Added Brazilian Portuguese translation.
1176    (A. Moreir via Doug Cutting)
1177
117827. Added a French translation.  (Julien Nioche via Doug Cutting)
1179
118028. Updated to Lucene 1.4RC3.  (Doug Cutting)
1181
118229. Add capability to boost by link count & use it in crawl tool.
1183    (Doug Cutting)
1184
118530. Added plugin system.  (Stefan Groschupf via Doug Cutting)
1186
118731. Add this change log file, for recording significant changes to
1188    Nutch.  Populate it with changes from the last few months.