source: nutchez-0.1/CHANGES.txt @ 66

Last change on this file since 66 was 66, checked in by waue, 15 years ago
NutchEz - an easy way to nutch

1	Nutch Change Log
2
3	Release 1.0 - 2009-03-23
4
5	1. NUTCH-474 - Fetcher2 crawlDelay and blocking fix (Dogacan Guney via ab)
6
7	2. NUTCH-443 - Allow parsers to return multiple Parse objects.
8	(Dogacan Guney et al, via ab)
9
10	3. NUTCH-393 - Indexer should handle null documents returned by filters.
11	(Eelco Lempsink via ab)
12
13	4. NUTCH-456 - Parse msexcel plugin speedup (Heiko Dietze via siren)
14
15	5. NUTCH-446 - RobotRulesParser should ignore Crawl-delay values of other
16	bots in robots.txt (Dogacan Guney via siren)
17
18	6. NUTCH-482 - Remove redundant plugin lib-log4j (siren)
19
20	7. NUTCH-483 - Remove redundant commons-logging jar from ontology plugin
21	(siren)
22
23	8. NUTCH-161 - Change Plain text parser to
24	use parser.character.encoding.default property for fall back encoding
25	(KuroSaka TeruHiko, siren)
26
27	9. NUTCH-61 - Support for adaptive re-fetch interval and detection of
28	unmodified content. (ab)
29
30	10. NUTCH-392 - OutputFormat implementations should pass on Progressable.
31	(cutting via ab)
32
33	11. NUTCH-495 - Unnecessary delays in Fetcher2 (dogacan)
34
35	12. NUTCH-443 - Allow parsers to return multiple Parse objects; this will speed
36	up the rss parser (dogacan via mattmann). This update is a fix and semantics
37	change from the original patch for NUTCH-443. The original patch did not tell
38	the Indexer to read crawl_parse too so that it can pick up sub-urls' fetch
39	datums. This patch addresses that issue. Now, if Fetcher gets a null content,
40	instead of pushing an empty content, it filters the null content.
41
42	13. NUTCH-485 - Change HtmlParseFilter's to return ParseResult object instead of
43	Parse object. (Gal Nitzan via dogacan)
44
45	14. NUTCH-489 - URLFilter-suffix management of the url path when the url contains
46	some query parameters. (Emmanuel Joke via dogacan)
47
48	15. NUTCH-502 - Bug in SegmentReader causes infinite loop.
49	(Ilya Vishnevsky via dogacan)
50
51	16. NUTCH-444 - Possibly use a different library to parse RSS feed for improved
52	performance and compatibility. This patch introduced a new plugin, feed,
53	that includes an index filter and a parse plugin for feeds that uses ROME.
54	There was discussion to remove parse-rss, in light of the feed plugin,
55	however, this patch does not explicitly remove parse-rss. (dogacan, mattmann)
56
57	17. NUTCH-471 - Fix synchronization in NutchBean creation.
58	(Enis Soztutar via dogacan)
59
60	18. Upgrade to Lucene 2.2.0 and Hadoop 0.12.3. (ab)
61
62	19. NUTCH-468 - Scoring filter should distribute score to all outlinks at
63	once. (dogacan)
64
65	20. NUTCH-504 - NUTCH-443 broke parsing during fetching. (dogacan)
66
67	21. NUTCH-497 - Extreme Nested Tags causes StackOverflowException in
68	DomContentUtils...Spider Trap. (kubes)
69
70	22. NUTCH-434 - Replace usage of ObjectWritable with something based on
71	GenericWritable. (dogacan)
72
73	23. NUTCH-499 - Refactor LinkDb and LinkDbMerger to reuse code. (dogacan)
74
75	24. NUTCH-498 - Use Combiner in LinkDb to increase speed of linkdb generation.
76	(Espen Amble Kolstad via dogacan)
77
78	25. NUTCH-507 - lib-lucene-analyzers jar definition is wrong in plugin.xml.
79	(Emmanuel Joke via dogacan)
80
81	26. NUTCH-503 - Generator exits incorrectly for small fetchlists.
82	(Vishal Shah via dogacan)
83
84	27. NUTCH-505 - Outlink urls should be validated. (dogacan)
85
86	28. NUTCH-510 - IndexMerger delete working dir. (Enis Soztutar via dogacan)
87
88	29. NUTCH-513 - suffix-urlfilter.txt does not have a template. (dogacan)
89
90	30. NUTCH-515 - Next fetch time is set incorrectly. (dogacan)
91
92	30. NUTCH-506 - Nutch should delegate compression to Hadoop. (dogacan)
93
94	31. NUTCH-517 - build encoding should be UTF-8. (Enis Soztutar via dogacan).
95
96	32. NUTCH-518 - Fix OpicScoringFilter to respect scoring filter chaining.
97	(Enis Soztutar via dogacan)
98
99	33. NUTCH-516 - Next fetch time is not set when it is a
100	CrawlDatum.STATUS_FETCH_GONE. (Emmanuel Joke via dogacan)
101
102	34. NUTCH-525 - DeleteDuplicates generates ArrayIndexOutOfBoundsException
103	when trying to rerun dedup on a segment. (Vishal Shah via dogacan)
104
105	35. NUTCH-514 - Indexer should only index pages with fetch status SUCCESS.
106	(dogacan) Note: There is a bigger problem, i.e., how to deal
107	with redirected pages, and this issue can be considered as a band-aid
108	for the time being. See NUTCH-273 and NUTCH-353 for more details.
109
110	36. NUTCH-533 - LinkDbMerger: url normalized is not updated in the key and
111	inlinks list. (Emmanuel Joke via dogacan)
112
113	37. NUTCH-535 - ParseData's contentMeta accumulates unnecessary values during
114	parse. (dogacan)
115
116	38. NUTCH-522 - Use URLValidator in the Injector. (Emmanuel Joke, dogacan)
117
118	39. NUTCH-536 - Reduce number of warnings in nutch core. (dogacan)
119
120	40. NUTCH-439 - Top Level Domains Indexing / Scoring. Also adds
121	domain-related utilities. (Enis Soztutar via dogacan)
122
123	41. NUTCH-544 - Upgrade Carrot2 clustering plugin to the newest stable
124	release (2.1). (Dawid Weiss via dogacan)
125
126	42. NUTCH-545 - Configuration and OnlineClusterer get initialized in every
127	request. (Dawid Weiss via dogacan)
128
129	43. NUTCH-532 - CrawlDbMerger: wrong computation of last fetch time.
130	(Emmanuel Joke via dogacan)
131
132	44. NUTCH-550 - Parse fails if db.max.outlinks.per.page is -1. (dogacan)
133
134	45. NUTCH-546 - file URLs are filtered out by the crawler. (dogacan)
135
136	46. NUTCH-554 - Generator throws IOException on invalid urls.
137	(Brian Whitman via ab)
138
139	47. NUTCH-529 - NodeWalker.skipChildren doesn't work for more than 1 child.
140	(Emmanuel Joke via dogacan)
141
142	48. NUTCH-25 - needs 'character encoding' detector.
143	(Doug Cook, dogacan, Marcin Okraszewski, Renaud Richardet via dogacan)
144
145	49. NUTCH-508 - ${hadoop.log.dir} and ${hadoop.log.file} are not propagated
146	to the tasktracker. (Mathijs Homminga, Emmanuel Joke via dogacan)
147
148	50. NUTCH-562 - Port mime type framework to use Tika mime detection framework.
149	(mattmann)
150
151	51. NUTCH-488 - Avoid parsing unnecessary links and get a more relevant outlink
152	list. (Emmanuel Joke, Marcin Okraszewski via kubes)
153
154	52. NUTCH-501 - Implement a different caching mechanism for objects cached in
155	configuration. (dogacan)
156
157	53. NUTCH-552 - Upgrade Nutch to Hadoop 0.15.x. (kubes)
158
159	54. NUTCH-565 - Arc File to Nutch Segments Converter. (kubes)
160
161	55. NUTCH-547 - Redirection handling: YahooSlurp's algorithm.
162	(dogacan, kubes via dogacan)
163
164	56. NUTCH-548 - Move URLNormalizer from Outlink to ParseOutputFormat.
165	(Emmanuel Joke via dogacan)
166
167	57. NUTCH-538 - Delete unused classes under o.a.n.util. (dogacan)
168
169	58. NUTCH-494 - FindBugs: CrawlDbReader and DeleteDuplicates. (dogacan)
170
171	59. NUTCH-574 - Including inlink anchor text in index can create irrelevant
172	search results. Created index-anchor plugin, removed functionality from
173	index-basic plugin. For backwards compatibility, add index-anchor plugin to
174	nutch-site.xml plugin.includes. (kubes)
175
176	60. NUTCH-581 - DistributedSearch does not update search servers added to
177	search-servers.txt on the fly. (Rohan Mehta via kubes)
178
179	61. NUTCH-586 - Add option to run compiled classes without job file
180	(enis via ab)
181
182	62. NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy
183	server. (Susam Pal via dogacan)
184
185	63. NUTCH-534 - SegmentMerger: add -normalize option (Emmanuel Joke via ab)
186
187	64. NUTCH-528 - CrawlDbReader: add some new stats + dump into a CSV format
188	(Emmanuel Joke via ab)
189
190	65. NUTCH-597 - NPE in Fetcher2 (Remco Verhoef via ab)
191
192	66. NUTCH-584 - urls missing from fetchlist (Ruslan Ermilov, ab)
193
194	67. NUTCH-580 - Remove deprecated hadoop api calls (FS) (siren)
195
196	68. NUTCH-587 - Upgrade to Hadoop 0.15.3 (kubes)
197
198	69. NUTCH-604 - Upgrade to Lucene 2.3.0 (ab)
199
200	70. NUTCH-602 - Allow configurable number of handlers for search servers
201	(hartbecke via kubes)
202
203	71. NUTCH-607 - Update build.xml to include tika jar when building war (kubes)
204
205	72. NUTCH-608 - Upgrade nutch to use released apache-tika-0.1-incubating (mattmann)
206
207	73. NUTCH-606 - Refactoring of Generator, run all urls through checks (kubes)
208
209	74. NUTCH-605 - Change deprecated configuration methods for Hadoop (kubes)
210
211	75. NUTCH-603 - Add more default url normalizations (kubes)
212
213	76. NUTCH-611 - Upgrade Nutch to use Hadoop 0.16 (kubes)
214
215	77. NUTCH-44 - Too many search results, limits max results returned from a
216	single search. (Emilijan Mirceski and Susam Pal via kubes)
217
218	78. NUTCH-567 - Proper (?) handling of URIs in TagSoup. TagSoup library is
219	updated to 1.2 version. (dogacan)
220
221	79. NUTCH-613 - Empty summaries and cached pages (kubes via ab)
222
223	80. NUTCH-612 - URL filtering was disabled in Generator when invoked
224	from Crawl (Susam Pal via ab)
225
226	81. NUTCH-601 - Recrawling on existing crawl directory (Susam Pal via ab)
227
228	82. NUTCH-575 - NPE in OpenSearchServlet (John H. Lee via ab)
229
230	83. NUTCH-126 - Fetching https does not work with a proxy (Fritz Elfert via ab)
231
232	84. NUTCH-615 - Redirected URL-s fetched without setting fetchInterval.
233	Guard against reprUrl being null. (Emmanuel Joke, ab)
234
235	85. NUTCH-616 - Reset Fetch Retry counter when fetch is successful (Emmanuel
236	Joke, ab)
237
238	86. NUTCH-220 - Upgrade to PDFBox 0.7.3 (ab)
239
240	87. NUTCH-223 - Crawl.java uses Integer.MAX_VALUE (Jeff Ritchie via ab)
241
242	88. NUTCH-598 - Remove deprecated use of ToolBase. Use generics in Hadoop API.
243	(Emmanuel Joke, dogacan, ab)
244
245	89. NUTCH-620 - BasicURLNormalizer should collapse runs of slashes into a
246	single slash. (Mark DeSpain via ab)
247
248	90. NUTCH-500 - Add hadoop masters configuration file into conf folder.
249	(Emmanuel Joke via kubes)
250
251	91. NUTCH-596 - ParseSegments parse content even if it's not
252	CrawlDatum.STATUS_FETCH_SUCCESS (dogacan)
253
254	92. NUTCH-618 - Tika error "Media type alias already exists" (mattmann,kubes)
255
256	93. NUTCH-634 - Upgrade Nutch to Hadoop 0.17.1 (Michael Gottesman, Lincoln
257	Ritter, ab)
258
259	94. NUTCH-641 - IndexSorter incorrectly copies stored fields (ab)
260
261	95. NUTCH-645 - Parse-swf unit test failing (ab)
262
263	96. NUTCH-642 - Unit tests fail when run in non-local mode (ab)
264
265	97. NUTCH-639 - Change LuceneDocumentWrapper visibility from
266	private to _public_ (Guillaume Smet via dogacan)
267
268	98. NUTCH-651 - Remove bin/{start|stop}-balancer.sh from svn
269	tracking. (dogacan)
270
271	99. NUTCH-375 - Add support for Content-Encoding: deflate
272	(Pascal Beis, ab)
273
274	100. NUTCH-633 - ParseSegment no longer allows reparsing.
275	(dogacan)
276
277	101. NUTCH-653 - Upgrade to hadoop 0.18. (dogacan)
278
279	102. NUTCH-621 - Nutch needs to declare its crypto usage (mattmann)
280
281	103. NUTCH-654 - urlfilter-regex's main does not work.
282	(dogacan)
283
284	104. NUTCH-640 - confusing description "set it to Integer.MAX_VALUE".
285	(dogacan)
286
287	105. NUTCH-662 - Upgrade Nutch to use Lucene 2.4. (kubes)
288
289	106. NUTCH-663 - Upgrade Nutch to use Hadoop 0.19 (kubes)
290
291	107. NUTCH-647 - Resolve URLs tool (kubes)
292
293	108. NUTCH-665 - Search Load Testing Tool (kubes)
294
295	109. NUTCH-667 - Input Format for working with Content in Hadoop Streaming
296	(kubes)
297
298	110. NUTCH-635 - LinkAnalysis Tool for Nutch. (kubes)
299
300	111. NUTCH-646 - New Indexing Framework for Nutch. (kubes)
301
302	112. NUTCH-668 - Domain URL Filter. (kubes)
303
304	113. NUTCH-594 - Serve Nutch search results in multiple formats including
305	XML and JSON. (kubes)
306
307	114. NUTCH-442 - Integrate Solr/Nutch. (dogacan, original version by siren)
308
309	115. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
310	fetch interval correctly. (dogacan)
311
312	116. NUTCH-627 - Minimize host address lookup (Otis Gospodnetic)
313
314	117. NUTCH-678 - Hadoop 0.19 requires an update of jets3t.
315	(julien nioche via dogacan)
316
317	118. NUTCH-681 - parse-mp3 compilation problem.
318	(Wildan Maulana via dogacan)
319
320	119. NUTCH-676 - MapWritable is written inefficiently and confusingly.
321	(dogacan)
322
323	120. NUTCH-579 - Feed plugin only indexes one post per feed due to identical
324	digest. (dogacan)
325
326	121. NUTCH-571 - parse-mp3 plugin doesn't always index album of mp3.
327	(Joseph Chen, dogacan)
328
329	122. NUTCH-682 - SOLR indexer does not set boost on the document.
330	(julien nioche via dogacan)
331
332	123. NUTCH-279 - Additions to urlnormalizer-regex (Stefan Neufeind, ab)
333
334	124. NUTCH-671 - JSP errors in Nutch searcher webapp (Edwin Chu via ab)
335
336	125. NUTCH-643 - ClassCastException in PDF parser (Guillaume Smet, ab)
337
338	126. NUTCH-636 - Httpclient plugin https doesn't work on IBM JRE
339	(Curtis d'Entremont, ab)
340
341	127. NUTCH-683 - NUTCH-676 broke CrawlDbMerger. (dogacan)
342
343	128. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException
344	(Stefan Will, siren)
345
346	129. NUTCH-691 - Update jakarta poi jars to the most relevant version
347	(Dmitry Lihachev via siren)
348
349	130. NUTCH-563 - Include custom fields in BasicQueryFilter
350	(Julien Nioche via siren)
351
352	131. NUTCH-695 - Incorrect mime type detection by MoreIndexingFilter plugin
353	(Dmitry Lihachev via siren)
354
355	132. NUTCH-694 - Distributed Search Server fails (siren)
356
357	133. NUTCH-626 - Fetcher2 breaks out the domain with db.ignore.external.links
358	set at cross domain redirects (Remco Verhoef, dogacan via siren)
359
360	134. NUTCH-247 - Robot parser to restrict (kubes, siren)
361
362	135. NUTCH-698 - CrawlDb is corrupted after a few crawl cycles (dogacan
363	via siren)
364
365	136. NUTCH-699 - Add an "official" solr schema for solr integration (dogacan,
366	Dmitry Lihachev via siren)
367
368	137. NUTCH-703 - Upgrade to Hadoop 0.19.1 (ab)
369
370	138. NUTCH-419 - Unavailable robots.txt kills fetch (Carsten Lehmann,
371	Doug Cook via ab)
372
373	139. NUTCH-700 - Neko1.9.11 goes into a loop (Julien Nioche, siren)
374
375	140. NUTCH-669 - Consolidate code for Fetcher and Fetcher2 (siren)
376
377	141. NUTCH-711 - Indexer failing after upgrade to Hadoop 0.19.1 (ab)
378
379	142. NUTCH-684 - Dedup support for Solr. (dogacan)
380
381	143. NUTCH-715 - Subcollection plugin doesn't work with default
382	subcollections.xml file (Dmitry Lihachev via siren)
383
384	144. NUTCH-722 - Nutch contains JAI jars that we cannot redistribute
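
Note on entry 89 (NUTCH-620) above: the entry only says that runs of slashes should become a single slash. The sketch below shows one way to get that behaviour with a regular expression. It is not the BasicURLNormalizer code. The class and method names are hypothetical, and the assumption that only the path part of the URL is passed in (so the "//" after the scheme is never touched) is ours.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not taken from Nutch.
final class SlashCollapseSketch {
    // Matches two or more consecutive slashes.
    private static final Pattern SLASH_RUN = Pattern.compile("/{2,}");

    /** Replaces every run of slashes in a URL path with a single slash. */
    static String collapsePath(String path) {
        return SLASH_RUN.matcher(path).replaceAll("/");
    }

    public static void main(String[] args) {
        System.out.println(collapsePath("/a//b///c/")); // prints /a/b/c/
    }
}
```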
385
386	Release 0.9 - 2007-04-02
387
388	1. Changed log4j configuration to log to stdout on commandline
389	tools (siren)
390
391	2. NUTCH-344 - Fix for thread blocking issue (Greg Kim via siren)
392
393	3. NUTCH-260 - Update hadoop version to 0.5.0 (Renaud Richardet,
394	siren)
395
396	4. Optionally skip pages with abnormally large values of Crawl-Delay
397	(Dennis Kubes via ab)
398
399	5. Change readdb -stats to use CombiningCollector (ab)
400
401	6. NUTCH-348 - Fix Generator to select highest scoring pages (Chris
402	Schneider and Stefan Groschupf via ab)
403
404	7. NUTCH-347 - Adjust plugin build script not to emit warnings when copying
405	dependent jars (siren)
406
407	8. NUTCH-338 - Remove the text parser as an option for parsing PDF files
408	in parse-plugins.xml (Chris A. Mattmann via siren)
409
410	9. NUTCH-105 - Network error during robots.txt fetch causes file to
411	be ignored (Greg Kim via siren)
412
413	10. NUTCH-367 - DistributedSearch throws ClassCastException (siren)
414
415	11. NUTCH-332 - Fix the problem of doubling scores caused by links pointing
416	to the current page (e.g. anchors). (Stefan Groschupf via ab)
417
418	12. NUTCH-365 - Flexible URL normalization (ab)
419
420	13. NUTCH-336 - Differentiate between newly discovered pages and newly
421	injected pages (Chris Schneider via ab) NOTE: this changes the
422	scoring API, filter implementations need to be updated.
423
424	14. NUTCH-337 - Fetcher ignores the fetcher.parse value (Stefan Groschupf
425	via ab)
426
427	15. NUTCH-350 - Urls blocked by http.max.delays incorrectly marked as GONE
428	(Stefan Groschupf via ab)
429
430	16. NUTCH-374 - When http.content.limit is set to -1 and
431	Response.CONTENT_ENCODING is gzip or x-gzip, nothing can be fetched
432	(King Kong via pkosiorowski)
433
434	17. NUTCH-383 - upgrade to Hadoop 0.7.1 and Lucene 2.0.0. (ab)
435
436	**************************** WARNING !!! ******************************
437	* This upgrade breaks data format compatibility. A tool 'convertdb' *
438	* was added to migrate existing CrawlDb-s to the new format. Segment data *
439	* can be partially migrated using 'mergesegs', however segments will *
440	* require re-parsing (and consequently re-indexing). *
441	**************************** WARNING !!! ******************************
442
443	18. NUTCH-371 - DeleteDuplicates now correctly implements both parts of
444	the algorithm. (ab)
445
446	19. NUTCH-391 - ParseUtil logs file contents to log file when it cannot
447	find parser (siren)
448
449	20. NUTCH-379 - ParseUtil does not pass through the content's URL to the
450	ParserFactory (Chris A. Mattmann via siren)
451
452	21. NUTCH-361, NUTCH-136 - When jobtracker is 'local' generate only one
453	partition. (ab)
454
455	22. NUTCH-399 - Change CommandRunner to use concurrent api from jdk (siren)
456
457	23. NUTCH-395 - Increase fetching speed (siren)
458
459	24. NUTCH-388 - nutch-default.xml has outdated example for urlfilter.order
460	(reported by Jared Dunne)
461
462	25. NUTCH-404 - Fix LinkDB Usage - implementation mismatch (siren)
463
464	26. NUTCH-403 - Make URL filtering optional in Generator (siren)
465
466	27. NUTCH-405 - Content object is not properly initialized in map method
467	of ParseSegment (siren)
468
469	28. NUTCH-362 - Remove parse-text from unsupported filetypes in
470	parse-plugins.xml (siren)
471
472	29. NUTCH-305 - Update crawl and url filter lists to exclude
473	jpeg|JPEG|bmp|BMP, suffix-urlfilter.txt (contributed by Stefan
474	Neufeind) is also updated (siren)
475
476	30. NUTCH-406 - Metadata tries to write null values (mattmann)
477
478	31. NUTCH-415 - Generator should mark selected records in CrawlDb.
479	Due to increased resource consumption this step is optional.
480	Application-level locking has been added to prevent concurrent
481	modification of databases. (ab)
482
483	32. NUTCH-416 - CrawlDatum status and CrawlDbReducer refactoring. It is
484	now possible to correctly update CrawlDb from multiple segments.
485	Introduce new status codes for temporary and permanent
486	redirection. (ab)
487
488	33. NUTCH-322 - Fix Fetcher to store redirected pages and to store
489	protocol-level status. This also should fix NUTCH-273. (ab)
490
491	34. Change default Fetcher behavior not to follow redirects immediately.
492	Instead Fetcher will record redirects as new pages to be added to CrawlDb.
493	This also partially addresses NUTCH-273. (ab)
494
495	35. Detect and report when Generator creates 0-sized segments. (ab)
496
497	36. Fix Injector to preserve already existing CrawlDatum if the seed list
498	being injected also contains such URL. (ab)
499
500	37. NUTCH-425, NUTCH-426 - Fix anchors pollution. Continue after
501	skipping bad URLs. (Michael Stack via ab)
502
503	38. NUTCH-325 - UrlFilters.java throws NPE in case urlfilter.order contains
504	Filters that are not in plugin.includes (Stefan Groschupf, siren)
505
506	39. NUTCH-421 - Allow predetermined running order of indexing filters
507	(Alan Tanaman, siren)
508
509	40. When indexing pages with redirection, drop all intermediate pages and
510	index only the final page. (ab)
511
512	41. Upgrade to Hadoop 0.10.1. (ab)
513
514	42. NUTCH-420 - Fix a bug in DeleteDuplicates where results depended on the
515	order in which IndexDoc-s are processed. (Dogacan Guney via ab)
516
517	43. NUTCH-428 - NullPointerException thrown when agent name is not
518	configured properly. Changed to throw RuntimeException instead.
519	(siren)
520
521	44. NUTCH-430 - Integer overflow in HashComparator.compare (siren)
522
523	45. NUTCH-68 - Add a tool to generate arbitrary fetchlists. (ab)
524
525	46. NUTCH-433 - java.io.EOFException in newer nightlies in mergesegs
526	or indexing from hadoop.io.DataOutputBuffer (siren)
527
528	47. NUTCH-339 - Fetcher2: a queue-based fetcher implementation. (ab)
529
530	48. NUTCH-390 - Javadoc warnings (mattmann)
531
532	49. NUTCH-449 - Make junit output format configurable. (nigel via cutting)
533
534	50. NUTCH-432 - Fix a bug where platform name with spaces would break the
535	bin/nutch script. (Brian Whitman via ab)
536
537	51. Upgrade to Hadoop 0.11.2 and Lucene 2.1.0 release. (ab)
538
539	52. NUTCH-167 - Observation of robots "noarchive" directive. (ab)
540
541	53. NUTCH-384 - Protocol-file plugin does not allow the parse plugins
542	framework to operate properly (Heiko Dietze via mattmann)
543
544	54. NUTCH-233 - Wrong regular expression hangs reduce process forever (Stefan
545	Groschupf via kubes)
546
547	55. NUTCH-436 - Incorrect handling of relative paths when the embedded URL
548	path is empty (kubes)
549
550	56. Upgrade to Hadoop 0.12.1 release. (ab)
551
552	57. NUTCH-246 - Incorrect segment size being generated due to time
553	synchronization issue (Stefan Groschupf via ab)
554
555	58. Upgrade to Hadoop 0.12.2 release. (ab)
556
557	59. NUTCH-333 - SegmentMerger and SegmentReader should use NutchJob. (Michael
558	Stack and Dogacan Guney via kubes)
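
Note on entry 44 (NUTCH-430) above: the entry does not say how the overflow in HashComparator.compare arose. A common cause of this kind of bug is comparing two ints by subtraction, and the sketch below illustrates that general pattern only. Whether it matches the original Nutch code is an assumption, and all names are hypothetical.

```java
import java.util.Comparator;

// Hypothetical illustration of a subtraction-based comparator overflowing.
final class CompareOverflowSketch {
    // Compares by subtraction: the difference can wrap around for distant values.
    static final Comparator<Integer> BY_SUBTRACTION = (a, b) -> a - b;

    // Overflow-safe comparison provided by the JDK.
    static final Comparator<Integer> SAFE = Integer::compare;

    public static void main(String[] args) {
        int big = Integer.MAX_VALUE;
        int small = -1;
        // big - small wraps to Integer.MIN_VALUE, so the sign is wrong.
        System.out.println(BY_SUBTRACTION.compare(big, small) > 0); // prints false
        System.out.println(SAFE.compare(big, small) > 0);           // prints true
    }
}
```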
559
560	Release 0.8 - 2006-07-25
561
562	0. Totally new architecture, based on hadoop
563	[http://lucene.apache.org/hadoop] (cutting)
564
565	1. NUTCH-107 - Typo in plugin/urlfilter-*/plugin.xml. (Stephen Cross).
566
567	2. NUTCH-108 - Log hosts that exceed generate.max.per.host.
568	(Rod Taylor via cutting)
569
570	3. NUTCH-88 - Enhance ParserFactory plugin selection policy
571	(jerome)
572
573	4. NUTCH-124 - Protocol-httpclient does not follow redirects when
574	fetching robots.txt (cutting)
575
576	5. NUTCH-130 - Be explicit about target JVM when building (1.4.x?)
577	(stack@archive.org, cutting)
578
579	6. NUTCH-114 - Getting number of urls and links from crawldb
580	(Stefan Groschupf via ab)
581
582	7. NUTCH-112 - Link in cached.jsp page to cached content is an
583	absolute link (Chris A. Mattmann via jerome)
584
585	8. NUTCH-135 - Http header meta data are case insensitive in the
586	real world (Stefan Groschupf via jerome)
587
588	9. NUTCH-145 - Build of war file fails on Chinese (zh) .xml files due
589	to UTF-8 BOM (KuroSaka TeruHiko via siren)
590
591	10. NUTCH-121 - SegmentReader for mapred (Rod Taylor via ab)
592
593	11. Added support for OpenSearch (cutting)
594
595	12. NUTCH-142 - NutchConf should use the thread context classloader
596	(Mike Cannon-Brookes via pkosiorowski)
597
598	13. NUTCH-160 - Use standard Java Regex library rather than
599	org.apache.oro.text.regex (Rod Taylor via cutting)
600
601	14. NUTCH-151 - CommandRunner can hang after the main thread exec is
602	finished and has inefficient busy loop (Paul Baclace via cutting)
603
604	15. NUTCH-174 - Problem encountered with ant during compilation
605
606	16. NUTCH-190 - ParseUtil drops reason for failed parse
607	(stack@archive.org via ab)
608
609	17. NUTCH-169 - Remove static NutchConf (Marko Bauhardt via ab)
610
611	18. NUTCH-194 - Nutch-169 introduced two tiny bugs (Marko Bauhardt via ab)
612
613	19. NUTCH-178 - Session creation in search.jsp must be "false"
614	(YourSoft via siren)
615
616	20. NUTCH-200 - OpenSearch Servlet is broken
617	(Marko Bauhardt via siren)
618
619	21. NUTCH-81 - Webapp only works when deployed in root
620	(AJ Banck, Michael Nebel via siren)
621
622	22. NUTCH-139 - Standard metadata property names in the ParseData
623	metadata (Chris A. Mattmann, jerome)
624
625	23. NUTCH-192 - Meta data support for CrawlDatum
626	(Stefan Groschupf via ab)
627
628	24. NUTCH-52 - Parser plugin for MS Excel files
629	(Rohit Kulkarni via jerome)
630
631	25. NUTCH-53 - Parser plugin for Zip files
632	(Rohit Kulkarni via jerome)
633
634	26. NUTCH-137 - footer is not displayed in search result page
635	(KuroSaka TeruHiko via siren)
636
637	27. NUTCH-118 - FAQ link points to invalid URL
638	(Steve Betts via siren)
639
640	28. NUTCH-184 - Serbian (sr, Cyrillic) and Serbo-Croatian (sh, Latin)
641	translation (Ivan Sekulovic via siren)
642
643	29. NUTCH-211 - FetchedSegments leave readers open (Stefan Groschupf
644	via cutting)
645
646	30. NUTCH-140 - Add alias capability in parse-plugins.xml file that
647	allows mimeType->extensionId mapping (Chris A. Mattmann via jerome)
648
649	31. NUTCH-214 - Added Links to web site to search mailing list
650	(Jake Vanderdray via jerome)
651
652	32. NUTCH-204 - Multiple field values in HitDetails
653	(Stefan Groschupf via jerome)
654
655	33. NUTCH-219 - file.content.limit & ftp.content.limit should be changed
656	to -1 to be consistent with http (jerome)
657
658	34. NUTCH-221 - Prepare nutch for upcoming lucene 2.0 (siren)
659
660	35. NUTCH-91 - Empty encoding causes exception (Michael Nebel via
661	pkosiorowski)
662
663	36. NUTCH-228 - Clustering plugin descriptor broken (Dawid Weiss via
664	jerome)
665
666	37. NUTCH-229 - Improved handling of plugin folder configuration
667	(Stefan Groschupf via ab)
668
669	38. NUTCH-206 - Search server throws InstantiationException (ab)
670
671	39. NUTCH-203 - ParseSegment throws InstantiationException (Marko Bauhardt
672	via ab)
673
674	40. NUTCH-3 - Multi values of header discarded (Stefan Groschupf via ab)
675
676	41. Update to lucene 1.9.1 (cutting)
677
678	42. NUTCH-235 - Duplicate Inlink values (ab)
679
680	43. NUTCH-234 - Clustering extension code cleanups and a real
681	JUnit test case for the current implementation (Dawid Weiss via ab)
682
683	44. NUTCH-210 - Context.xml file for Nutch web application
684	(Chris A. Mattmann via jerome)
685
686	45. NUTCH-231 - Invalid CSS entries (AJ Banck via jerome)
687
688	46. NUTCH-232 - Search.jsp has multiple search forms creating
689	invalid html / incorrect focus function (jerome)
690
691	47. NUTCH-196 - lib-xml and lib-log4j plugins (ab, jerome)
692
693	48. NUTCH-244 - Inconsistent handling of property values
694	boundaries / unable to set db.max.outlinks.per.page to
695	infinite (jerome)
696
697	49. NUTCH-245 - DTD for plugin.xml configuration files
698	(Chris A. Mattmann via jerome)
699
700	50. NUTCH-250 - Generate to log truncation caused by
701	generate.max.per.host (Rod Taylor via cutting)
702
703	51. NUTCH-125 - OpenOffice Parser plugin (ab)
704
705	52. Switch from using java.io.File to org.apache.hadoop.fs.Path.
706	(cutting)
707
708	53. NUTCH-240 - Scoring API: extension point, scoring filters and
709	an OPIC plugin (ab)
710
711	54. NUTCH-134 - Summarizer doesn't select the best snippets (jerome)
712
713	55. NUTCH-268 - Generator and lib-http use different definitions of
714	"unique host" (ab)
715
716	56. NUTCH-280 - Url query causes NullPointerException (Grant Glouser
717	via siren)
718
719	57. NUTCH-285 - LinkDb Fails rename doesn't create parent directories
720	(Dennis Kubes via ab)
721
722	58. NUTCH-201 - Add support for subcollections
723	(siren)
724
725	59. NUTCH-298 - If a 404 for a robots.txt is returned a NPE is thrown
726	(Stefan Groschupf via jerome)
727
728	60. NUTCH-275 - Fetcher not parsing XHTML-pages at all (jerome)
729
730	61. NUTCH-301 - CommonGrams loads analysis.common.terms.file for each query
731	(Stefan Groschupf via jerome)
732
733	62. NUTCH-110 - OpenSearchServlet outputs illegal xml characters
734	(stack@archive.org via siren)
735
736	63. NUTCH-292 - OpenSearchServlet: OutOfMemoryError: Java heap space
737	(Stefan Neufeind via siren)
738
739	64. NUTCH-307 - Wrong configured log4j.properties (jerome)
740
741	65. NUTCH-303 - Logging improvements (jerome)
742
743	66. NUTCH-308 - Maximum search time limit (ab)
744
745	67. NUTCH-306 - DistributedSearch.Client liveAddresses concurrency
746	problem (Grant Glouser via siren)
747
748	68. Update to hadoop-0.4 (Milind Bhandarkar, cutting)
749
750	69. NUTCH-317 - Clarify what the queryLanguage argument of
751	Query.parse(...) means (jerome)
752
753	70. Added alternative experimental web gui in contrib containing
754	extensions like subcollection, keymatch, user preferences,
755	caching, implemented mainly using tiles and jstl (siren)
756
757	71. NUTCH-320 - DmozParser does not output list of urls to stdout
758	but to a log file instead. Original functionality restored.
759
760	72. NUTCH-271 - Add ability to limit crawling to the set of initially
761	injected hosts (db.ignore.external.links) (Philippe Eugene,
762	Stefan Neufeind via ab)
763
764	73. NUTCH-293 - Support for Crawl-Delay (Stefan Groschupf via ab)
765
766	74. NUTCH-327 - Fixed logging directory on cygwin (siren)
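
Note on entry 8 (NUTCH-135) above: the entry observes that HTTP header names are case insensitive in practice. The sketch below shows one simple way to look headers up without regard to case, using a TreeMap ordered by String.CASE_INSENSITIVE_ORDER. It is not the Nutch metadata implementation, and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not the Nutch metadata classes.
final class HeaderLookupSketch {
    /** Creates a map whose keys are compared without regard to case. */
    static Map<String, String> newHeaderMap() {
        return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    }

    public static void main(String[] args) {
        Map<String, String> headers = newHeaderMap();
        headers.put("Content-Type", "text/html");
        // The lookup succeeds even though the case differs from the stored key.
        System.out.println(headers.get("content-type")); // prints text/html
    }
}
```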
767
768	Release 0.7 - 2005-08-17
769
770	1. Added support for "type:" in queries. Search results are limited/qualified
771	by mimetype or its primary type or sub type. For example,
772	(1) searching with "type:application/pdf" restricts results
773	to pages which were identified to be of mimetype "application/pdf".
774	(2) with "type:application", nutch will return pages of
775	primary type "application".
776	(3) with "type:pdf", only pages of sub type "pdf" will be listed.
777	(John Xing, 20050120)
778
779	2. Added support for "date:" in queries. Last-Modified is indexed.
780	Search results are restricted by lower and upper date (inclusive)
781	as date:yyyymmdd-yyyymmdd. For example, date:20040101-20041231
782	only returns pages with Last-Modified in year 2004.
783	(John Xing, 20050122)
784
785	3. Add URLFilter plugin interface and convert existing url filters into
786	plugins. (John Xing, 20050206)
787
788	4. Add UpdateSegmentsFromDb tool, which updates the scores and
789	anchors of existing segments with the current values in the web
790	db. This is used by CrawlTool, so that pages are now only fetched
791	once per crawl. (Doug Cutting, 20050221)
792
793	5. Moved code into org.apache.nutch sub-packages. Changed license to
794	Apache 2.0. Removed jar files whose licenses do not permit
795	redistribution by Apache. Disabled compilation of plugins which
796	require these libraries. (Doug Cutting 20050301)
797
798	6. Index host and title in separate fields. Host was indexed
799	previously only as a part of the URL. Title was indexed as an
800	anchor. Now boosts for matching these fields may be adjusted
801	separately from boosts for matching anchors and url. Also: move
802	site indexing to index-basic plugin to minimize the number of
803	times the URL needs to be parsed; and, stop using anchor analyzer
804	for anything but anchors. (Piotr Kosiorowski via Doug Cutting
805	20050323)
806
807	7. Add servlet Cached.java that serves cached Content of any mime type.
808	web.xml and cached.jsp are slightly modified.
809	(John Xing, 20050401)
810
811	8. Add skipCompressedByteArray() to WritableUtils.java.
812	(John Xing, 20050402)
813
814	9. Fixes to jsp and static web pages. These now use relative links,
815	so that the Nutch webapp file can be used in places other than at
816	the root. Also fixed links to the about and help pages. Bug #32.
817	(Jerome Charron via cutting, 20050404)
818
819	10. Added some features to DistributedSearch: new segments can be added
820	to searchservers without restarting the frontend, defective search
821	servers are not queried until they come back online, watchdog keeps
822	an eye for your searchservers and writes simple statistics.
823	(Sami Siren, 20050407)
824
825	11. Fix for bug #4 - Unbalanced quote in query eats all resources.
826	(Piotr Kosiorowski, Sami Siren, 20050407)
827
828	12. Close Issue #33 - MIME content type detector (using magic char sequences).
829	(Jerome Charron and Hari Kodungallur via John Xing, 20050416)
830
831	13. Add a servlet that implements A9's OpenSearch RSS web service.
832	(cutting, 20050418)
833
834	14. Remove references to link analysis from tutorial, and enable
835	scoring by link count when generating fetchlists and searching.
836	(cutting, 20040419)
837
838	15. Make query boosts for host, title, anchor and phrase matches
839	configurable. (Piotr Kosiorowski via cutting, 20050419)
840
841	16. Add support for sorting search results and search-time deduping by
842	fields other than site.
843
844	17. Automatically convert range queries into cached range filters.
845	This improves the performance and scalability of, e.g., date range
846	searching.
847
848	18. Several methods have been renamed due to misspellings. The old
849	methods have been deprecated and will be removed before the 1.0
850	release.
851
852
853	Release 0.6
854
855	1. Added clustering-carrot2 plugin, together with introduction of clustering
856	api and modification to search jsp. (Dawid Weiss via John Xing, 20040809)
857
858	2. Make a number of changes to NDFS (Nutch Distributed File System)
859	to fix bugs, add admin tools, etc.
860
861	Also, modify all command line tools so you can indicate whether to
862	use NDFS or the local filesystem. If you indicate nothing, then
863	it defaults to the local fs.
864
865	I've used this to do a 35m page crawl via NDFS, distributed over a
866	dozen machines. (Mike Cafarella)
867
868	3. Add support for BASE tags in HTML. Outlinks are now correctly
869	extracted when a BASE tag is present. (cutting)
870
871	4. Fix two bugs in result pagination. When the last hit on a page
872	was the last hit overall, the "next" button was sometimes shown
873	when the "show all" button should be shown instead. Also, in
874	certain cases, the "show all" button would be shown when the
875	"next" button should have been shown. (cutting)
876
877	5. Add config parameter "indexer.max.tokens" that determines the
878	maximum number of tokens indexed per field. (Andy Hedges via cutting)
879
880	6. Add parser for mp3 files. (Andy Hedges via cutting)
881
882	7. Add RegexUrlNormalizer. This is useful for things like stripping
883	out session IDs from URLs. To use it, add values for
884	urlnormalizer.class and urlnormalizer.regex.file to your
885	nutch-site.xml. The RegexUrlNormalizer class extends the
886	BasicUrlNormalizer, and does basic normalization as well.
887	(Luke Baker via cutting)
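
As a rough illustration of the session-ID stripping mentioned in item 7,
the sketch below applies an ordered list of regex substitutions to a URL.
It is written in Python rather than Java, and the rules and the
normalize() helper are hypothetical: they do not reproduce the format of
the file named by urlnormalizer.regex.file or the class's actual behavior.

```python
import re

# Hypothetical (pattern, replacement) rules, applied in order.
# These are illustrative and are not taken from any Nutch rule file.
RULES = [
    # Drop a ";jsessionid=..." path parameter.
    (re.compile(r";jsessionid=[^?#]*", re.IGNORECASE), ""),
    # Drop a "PHPSESSID=..." query parameter that precedes another parameter.
    (re.compile(r"([?&])PHPSESSID=[^&#]*&"), r"\1"),
]


def normalize(url):
    """Apply every rule to the URL and return the rewritten URL."""
    for pattern, replacement in RULES:
        url = pattern.sub(replacement, url)
    return url
```

URLs that match no rule are returned unchanged.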
888
889	8. Added Swedish translation (Stefan Verzel via Sami Siren, 20040910)
890
891	9. Added Polish translation (Andrzej Bialecki, 20040911)
892
893	10. Added 3 more language profiles to language identifier (ru,hu,pl).
894	Other changes to language identifier: Profiles converted to utf8,
895	added some test cases, changed the similarity calculation.
896	(Sami Siren, 20040925)
897
898	11. Added plugin parse-rtf (Andy Hedges via John Xing, 20040929)
899
900	12. Added plugin index-more and more.jsp (John Xing, 20041003)
901
902	13. Added "View as Plain Text" feature. A new op OP_PARSETEXT is introduced
903	in DistributedSearch.java. text.jsp is added. (John Xing, 20041006)
904
905	14. Fixed a bug that caused cached.jsp, explain.jsp, anchors.jsp and text.jsp
906	(but not search.jsp) to fail with NullPointerException in distributed search.
907	This bug seems to have appeared after the "hits per site" code was added.
908	The fix is in Hit.java, which now makes sure String site is never null.
909	Hopefully this fix has no bad effect on the "hits per site" code.
910	(John Xing, 20041006)
911
912	15. Fixed a bug that caused fullyDelete() in FileUtil.java to fail for
913	LocalFileSystem.java. This bug also exposes possible incompleteness
914	of NDFSFile.java, where a few methods are not supported, including
915	delete(). Nothing was changed in NDFSFile.java, though; that is left for
916	future improvement. (John Xing, 20041022)
917
918	16. Introduced option -noParsing to Fetcher.java and added ParseSegment.java.
919	A new status code CANT_PARSE is added to FetcherOutput.java.
920	Without option -noParsing, fetcher behavior is unchanged. With
921	option -noParsing, the fetcher only crawls and no parsing is carried out.
922	ParseSegment.java should then be used to parse in a separate pass.
923	(John Xing, 20041025)
924
925	17. Added ontology plugin. Currently it is used for query refinement, as
926	exemplified in refine-query-init.jsp and refine-query.jsp. By default,
927	query refinement is disabled in search.jsp. Please check
928	./src/plugin/ontology/README.txt for further description.
929	Ontology plugin certainly can be used for many other things.
930	(Michael J. Pan via John Xing, 20041129)
931
932	18. Changed fetcher.server.delay to be a float, so that sub-second
933	delays can be specified. (cutting)
934
935	19. Added plugin.includes config parameter that determines which
936	plugins are included. By default now only http, html and basic
937	indexing and search plugins are enabled, rather than all plugins.
938	This should make default performance more predictable and reliable
939	going forward. (cutting)
940
941	20. Cleaned up some filesystem code, including:
942
943	- Replaced BufferedRandomAccessFile with two simpler utilities,
944	NFSDataInputStream and NFSDataOutputStream.
945
946	- Fixed the bug where SequenceFiles were no longer flushed when
947	created, so that, when fetches crashed, segments were
948	unreadable. Now segments are always readable after crashes.
949	Only the contents of the last buffer are lost.
950
951	- Simplified the FSOutputStream API to not include seek(). We
952	should never need that functionality.
953
954	- Simplified LocalFileSystem's implementations of FSInputStream
955	and FSOutputStream and optimized FSInputStream.seek().
956
957	(cutting)
958
959	21. Fixed BasicUrlNormalizer to better handle relative urls. The file
960	part of a URL is normalized in the following manner:
961
962	1. "/aa/../" will be replaced by "/". This is done step by step until
963	the url doesn't change anymore. This ensures that
964	"/aa/bb/../../" will be replaced by "/", too.
965
966	2. leading "/../" will be replaced by "/"
967
968	(Sven Wende via cutting)
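
The two rules in item 21 can be sketched as follows. This is a minimal
Python illustration, not the BasicUrlNormalizer source: the function name
normalize_path() and both regular expressions are assumptions, and
stripping a leading "/../" repeatedly (rather than once) is also an
assumption.

```python
import re

# One "/segment/../" pair, where the segment is not itself "..".
PARENT_PAIR = re.compile(r"/(?!\.\./)[^/]+/\.\./")
# A "/../" at the very start of the path.
LEADING_PARENT = re.compile(r"^/\.\./")


def normalize_path(path):
    """Collapse "/aa/../" pairs step by step, then drop leading "/../"."""
    # Rule 1: replace one pair at a time until the path stops changing.
    while True:
        collapsed = PARENT_PAIR.sub("/", path, count=1)
        if collapsed == path:
            break
        path = collapsed
    # Rule 2: a leading "/../" becomes "/" (repeated here by assumption).
    while LEADING_PARENT.match(path):
        path = LEADING_PARENT.sub("/", path, count=1)
    return path
```

Collapsing one pair per pass mirrors the "step by step" wording, which is
what lets "/aa/bb/../../" reduce first to "/aa/../" and then to "/".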
969
970	22. Fix Page constructors so that next fetch date is less likely to be
971	misconstrued as a float. This patches a problem in WebDBInjector,
972	where new pages were added to the db with nextScore set to the
973	intended nextFetch date. This, in turn, confused link analysis.
974
975	23. In ndfs code, replace addLocalFile(), putToLocalFile() with
976	copyFromLocalFile(), moveFromLocalFile(), copyToLocalFile() and
977	moveToLocalFile(). (John Xing, 20041217)
978
979	24. Added new config parameter fetcher.threads.per.host. This is used
980	by the Http protocol. When this is one, behavior is as before.
981	When this is greater than one then multiple threads are permitted
982	to access a host at once. Note that fetcher.server.delay is no
983	longer consistently observed when this is greater than one.
984	(Luke Baker via Doug Cutting)
985
986	Release 0.5
987
988	1. Changed plugin directory to be a list of directories.
989
990	2. Permit Plugin to be the default plugin implementation.
991
992	3. Added pluggable interface for network protocols in new package
993	net.nutch.protocol. Moved http code from core into a plugin.
994
995	4. Added pluggable interface for content parsing in new package
996	net.nutch.parse. Moved html parsing code from core into a
997	plugin.
998
999	5. Fixed a bug in NutchAnalysis where 16-bit characters were not
1000	processed correctly.
1001
1002	6. Fixed bug #971731: random summaries on result page.
1003	(Daniel Naber via cutting)
1004
1005	7. Made Nutch logo transparent. (Daniel Naber via cutting)
1006
1007	8. Added file protocol plugin. (John Xing via cutting)
1008
1009	9. Added ftp protocol plugin. (John Xing via cutting)
1010
1011	10. Added pdf and msword parser plugins. (John Xing via cutting)
1012
1013	11. Added pluggable indexing interface. By default, url, content,
1014	anchors and title are indexed, as before, but now one can easily
1015	alter this to, e.g., index metadata. A demonstration is provided
1016	which extracts and indexes Creative Commons license urls. (cutting)
1017
1018	12. Add language identification plugin.
1019
1020	The process of identification is as follows:
1021
1022	1. html (html only, HTML 4.0 "lang" attribute)
1023	2. meta tags (html only, http-equiv, dc.language)
1024	3. http header (Content-Language)
1025	4. if all above fail "statistical analysis"
1026
1027	1 & 2 are run during the fetching phase and 3 & 4 are run during
1028	the indexing phase.
1029
1030	Currently supported languages (in "statistical analysis") are
1031	da,de,el,en,es,fi,fr,it,nl,sv and pt. The corpus used was grabbed
1032	from http://www.isi.edu/~koehn/europarl/ and the profiles were
1033	built with the tool supplied in the patch.
1034
1035	After indexing, the language can be found in the field named "lang".
1036
1037	It's not 100% accurate but it's a start.
1038	(Sami Siren)
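
The fallback order described in item 12 can be sketched as below. This is
a simplified Python illustration, not the plugin's code: the function and
parameter names are hypothetical, the statistical step is passed in as a
callable rather than implemented, and the split between the fetching and
indexing phases is collapsed into a single call.

```python
def identify_language(html_lang=None, meta_lang=None, header_lang=None,
                      statistical=None, text=""):
    """Return the first language found, checking sources in priority order."""
    # Steps 1-3: explicit declarations, in the order listed in item 12.
    for candidate in (html_lang, meta_lang, header_lang):
        if candidate:
            return candidate.strip().lower()
    # Step 4: fall back to statistical analysis if a classifier is supplied.
    if statistical is not None:
        return statistical(text)
    return None
```

Because the statistical step runs only when the first three sources are
empty, a page that declares its language never reaches it.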
1039
1040	13. Added SegmentMergeTool and "mergesegs" command, to remove
1041	duplicated or otherwise not used content from several segments and
1042	joining them together into a single new segment. The tool also
1043	optionally performs several other steps required for proper
1044	operation of Nutch - such as indexing segments, deleting
1045	duplicates, merging indices, and indexing the new single segment.
1046	(Andrzej Bialecki)
1047
1048	14. Add the ability to retrieve ParseData of a search hit. ParseData
1049	contains many valuable properties of a search hit.
1050
1051	This is required (among others) to properly display the cached
1052	content because it's not possible to determine the character
1053	encoding from the output of the getContent() method (which returns
1054	byte[]). The symptoms are that for HTML pages using non-latin1 or
1055	non-UTF8 encodings the cached preview will almost certainly look
1056	broken. Using the attached patch it is possible to determine the
1057	character encoding from the ParseData (for HTTP: Content-Type
1058	metadata), and encode the content accordingly. (Andrzej Bialecki)
1059
1060	15. Add a pluggable query interface. By default, the content, anchor
1061	and url fields are searched as before. A sample plugin indexes
1062	the host name and adds a "site:" keyword to query parsing.
1063
1064	16. Added support for "lang:" in queries. For example, searching with
1065	"lang:en" restricts results to pages which were identified to
1066	be in English.
1067
1068	17. Automatically optimize field queries to use cached Lucene filters.
1069	This makes, for example, searches restricted by languages or sites
1070	that are very common much faster.
1071
1072	18. Improved charset handling in jsp pages. (jshin by cutting)
1073
1074	19. Permit topic filtering when injecting DMOZ pages. (jshin by cutting)
1075
1076	20. When parsing crawled pages, interpret charset specifications in
1077	html meta tags. (jshin by cutting)
1078
1079	21. Added support for "cc:licensed" in queries, which searches for documents
1080	released under Creative Commons licenses. Attributes of the
1081	license may also be queried, with, e.g., "cc:by" for
1082	attribution-required licenses, "cc:nc" for non-commercial
1083	licenses, etc.
1084
1085	22. Relative paths named in plugin.folders are now searched for on the
1086	classpath. This makes, e.g., deployment in a war file much simpler.
1087
1088	23. Modifications to Fetcher.java.
1089
1090	1. Make sure it works properly with regard to creation and initialization
1091	of plugin instances. The problem was that multiple threads raced to
1092	startUp() or shutDown() plugin instances. It was solved by synchronizing
1093	certain code in PluginRepository.java and Extension.java.
1094	(Stefan Groschupf via John Xing)
1095
1096	2. Added code to explicitly shutDown() plugins. Otherwise FetcherThreads
1097	may never return (quit) if there are still data or other structures
1098	(e.g., persistent socket connections) associated with plugins. (John Xing)
1099
1100	3. Fixed one type of Fetcher "hang" problem by monitoring named
1101	FetcherThreads. If all FetcherThreads are gone (finished),
1102	Fetcher.java is considered done. The problem was: there could be
1103	runaway threads started by external libs via FetcherThreads.
1104	Those threads never return, thus keep Fetcher from exiting normally.
1105	(John Xing)
1106
1107	24. Eliminate excessive hits from sites. This is done efficiently by
1108	adding the site name to Hit instances, and, when needed,
1109	re-querying with too-frequent sites prohibited in the query.
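
One way to picture the approach in item 24 is the Python sketch below. It
is an illustration under stated assumptions, not the Nutch implementation:
the search(query, excluded_sites, n) callable, the (site, url) hit tuples
and the function name are all hypothetical, and hits from the second query
are simply appended rather than merged by score.

```python
from collections import Counter


def search_with_site_limit(search, query, num_hits, max_per_site):
    """Return up to num_hits hits, with at most max_per_site per site."""
    kept, counts, too_frequent = [], Counter(), set()
    # First pass: keep hits until a site reaches its limit.
    for site, url in search(query, frozenset(), num_hits):
        if counts[site] < max_per_site:
            counts[site] += 1
            kept.append((site, url))
        else:
            too_frequent.add(site)
    # Only when needed: re-query with the too-frequent sites prohibited.
    if too_frequent and len(kept) < num_hits:
        seen = {url for _, url in kept}
        for site, url in search(query, frozenset(too_frequent), num_hits):
            if len(kept) == num_hits:
                break
            if url not in seen and counts[site] < max_per_site:
                counts[site] += 1
                seen.add(url)
                kept.append((site, url))
    return kept
```

The second query runs only if the first pass dropped hits, so the common
case of well-mixed results costs a single search.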
1110
1111
1112	Release 0.4
1113
1114	1. Http class refactored. (Kevin Smith via Tom Pierce)
1115
1116	2. Add Finnish translation. (Sampo Syreeni via Doug Cutting)
1117
1118	3. Added Japanese translation. (Yukio Andoh via Doug Cutting)
1119
1120	4. Updated Dutch translation. (Ype Kingma via Doug Cutting)
1121
1122	5. Initial version of Distributed DB code. (Mike Cafarella)
1123
1124	6. Make things more tolerant of crashed fetcher output files.
1125	(Doug Cutting)
1126
1127	7. New skin for website. (Frank Henze via Doug Cutting)
1128
1129	8. Added Spanish translation. (Diego Basch via Doug Cutting)
1130
1131	9. Add FTP support to fetcher. (John Xing via Doug Cutting)
1132
1133	10. Added Thai translation. (Pichai Ongvasith via Doug Cutting)
1134
1135	11. Added Robots.txt & throttling support to Fetcher.java. (Mike
1136	Cafarella)
1137
1138	12. Added nightly build. (Doug Cutting)
1139
1140	13. Default all link scores to 1.0. (Doug Cutting)
1141
1142	14. Permit one to keep internal links. (Doug Cutting)
1143
1144	15. Fixed dedup to select shortest URL. (Doug Cutting)
1145
1146	16. Changed index merger so that merged index is written to named
1147	directory, rather than to a generated name in that directory.
1148	(Doug Cutting)
1149
1150	17. Disable coordination weighting of query clauses and other minor
1151	scoring improvements. (Doug Cutting)
1152
1153	18. Added a new command, crawl, that constructs a database, injects a
1154	url file and performs a few rounds of generate/fetch/updatedb.
1155	This simplifies use for intranet sites. Changed some defaults to
1156	be more intranet friendly. (Doug Cutting)
1157
1158	19. Fixed a bug where Fetcher.java didn't construct correct relative
1159	links when a page was redirected. (Doug Cutting)
1160
1161	20. Fixed a query parser problem with lookahead over plusses and minuses.
1162	(Doug Cutting)
1163
1164	21. Add support for HTTP proxy servers. (Sami Siren via Doug Cutting)
1165
1166	22. Permit searching while fetching and/or indexing.
1167	(Sami Siren via Doug Cutting)
1168
1169	23. Fix a bug when throttling is disabled. (Sami Siren via Doug Cutting)
1170
1171	24. Updated Bahasa Malaysia translation. (Michael Lim via Doug Cutting)
1172
1173	25. Added Catalan translation. (Xavier Guardiola via Doug Cutting)
1174
1175	26. Added Brazilian Portuguese translation.
1176	(A. Moreir via Doug Cutting)
1177
1178	27. Added a French translation. (Julien Nioche via Doug Cutting)
1179
1180	28. Updated to Lucene 1.4RC3. (Doug Cutting)
1181
1182	29. Add capability to boost by link count & use it in crawl tool.
1183	(Doug Cutting)
1184
1185	30. Added plugin system. (Stefan Groschupf via Doug Cutting)
1186
1187	31. Add this change log file, for recording significant changes to
1188	Nutch. Populate it with changes from the last few months.
