Changeset 73

Timestamp: Jun 2, 2009, 5:39:43 PM (16 years ago)
Author: waue
Message: the version can make deb
Location: nutchez-0.1
Files: 4 edited

CHANGES.txt (modified) (1 diff)
debian/control (modified) (1 diff)
debian/files (modified) (1 diff)
debian/nutchez.install (modified) (1 diff)

nutchez-0.1/CHANGES.txt

--- nutchez-0.1/CHANGES.txt (r66)
+++ nutchez-0.1/CHANGES.txt (r73)
 Nutch Change Log
+nutchez Change Log
-Release 1.0 - 2009-03-23
-. NUTCH-474 - Fetcher2 crawlDelay and blocking fix (Dogacan Guney via ab)
-. NUTCH-443 - Allow parsers to return multiple Parse objects.
-    (Dogacan Guney et al, via ab)
-. NUTCH-393 - Indexer should handle null documents returned by filters.
-    (Eelco Lempsink via ab)
-. NUTCH-456 - Parse msexcel plugin speedup (Heiko Dietze via siren)
-. NUTCH-446 - RobotRulesParser should ignore Crawl-delay values of other
-    bots in robots.txt (Dogacan Guney via siren)
-. NUTCH-482 - Remove redundant plugin lib-log4j (siren)
-. NUTCH-483 - Remove redundant commons-logging jar from ontology plugin
-    (siren)
-. NUTCH-161 - Change Plain text parser to
-    use parser.character.encoding.default property for fall back encoding
-    (KuroSaka TeruHiko, siren)
-. NUTCH-61 - Support for adaptive re-fetch interval and detection of
-    unmodified content. (ab)
-. NUTCH-392 - OutputFormat implementations should pass on Progressable.
-    (cutting via ab)
-. NUTCH-495 - Unnecessary delays in Fetcher2 (dogacan)
-. NUTCH-443 - allow parsers to return multiple Parse object, this will speed
-    up the rss parser (dogacan via mattmann). This update is a fix and semantics
-    change from the original patch for NUTCH-443. The original patch did not tell
-    the  Indexer to read crawl_parse too so that it can pickup sub-urls' fetch
-    datums. This patch addresses that issue. Now, if Fetcher gets a null content,
-    instead of pushing an empty content, it filters the null content.
-. NUTCH-485 - Change HtmlParseFilter 's to return ParseResult object instead of
-    Parse object. (Gal Nitzan via dogacan)
-. NUTCH-489 - URLFilter-suffix management of the url path when the url contains
-    some query parameters. (Emmanuel Joke via dogacan)
-. NUTCH-502 - Bug in SegmentReader causes infinite loop.
-    (Ilya Vishnevsky via dogacan)
-. NUTCH-444 Possibly use a different library to parse RSS feed for improved
-    performance and compatibility. This patch introduced a new plugin, feed,
-    that includes an index filter and a parse plugin for feeds that uses ROME.
-    There was discussion to remove parse-rss, in light of the feed plugin,
-    however, this patch does not explicitly remove parse-rss. (dogacan, mattmann)
-. NUTCH-471 - Fix synchronization in NutchBean creation.
-    (Enis Soztutar via dogacan)
-. Upgrade to Lucene 2.2.0 and Hadoop 0.12.3. (ab)
-. NUTCH-468 - Scoring filter should distribute score to all outlinks at
-    once. (dogacan)
-. NUTCH-504 - NUTCH-443 broke parsing during fetching. (dogacan)
-. NUTCH-497 -  Extreme Nested Tags causes StackOverflowException in
-  DomContentUtils...Spider Trap. (kubes)
-. NUTCH-434 - Replace usage of ObjectWritable with something based on
-    GenericWritable. (dogacan)
-. NUTCH-499 - Refactor LinkDb and LinkDbMerger to reuse code. (dogacan)
-. NUTCH-498 - Use Combiner in LinkDb to increase speed of linkdb generation.
-    (Espen Amble Kolstad via dogacan)
-. NUTCH-507 - lib-lucene-analyzers jar defintion is wrong in plugin.xml.
-    (Emmanuel Joke via dogacan)
-. NUTCH-503 - Generator exits incorrectly for small fetchlists.
-    (Vishal Shah via dogacan)
-. NUTCH-505 - Outlink urls should be validated. (dogacan)
-. NUTCH-510 - IndexMerger delete working dir. (Enis Soztutar via dogacan)
-. NUTCH-513 - suffix-urlfilter.txt does not have a template. (dogacan)
-. NUTCH-515 - Next fetch time is set incorrectly. (dogacan)
-. NUTCH-506 - Nutch should delegate compression to Hadoop. (dogacan)
-. NUTCH-517 - build encoding should be UTF-8. (Enis Soztutar via dogacan).
-. NUTCH-518 - Fix OpicScoringFilter to respect scoring filter chaining.
-    (Enis Soztutar via dogacan)
-. NUTCH-516 - Next fetch time is not set when it is a
-    CrawlDatum.STATUS_FETCH_GONE. (Emmanuel Joke via dogacan)
-. NUTCH-525 - DeleteDuplicates generates ArrayIndexOutOfBoundsException
-    when trying to rerun dedup on a segment. (Vishal Shah via dogacan)
-. NUTCH-514 - Indexer should only index pages with fetch status SUCCESS.
-    (dogacan) Note: There is a bigger problem, i.e how to deal
-    with redirected pages, and this issue can be considered as a band-aid
-    for the time being. See NUTCH-273 and NUTCH-353 for more details.
-. NUTCH-533 - LinkDbMerger: url normalized is not updated in the key and
-    inlinks list. (Emmanuel Joke via dogacan)
-. NUTCH-535 -ParseData's contentMeta accumulates unnecessary values during
-    parse. (dogacan)
-. NUTCH-522 - Use URLValidator in the Injector. (Emmanuel Joke, dogacan)
-. NUTCH-536 - Reduce number of warnings in nutch core. (dogacan)
-. NUTCH-439 - Top Level Domains Indexing / Scoring. Also adds
-    domain-related utilities. (Enis Soztutar via dogacan)
-. NUTCH-544 - Upgrade Carrot2 clustering plugin to the newest stable
-    release (2.1). (Dawid Weiss via dogacan)
-. NUTCH-545 - Configuration and OnlineClusterer get initialized in every
-    request. (Dawid Weiss via dogacan)
-. NUTCH-532 - CrawlDbMerger: wrong computation of last fetch time.
-    (Emmanuel Joke via dogacan)
-. NUTCH-550 - Parse fails if db.max.outlinks.per.page is -1. (dogacan)
-. NUTCH-546 - file URL are filtered out by the crawler. (dogacan)
-. NUTCH-554 - Generator throws IOException on invalid urls.
-    (Brian Whitman via ab)
-. NUTCH-529 - NodeWalker.skipChildren doesn't work for more than 1 child.
-    (Emmanuel Joke via dogacan)
-. NUTCH-25 - needs 'character encoding' detector.
-    (Doug Cook, dogacan, Marcin Okraszewski, Renaud Richardet via dogacan)
-. NUTCH-508 - ${hadoop.log.dir} and ${hadoop.log.file} are not propagated
-    to the tasktracker. (Mathijs Homminga, Emmanuel Joke via dogacan)
-. NUTCH-562 - Port mime type framework to use Tika mime detection framework.
-    (mattmann)
-. NUTCH-488 - Avoid parsing uneccessary links and get a more relevant outlink
-    list. (Emmanuel Joke, Marcin Okraszewski via kubes)
-. NUTCH-501 -  Implement a different caching mechanism for objects cached in
-    configuration. (dogacan)
-. NUTCH-552 - Upgrade Nutch to Hadoop 0.15.x. (kubes)
-. NUTCH-565 - Arc File to Nutch Segments Converter. (kubes)
-. NUTCH-547 - Redirection handling: YahooSlurp's algorithm.
-    (dogacan, kubes via dogacan)
-. NUTCH-548 - Move URLNormalizer from Outlink to ParseOutputFormat.
-    (Emmanuel Joke via dogacan)
-. NUTCH-538 - Delete unused classes under o.a.n.util. (dogacan)
-. NUTCH-494 - FindBugs: CrawlDbReader and DeleteDuplicates. (dogacan)
-. NUTCH-574 - Including inlink anchor text in index can create irrelevant
-    search results.  Created index-anchor plugin, removed functionality from
-    index-basic plugin. For backwards compatibility, add index-anchor plugin to
-    nutch-site.xml plugin.includes. (kubes)
-. NUTCH-581 - DistributedSearch does not update search servers added to
-    search-servers.txt on the fly.  (Rohan Mehta via kubes)
-. NUTCH-586 - Add option to run compiled classes without job file
-    (enis via ab)
-. NUTCH-559 - NTLM, Basic and Digest Authentication schemes for web/proxy
-    server. (Susam Pal via dogacan)
-. NUTCH-534 - SegmentMerger: add -normalize option (Emmanuel Joke via ab)
-. NUTCH-528 - CrawlDbReader: add some new stats + dump into a CSV format
-    (Emmanuel Joke via ab)
-. NUTCH-597 - NPE in Fetcher2 (Remco Verhoef via ab)
-. NUTCH-584 - urls missing from fetchlist (Ruslan Ermilov, ab)
-. NUTCH-580 - Remove deprecated hadoop api calls (FS) (siren)
-. NUTCH-587 - Upgrade to Hadoop 0.15.3 (kubes)
-. NUTCH-604 - Upgrade to Lucene 2.3.0 (ab)
-. NUTCH-602 - Allow configurable number of handlers for search servers
-    (hartbecke via kubes)
-. NUTCH-607 - Update build.xml to include tika jar when building war (kubes)
-. NUTCH-608 - Upgrade nutch to use released apache-tika-0.1-incubating (mattmann)
-. NUTCH-606 - Refactoring of Generator, run all urls through checks (kubes)
-. NUTCH-605 - Change deprecated configuration methods for Hadoop (kubes)
-. NUTCH-603 - Add more default url normalizations (kubes)
-. NUTCH-611 - Upgrade Nutch to use Hadoop 0.16 (kubes)
-. NUTCH-44 - Too many search results, limits max results returned from a
-    single search. (Emilijan Mirceski and Susam Pal via kubes)
-. NUTCH-567 - Proper (?) handling of URIs in TagSoup. TagSoup library is
-    updated to 1.2 version. (dogacan)
-. NUTCH-613 - Empty summaries and cached pages (kubes via ab)
-. NUTCH-612 - URL filtering was disabled in Generator when invoked
-    from Crawl (Susam Pal via ab)
-. NUTCH-601 - Recrawling on existing crawl directory (Susam Pal via ab)
-. NUTCH-575 - NPE in OpenSearchServlet (John H. Lee via ab)
-. NUTCH-126 - Fetching https does not work with a proxy (Fritz Elfert via ab)
-. NUTCH-615 - Redirected URL-s fetched without setting fetchInterval.
-    Guard against reprUrl being null. (Emmanuel Joke, ab)
-. NUTCH-616 - Reset Fetch Retry counter when fetch is successful (Emmanuel
-    Joke, ab)
-. NUTCH-220 - Upgrade to PDFBox 0.7.3 (ab)
-. NUTCH-223 - Crawl.java uses Integer.MAX_VALUE (Jeff Ritchie via ab)
-. NUTCH-598 - Remove deprecated use of ToolBase. Use generics in Hadoop API.
-    (Emmanuel Joke, dogacan, ab)
-. NUTCH-620 - BasicURLNormalizer should collapse runs of slashes with a
-    single slash. (Mark DeSpain via ab)
-. NUTCH-500 - Add hadoop masters configuration file into conf folder.
-    (Emmanuel Joke via kubes)
-. NUTCH-596 - ParseSegments parse content even if its not
-    CrawlDatum.STATUS_FETCH_SUCCESS (dogacan)
-. NUTCH-618 - Tika error "Media type alias already exists" (mattmann,kubes)
-. NUTCH-634 - Upgrade Nutch to Hadoop 0.17.1 (Michael Gottesman, Lincoln
-    Ritter, ab)
-. NUTCH-641 - IndexSorter inorrectly copies stored fields (ab)
-. NUTCH-645 - Parse-swf unit test failing (ab)
-. NUTCH-642 - Unit tests fail when run in non-local mode (ab)
-. NUTCH-639 - Change LuceneDocumentWrapper visibility from
-    private to _public_ (Guillaume Smet via dogacan)
-. NUTCH-651 - Remove bin/{start|stop}-balancer.sh from svn
-    tracking. (dogacan)
-. NUTCH-375 - Add support for Content-Encoding: deflated
-    (Pascal Beis, ab)
-. NUTCH-633 - ParseSegment no longer allow reparsing.
-     (dogacan)
-. NUTCH-653 - Upgrade to hadoop 0.18. (dogacan)
-. NUTCH-621 - Nutch needs to declare it's crypto usage (mattmann)
-. NUTCH-654 - urlfilter-regex's main does not work.
-     (dogacan)
-. NUTCH-640 - confusing description "set it to Integer.MAX_VALUE".
-     (dogacan)
-. NUTCH-662 - Upgrade Nutch to use Lucene 2.4. (kubes)
-. NUTCH-663 - Upgrade Nutch to use Hadoop 0.19 (kubes)
-. NUTCH-647 - Resolve URLs tool (kubes)
-. NUTCH-665 - Search Load Testing Tool (kubes)
-. NUTCH-667 - Input Format for working with Content in Hadoop Streaming
-                 (kubes)
-. NUTCH-635 -  LinkAnalysis Tool for Nutch. (kubes)
-. NUTCH-646 -  New Indexing Framework for Nutch. (kubes)
-. NUTCH-668 -  Domain URL Filter. (kubes)
-. NUTCH-594 -  Serve Nutch search results in multiple formats including
-                  XML and JSON. (kubes)
-. NUTCH-442 - Integrate Solr/Nutch. (dogacan, original version by siren)
-. NUTCH-652 - AdaptiveFetchSchedule#setFetchSchedule doesn't calculate
-                 fetch interval correctly. (dogacan)
-. NUTCH-627 - Minimize host address lookup (Otis Gospodnetic)
-. NUTCH-678 - Hadoop 0.19 requires an update of jets3t.
-                 (julien nioche via dogacan)
-. NUTCH-681 - parse-mp3 compilation problem.
-                 (Wildan Maulana via dogacan)
-. NUTCH-676 - MapWritable is written inefficiently and confusingly.
-                 (dogacan)
-. NUTCH-579 - Feed plugin only indexes one post per feed due to identical
-                 digest. (dogacan)
-. NUTCH-571 - parse-mp3 plugin doesn't always index album of mp3.
-                 (Joseph Chen, dogacan)
-. NUTCH-682 - SOLR indexer does not set boost on the document.
-                 (julien nioche via dogacan)
-. NUTCH-279 - Additions to urlnormalizer-regex (Stefan Neufeind, ab)
-. NUTCH-671 - JSP errors in Nutch searcher webapp (Edwin Chu via ab)
-. NUTCH-643 - ClassCastException in PDF parser (Guillaume Smet, ab)
-. NUTCH-636 - Httpclient plugin https doesn't work on IBM JRE
-     (Curtis d'Entremont, ab)
-. NUTCH-683 - NUTCH-676 broke CrawlDbMerger. (dogacan)
-. NUTCH-631 - MoreIndexingFilter fails with NoSuchElementException
-     (Stefan Will, siren)
-. NUTCH-691 - Update jakarta poi jars to the most relevant version
-     (Dmitry Lihachev via siren)
-. NUTCH-563 - Include custom fields in BasicQueryFilter
-     (Julien Nioche via siren)
-. NUTCH-695 - Incorrect mime type detection by MoreIndexingFilter plugin
-     (Dmitry Lihachev via siren)
-. NUTCH-694 - Distributed Search Server fails (siren)
-. NUTCH-626 - Fetcher2 breaks out the domain with db.ignore.external.links
-     set at cross domain redirects (Remco Verhoef, dogacan via siren)
-. NUTCH-247 - Robot parser to restrict (kubes, siren)
-. NUTCH-698 - CrawlDb is corrupted after a few crawl cycles (dogacan
-     via siren)
-. NUTCH-699 - Add an "official" solr schema for solr integration (dogacan,
-     Dmitry Lihachev via siren)
-. NUTCH-703 - Upgrade to Hadoop 0.19.1 (ab)
-. NUTCH-419 - Unavailable robots.txt kills fetch (Carsten Lehmann,
-     Doug Cook via ab)
-. NUTCH-700 - Neko1.9.11 goes into a loop (Julien Nioche, siren)
-. NUTCH-669 - Consolidate code for Fetcher and Fetcher2 (siren)
-. NUTCH-711 - Indexer failing after upgrade to Hadoop 0.19.1 (ab)
-. NUTCH-684 - Dedup support for Solr. (dogacan)
-. NUTCH-715 - Subcollection plugin doesn't work with default
-     subcollections.xml file (Dmitry Lihachev via siren)
-. NUTCH-722 - Nutch contains JAI jars that we cannot redistribute
-Release 0.9 - 2007-04-02
-. Changed log4j confiquration to log to stdout on commandline
-    tools (siren)
-. NUTCH-344 - Fix for thread blocking issue (Greg Kim via siren)
-. NUTCH-260 - Update hadoop version to 0.5.0 (Renaud Richardet,
-    siren)
-. Optionally skip pages with abnormally large values of Crawl-Delay
-    (Dennis Kubes via ab)
-. Change readdb -stats to use CombiningCollector (ab)
-. NUTCH-348 - Fix Generator to select highest scoring pages (Chris
-    Schneider and Stefan Groschupf via ab)
-. NUTCH-347 - Adjust plugin build script not to emit warnings when copying
-    dependant jars (siren)
-. NUTCH-338 - Remove the text parser as an option for parsing PDF files
-    in parse-plugins.xml (Chris A. Mattmann via siren)
-. NUTCH-105 - Network error during robots.txt fetch causes file to
-    be ignored (Greg Kim via siren)
-. NUTCH-367 - DistributedSearch thown ClassCastException (siren)
-. NUTCH-332 - Fix the problem of doubling scores caused by links pointing
-    to the current page (e.g. anchors). (Stefan Groschupf via ab)
-. NUTCH-365 - Flexible URL normalization (ab)
-. NUTCH-336 - Differentiate between newly discovered pages and newly
-    injected pages (Chris Schneider via ab) NOTE: this changes the
-    scoring API, filter implementations need to be updated.
-. NUTCH-337 - Fetcher ignores the fetcher.parse value (Stefan Groschupf
-    via ab)
-. NUTCH-350 - Urls blocked by http.max.delays incorrectly marked as GONE
-    (Stefan Groschupf via ab)
-. NUTCH-374 - when http.content.limit be set to -1 and
-    Response.CONTENT_ENCODING is gzip or x-gzip , it can not fetch any thing
-    (King Kong via pkosiorowski)
-. NUTCH-383 - upgrade to Hadoop 0.7.1 and Lucene 2.0.0. (ab)
-  ****************************** WARNING !!! ********************************
-  * This upgrade breaks data format compatibility. A tool 'convertdb'       *
-  * was added to migrate existing CrawlDb-s to the new format. Segment data *
-  * can be partially migrated using 'mergesegs', however segments will      *
-  * require re-parsing (and consequently re-indexing).                      *
-  ****************************** WARNING !!! ********************************
-. NUTCH-371 - DeleteDuplicates now correctly implements both parts of
-    the algorithm. (ab)
-. NUTCH-391 - ParseUtil logs file contents to log file when it cannot
-    find parser (siren)
-. NUTCH-379 - ParseUtil does not pass through the content's URL to the
-    ParserFactory (Chris A. Mattmann via siren)
-. NUTCH-361, NUTCH-136 - When jobtracker is 'local' generate only one
-    partition. (ab)
-. NUTCH-399 - Change CommandRunner to use concurrent api from jdk (siren)
-. NUTCH-395 - Increase fetching speed (siren)
-. NUTCH-388 - nutch-default.xml has outdated example for urlfilter.order
-    (reported by Jared Dunne)
-. NUTCH-404 - Fix LinkDB Usage - implementation mismatch (siren)
-. NUTCH-403 - Make URL filtering optional in Generator (siren)
-. NUTCH-405 - Content object is not properly initialized in map method
-    of ParseSegment (siren)
-. NUTCH-362 - Remove parse-text from unsupported filetypes in
-    parse-plugins.xml (siren)
-. NUTCH-305 - Update crawl and url filter lists to exclude
-    jpeg|JPEG|bmp|BMP, suffix-urlfilter.txt (contributed by Stefan
-    Neufeind) is also updated (siren)
-. NUTCH-406 - Metadata tries to write null values (mattmann)
-. NUTCH-415 - Generator should mark selected records in CrawlDb.
-    Due to increased resource consumption this step is optional.
-    Application-level locking has been added to prevent concurrent
-    modification of databases. (ab)
-. NUTCH-416 - CrawlDatum status and CrawlDbReducer refactoring. It is
-    now possible to correctly update CrawlDb from multiple segments.
-    Introduce new status codes for temporary and permanent
-    redirection. (ab)
-. NUTCH-322 - Fix Fetcher to store redirected pages and to store
-    protocol-level status. This also should fix NUTCH-273. (ab)
-. Change default Fetcher behavior not to follow redirects immediately.
-    Instead Fetcher will record redirects as new pages to be added to CrawlDb.
-    This also partially addresses NUTCH-273. (ab)
-. Detect and report when Generator creates 0-sized segments. (ab)
-. Fix Injector to preserve already existing CrawlDatum if the seed list
-    being injected also contains such URL. (ab)
-. NUTCH-425, NUTCH-426 - Fix anchors pollution. Continue after
-    skipping bad URLs. (Michael Stack via ab)
-. NUTCH-325 - UrlFilters.java throws NPE in case urlfilter.order contains
-    Filters that are not in plugin.includes (Stefan Groschupf, siren)
-. NUTCH-421 - Allow predeterminate running order of indexing filters
-    (Alan Tanaman, siren)
-. When indexing pages with redirection, drop all intermediate pages and
-    index only the final page. (ab)
-. Upgrade to Hadoop 0.10.1. (ab)
-. NUTCH-420 - Fix a bug in DeleteDuplicates where results depended on the
-    order in which IndexDoc-s are processed. (Dogacan Guney via ab)
-. NUTCH-428 - NullPointerException thrown when agent name is not
-    configured properly. Changed to throw RuntimeException instead.
-    (siren)
-. NUTCH-430 - Integer overflow in HashComparator.compare (siren)
-. NUTCH-68 - Add a tool to generate arbitrary fetchlists. (ab)
-. NUTCH-433 - java.io.EOFException in newer nightlies in mergesegs
-    or indexing from hadoop.io.DataOutputBuffer (siren)
-. NUTCH-339 - Fetcher2: a queue-based fetcher implementation. (ab)
-. NUTCH-390 - Javadoc warnings (mattmann)
-. NUTCH-449 - Make junit output format configurable. (nigel via cutting)
-. NUTCH-432 - Fix a bug where platform name with spaces would break the
-    bin/nutch script. (Brian Whitman via ab)
-. Upgrade to Hadoop 0.11.2 and Lucene 2.1.0 release. (ab)
-. NUTCH-167 - Observation of robots "noarchive" directive. (ab)
-. NUTCH-384 - Protocol-file plugin does not allow the parse plugins
-    framework to operate properly (Heiko Dietze via mattmann)
-. NUTCH-233 - Wrong regular expression hangs reduce process forever (Stefan
-    Groschupf via kubes)
-. NUTCH-436 - Incorrect handling of relative paths when the embedded URL
-    path is empty (kubes)
-. Upgrade to Hadoop 0.12.1 release. (ab)
-. NUTCH-246 - Incorrect segment size being generated due to time
-    synchronization issue (Stefan Groschupf via ab)
-. Upgrade to Hadoop 0.12.2 release. (ab)
-. NUTCH-333 - SegmentMerger and SegmentReader should use NutchJob. (Michael
-    Stack and Dogacan Guney via kubes)
-Release 0.8 - 2006-07-25
-. Totally new architecture, based on hadoop
-    [http://lucene.apache.org/hadoop] (cutting)
-. NUTCH-107 - Typo in plugin/urlfilter-*/plugin.xml. (Stephen Cross).
-. NUTCH-108 - Log hosts that exceed generate.max.per.host.
-    (Rod Taylor via cutting)
-. NUTCH-88 - Enhance ParserFactory plugin selection policy
-    (jerome)
-. NUTCH-124 - Protocol-httpclient does not follow redirects when
-    fetching robots.txt (cutting)
-. NUTCH-130 - Be explicit about target JVM when building (1.4.x?)
-    (stack@archive.org, cutting)
-. NUTCH-114 - Getting number of urls and links from crawldb
-    (Stefan Groschupf via ab)
-. NUTCH-112 - Link in cached.jsp page to cached content is an
-    absolute link (Chris A. Mattmann via jerome)
-. NUTCH-135 - Http header meta data are case insensitive in the
-    real world (Stefan Groschupf via jerome)
-. NUTCH-145 - Build of war file fails on Chinese (zh) .xml files due
-    to UTF-8 BOM (KuroSaka TeruHiko via siren)
-. NUTCH-121 - SegmentReader for mapred (Rod Taylor via ab)
-. Added support for OpenSearch (cutting)
-. NUTCH-142 - NutchConf should use the thread context classloader
-    (Mike Cannon-Brookes via pkosiorowski)
-. NUTCH-160 - Use standard Java Regex library rather than
-    org.apache.oro.text.regex (Rod Taylor via cutting)
-. NUTCH-151 - CommandRunner can hang after the main thread exec is
-    finished and has inefficient busy loop (Paul Baclace via cutting)
-. NUTCH-174 - Problem encountered with ant during compilation
-. NUTCH-190 - ParseUtil drops reason for failed parse
-    (stack@archive.org via ab)
-. NUTCH-169 - Remove static NutchConf (Marko Bauhardt via ab)
-. NUTCH-194 - Nutch-169 introduced two tiny bugs (Marko Bauhardt via ab)
-. NUTCH-178 - in search.jsp must be session creation "false"
-    (YourSoft via siren)
-. NUTCH-200 - OpenSearch Servlet ist broken
-    (Marko Bauhardt via siren)
-. NUTCH-81 - Webapp only works when deployed in root
-    (AJ Banck, Michael Nebel via siren)
-. NUTCH-139 - Standard metadata property names in the ParseData
-    metadata (Chris A. Mattmann, jerome)
-. NUTCH-192 - Meta data support for CrawlDatum
-    (Stefan Groschupf via ab)
-. NUTCH-52 - Parser plugin for MS Excel files
-    (Rohit Kulkarni via jerome)
-. NUTCH-53 -  Parser plugin for Zip files
-    (Rohit Kulkarni via jerome)
-. NUTCH-137 - footer is not displayed in search result page
-    (KuroSaka TeruHiko via siren)
-. NUTCH-118 - FAQ link points to invalid URL
-    (Steve Betts via siren)
-. NUTCH-184 - Serbian (sr, Cyrilic) and Serbo-Croatian (sh, Latin)
-    translation (Ivan Sekulovic via siren)
-. NUTCH-211 - FetchedSegments leave readers open (Stefan Groschupf
-    via cutting)
-. NUTCH-140 - Add alias capability in parse-plugins.xml file that
-    allows mimeType->extensionId mapping (Chris A. Mattmann via jerome)
-. NUTCH-214 - Added Links to web site to search mailling list
-    (Jake Vanderdray via jerome)
-. NUTCH-204 - Multiple field values in HitDetails
-    (Stefan Groschupf via jerome)
-. NUTCH-219 - file.content.limit & ftp.content.limit should be changed
-    to -1 to be consistent with http (jerome)
-. NUTCH-221 - Prepare nutch for upcoming lucene 2.0 (siren)
-. NUTCH-91 - Empty encoding causes exception (Michael Nebel via
-    pkosiorowski)
-. NUTCH-228 - Clustering plugin descriptor broken (Dawid Weiss via
-    jerome)
-. NUTCH-229 - Improved handling of plugin folder configuration
-    (Stefan Groschupf via ab)
-. NUTCH-206 - Search server throws InstantiationException (ab)
-. NUTCH-203 - ParseSegment throws InstantiationException (Marko Bauhardt
-    via ab)
-. NUTCH-3 - Multi values of header discarded (Stefan Groschupf via ab)
-. Update to lucene 1.9.1 (cutting)
-. NUTCH-235 - Duplicate Inlink values (ab)
-. NUTCH-234 - Clustering extension code cleanups and a real
-    JUnit test case for the current implementation (Dawid Weiss via ab)
-. NUTCH-210 - Context.xml file for Nutch web application
-    (Chris A. Mattmann via jerome)
-. NUTCH-231 - Invalid CSS entries (AJ Banck via jerome)
-. NUTCH-232 - Search.jsp has multiple search forms creating
-    invalid html / incorrect focus function (jerome)
-. NUTCH-196 - lib-xml and lib-log4j plugins (ab, jerome)
-. NUTCH-244 - Inconsistent handling of property values
-    boundaries / unable to set db.max.outlinks.per.page to
-    infinite (jerome)
-. NUTCH-245 - DTD for plugin.xml configuration files
-    (Chris A. Mattmann via jerome)
-. NUTCH-250 - Generate to log truncation caused by
-    generate.max.per.host (Rod Taylor via cutting)
-. NUTCH-125 - OpenOffice Parser plugin (ab)
-. Switch from using java.io.File to org.apache.hadoop.fs.Path.
-    (cutting)
-. NUTCH-240 - Scoring API: extension point, scoring filters and
-    an OPIC plugin (ab)
-. NUTCH-134 - Summarizer doesn't select the best snippets (jerome)
-. NUTCH-268 - Generator and lib-http use different definitions of
-    "unique host" (ab)
-. NUTCH-280 - Url query causes NullPointerException (Grant Glouser
-    via siren)
-. NUTCH-285 - LinkDb Fails rename doesn't create parent directories
-    (Dennis Kubes via ab)
-. NUTCH-201 - Add support for subcollections
-    (siren)
-. NUTCH-298 - If a 404 for a robots.txt is returned a NPE is thrown
-    (Stefan Groschupf via jerome)
-. NUTCH-275 - Fetcher not parsing XHTML-pages at all (jerome)
-. NUTCH-301 - CommonGrams loads analysis.common.terms.file for each query
-    (Stefan Groschupf via jerome)
-. NUTCH-110 - OpenSearchServlet outputs illegal xml characters
-    (stack@archive.org via siren)
-. NUTCH-292 - OpenSearchServlet: OutOfMemoryError: Java heap space
-    (Stefan Neufeind via siren)
-. NUTCH-307 - Wrong configured log4j.properties (jerome)
-. NUTCH-303 - Logging improvements (jerome)
-. NUTCH-308 - Maximum search time limit (ab)
-. NUTCH-306 - DistributedSearch.Client liveAddresses concurrency
-    problem (Grant Glouser via siren)
-. Update to hadoop-0.4 (Milind Bhandarkar, cutting)
-. NUTCH-317 - Clarify what the queryLanguage argument of
-    Query.parse(...) means (jerome)
-. Added alternative experimental web gui in contrib containing
-    extensions like subcollection, keymatch, user preferences,
-    caching, implemented mainly using tiles and jstl (siren)
-. NUTCH-320 DmozParser does not output list of urls to stdout
-    but to a log file instead. Original functionality restored.
-. NUTCH-271 - Add ability to limit crawling to the set of initially
-    injected hosts (db.ignore.external.links) (Philippe Eugene,
-    Stefan Neufeind via ab)
-. NUTCH-293 - Support for Crawl-Delay (Stefan Groschupf via ab)
-. NUTCH-327 - Fixed logging directory on cygwin (siren)
-Release 0.7 - 2005-08-17
-. Added support for "type:" in queries. Search results are limited/qualified
-    by mimetype or its primary type or sub type. For example,
-    (1) searching with "type:application/pdf" restricts results
-    to pages which were identified to be of mimetype "application/pdf".
-    (2) with "type:application", nutch will return pages of
-    primary type "application".
-    (3) with "type:pdf", only pages of sub type "pdf" will be listed.
-    (John Xing, 20050120)
-. Added support for "date:" in queries. Last-Modified is indexed.
-    Search results are restricted by lower and upper date (inclusive)
-    as date:yyyymmdd-yyyymmdd. For example, date:20040101-20041231
-    only returns pages with Last-Modified in year 2004.
-    (John Xing, 20050122)
-. Add URLFilter plugin interface and convert existing url filters into
-    plugins. (John Xing, 20050206)
-. Add UpdateSegmentsFromDb tool, which updates the scores and
-    anchors of existing segments with the current values in the web
-    db.  This is used by CrawlTool, so that pages are now only fetched
-    once per crawl.  (Doug Cutting, 20050221)
-. Moved code into org.apache.nutch sub-packages.  Changed license to
-    Apache 2.0.  Removed jar files whose licenses do not permit
-    redistribution by Apache.  Disabled compilation of plugins which
-    require these libraries.  (Doug Cutting 20050301)
-. Index host and title in separate fields.  Host was indexed
-    previously only as a part of the URL.  Title was indexed as an
-    anchor.  Now boosts for matching these fields may be adjusted
-    separately from boosts for matching anchors and url.  Also: move
-    site indexing to index-basic plugin to minimize the number of
-    times the URL needs to be parsed; and, stop using anchor analyzer
-    for anything but anchors.  (Piotr Kosiorowski via Doug Cutting
-    20050323)
-. Add servlet Cached.java that serves cached Content of any mime type.
-    Slightly modified are web.xml and cached.jsp.
-    (John Xing, 20050401)
-. Add skipCompressedByteArray() to WritableUtils.java.
-    (John Xing, 20050402)
-. Fixes to jsp and static web pages.  These now use relative links,
-    so that the Nutch webapp file can be used in places other than at
-    the root.  Also fixed links to the about and help pages.  Bug #32.
-    (Jerome Charron via cutting, 20050404)
-. Added some features to DistributedSearch: new segments can be added
-    to searchservers without restarting the frontend, defective search
-    servers are not queried until tey come back online, watchdog keeps
-    an eye for your searchservers and writes simple statistics.
-    (Sami Siren, 20050407)
-. Fix for bug #4 - Unbalanced quote in query eats all resources.
-  (Piotr Kosiorowski, Sami Siren, 20050407)
-. Close Issue #33 - MIME content type detector (using magic char sequences).
-    (Jerome Charron and Hari Kodungallur via John Xing, 20050416)
-. Add a servlet that implements A9's OpenSearch RSS web service.
-    (cutting, 20050418)
-. Remove references to link analysis from tutorial, and enable
-    scoring by link count when generating fetchlists and searching.
-    (cutting, 20040419)
-. Make query boosts for host, title, anchor and phrase matches
-    configurable.  (Piotr Kosiorowski via cutting, 20050419)
-. Add support for sorting search results and search-time deduping by
-    fields other than site.
-. Automatically convert range queries into cached range filters.
-    This improves the performance and scalability of, e.g., date range
-    searching.
-. Several methods have been renamed due to misspellings.  The old
-    methods have been deprecated and will be removed before the 1.0
-    release.
-Release 0.6
-. Added clustering-carrot2 plugin, together with introduction of clustering
-    api and modification to search jsp. (Dawid Weiss via John Xing, 20040809)
-. Make a number of changes to NDFS (Nutch Distributed File System)
-    to fix bugs, add admin tools, etc.
-    Also, modify all command line tools so you can indicate whether to
-    use NDFS or the local filesystem.  If you indicate nothing, then
-    it defaults to the local fs.
-    I've used this to do a 35m page crawl via NDFS, distributed over a
-    dozen machines.  (Mike Cafarella)
-. Add support for BASE tags in HTML.  Outlinks are now correctly
-    extracted when a BASE tag is present.  (cutting)
-. Fix two bugs in result pagination.  When the last hit on a page
-    was the last hit overall, the "next" button was sometimes shown
-    when the "show all" button should be shown instead.  Also, in
-    certain cases, the "show all" button would be shown when the
-    "next" button should have been shown.  (cutting)
-. Add config parameter "indexer.max.tokens" that determines the
-    maximum number of tokens indexed per field.  (Andy Hedges via cutting)
-. Add parser for mp3 files.  (Andy Hedges via cutting)
-. Add RegexUrlNormalizer.  This is useful for things like stripping
-    out session IDs from URLs.  To use it, add values for
-    urlnormalizer.class and urlnormalizer.regex.file to your
-    nutch-site.xml.  The RegexUrlNormalizer class extends the
-    BasicUrlNormalizer, and does basic normalization as well.
-    (Luke Baker via cutting)
-. Added Swedish translation (Stefan Verzel via Sami Siren, 20040910)
-. Added Polish translation (Andrzej Bialecki, 20040911)
-. Added 3 more language profiles to language identifier (ru,hu,pl).
-  Other changes to language identifier: Profiles converted to utf8,
-  added some test cases, changed the similarity calculation.
-  (Sami Siren, 20040925)
-. Added plugin parse-rtf (Andy Hedges via John Xing, 20040929)
-. Added plugin index-more and more.jsp (John Xing, 20041003)
-. Added "View as Plain Text" feature. A new op OP_PARSETEXT is introduced
-    in DistributedSearch.java. text.jsp is added. (John Xing, 20041006)
-. Fixed a bug that fails cached.jsp, explain.jsp, anchors.jsp and text.jsp
-    (but not search.jsp) with NullPointerException in distributed search.
-    It seems that this bug appears after "hits per site" stuff is added.
-    The fix is done in Hit.java, making sure String site is never null.
-    Hopefully this fix has no bad effect on the "hits per site" code.
-    (John Xing, 20041006)
-. Fixed a bug that fails fullyDelete() in FileUtil.java for
-    LocalFileSystem.java. This bug also exposes possible incompleteness
-    of NDFSFile.java, where a few methods are not supported, including
-    delete(). Nothing changed in NDFSFile.java though. Leave it for future
-    improvement (John Xing, 20041022).
-. Introduced option -noParsing to Fetcher.java and added ParseSegment.java.
-    A new status code CANT_PARSE is added to FetcherOutput.java.
-    Without option -noParsing, there is no change in fetcher behavior. With
-    option -noParsing, the fetcher only crawls; no parsing is carried out.
-    ParseSegment.java should then be used to parse in a separate pass.
-    (John Xing, 20041025)
-. Added ontology plugin. Currently it is used for query refinement, as
-    exemplified in refine-query-init.jsp and refine-query.jsp. By default,
-    query refinement is disabled in search.jsp. Please check
-    ./src/plugin/ontology/README.txt for further description.
-    Ontology plugin certainly can be used for many other things.
-    (Michael J. Pan via John Xing, 20041129)
-. Changed fetcher.server.delay to be a float, so that sub-second
-    delays can be specified.  (cutting)
-. Added plugin.includes config parameter that determines which
-    plugins are included.  By default now only http, html and basic
-    indexing and search plugins are enabled, rather than all plugins.
-    This should make default performance more predictable and reliable
-    going forward. (cutting)
-. Cleaned up some filesystem code, including:
-    - Replaced BufferedRandomAccessFile with two simpler utilities,
-      NFSDataInputStream and NFSDataOutputStream.
-    - Fixed the bug where SequenceFiles were no longer flushed when
-      created, so that, when fetches crashed, segments were
-      unreadable.  Now segments are always readable after crashes.
-      Only the contents of the last buffer is lost.
-    - Simplified the FSOutputStream API to not include seek().  We
-      should never need that functionality.
-    - Simplified LocalFileSystem's implementations of FSInputStream
-      and FSOutputStream and optimized FSInputStream.seek().
-    (cutting)
-. Fixed BasicUrlNormalizer to better handle relative urls.  The file
-    part of a URL is normalized in the following manner:
-      1. "/aa/../" will be replaced by "/" This is done step by step until
-         the url doesn't change anymore. So we ensure, that
-         "/aa/bb/../../" will be replaced by "/", too
-      2. leading "/../" will be replaced by "/"
-    (Sven Wende via cutting)
-. Fix Page constructors so that next fetch date is less likely to be
-    misconstrued as a float.  This patches a problem in WebDBInjector,
-    where new pages were added to the db with nextScore set to the
-    intended nextFetch date.  This, in turn, confused link analysis.
-. In ndfs code, replace addLocalFile(), putToLocalFile() with
-    copyFromLocalFile(), moveFromLocalFile(), copyToLocalFile() and
-    moveToLocalFile(). (John Xing, 20041217)
-. Added new config parameter fetcher.threads.per.host.  This is used
-    by the Http protocol.  When this is one behavior is as before.
-    When this is greater than one then multiple threads are permitted
-    to access a host at once.  Note that fetcher.server.delay is no
-    longer consistently observed when this is greater than one.
-    (Luke Baker via Doug Cutting)
-Release 0.5
-. Changed plugin directory to be a list of directories.
-. Permit Plugin to be the default plugin implementation.
-. Added pluggable interface for network protocols in new package
-    net.nutch.protocol.  Moved http code from core into a plugin.
-. Added pluggable interface for content parsing in new package
-    net.nutch.parse.  Moved html parsing code from core into a
-    plugin.
-. Fixed a bug in NutchAnalysis where 16-bit characters were not
-    processed correctly.
-. Fixed bug #971731: random summaries on result page.
-    (Daniel Naber via cutting)
-. Made Nutch logo transparent. (Daniel Naber via cutting)
-. Added file protocol plugin.  (John Xing via cutting)
-. Added ftp protocol plugin.  (John Xing via cutting)
-. Added pdf and msword parser plugins.  (John Xing via cutting)
-. Added pluggable indexing interface.  By default, url, content,
-    anchors and title are indexed, as before, but now one can easily
-    alter this to, e.g., index metadata.  A demonstration is provided
-    which extracts and indexes Creative Commons license urls. (cutting)
-. Add language identification plugin.
-    The process of identification is as follows:
-    1. html (html only, HTML 4.0 "lang" attribute)
-    2. meta tags (html only, http-equiv, dc.language)
-    3. http header (Content-Language)
-    4. if all above fail "statistical analysis"
-    1 & 2 are run during the fetching phase and 3 & 4 are run on
-    indexing phase.
-    Currently supported languages (in "statistical analysis") are
-    da,de,el,en,es,fi,fr,it,nl,sv and pt. The corpus used was grabbed
-    from http://www.isi.edu/~koehn/europarl/ and the profiles were
-    built with the tool supplied in the patch.
-    After indexing the language can be found from field named "lang"
-    It's not 100% accurate but it's a start.
-    (Sami Siren)
-. Added SegmentMergeTool and "mergesegs" command, to remove
-    duplicated or otherwise not used content from several segments and
-    joining them together into a single new segment.  The tool also
-    optionally performs several other steps required for proper
-    operation of Nutch - such as indexing segments, deleting
-    duplicates, merging indices, and indexing the new single segment.
-    (Andrzej Bialecki)
-. Add the ability to retrieve ParseData of a search hit. ParseData
-    contains many valuable properties of a search hit.
-    This is required (among others) to properly display the cached
-    content because it's not possible to determine the character
-    encoding from the output of the getContent() method (which returns
-    byte[]). The symptoms are that for HTML pages using non-latin1 or
-    non-UTF8 encodings the cached preview will almost certainly look
-    broken. Using the attached patch it is possible to determine the
-    character encoding from the ParseData (for HTTP: Content-Type
-    metadata), and encode the content accordingly. (Andrzej Bialecki)
-. Add a pluggable query interface.  By default, the content, anchor
-    and url fields are searched as before.  A sample plugin indexes
-    the host name and adds a "site:" keyword to query parsing.
-. Added support for "lang:" in queries.  For example, searching with
-    "lang:en" restricts results to pages which were identified to
-    be in English.
-. Automatically optimize field queries to use cached Lucene filters.
-    This makes, for example, searches restricted by languages or sites
-    that are very common much faster.
-. Improved charset handling in jsp pages.  (jshin by cutting)
-. Permit topic filtering when injecting DMOZ pages.  (jshin by cutting)
-. When parsing crawled pages, interpret charset specifications in
-    html meta tags.  (jshin by cutting)
-. Added support for "cc:licensed" in queries, which searches for documents
-    released under Creative Commons licenses.  Attributes of the
-    license may also be queried, with, e.g., "cc:by" for
-    attribution-required licenses, "cc:nc" for non-commercial
-    licenses, etc.
-. Relative paths named in plugin.folders are now searched for on the
-    classpath.  This makes, e.g., deployment in a war file much simpler.
-. Modifications to Fetcher.java.
-. Make sure it works properly with regard to creation and initialization
-    of plugin instances. The problem was that multiple threads race to
-    startUp() or shutDown() plugin instances. It was solved by synchronizing
-    certain codes in PluginRepository.java and Extension.java.
-    (Stefan Groschupf via John Xing)
-. Added code to explicitly shutDown() plugins. Otherwise FetcherThreads
-    may never return (quit) if there are still data or other structures
-    (e.g., persistent socket connections) associated with plugins. (John Xing)
-. Fixed one type of Fetcher "hang" problems by monitoring named
-    FetcherThreads. If all FetcherThreads are gone (finished),
-    Fetcher.java is considered done. The problem was: there could be
-    runaway threads started by external libs via FetcherThreads.
-    Those threads never return, thus keep Fetcher from exiting normally.
-    (John Xing)
-. Eliminate excessive hits from sites.  This is done efficiently by
-    adding the site name to Hit instances, and, when needed,
-    re-querying with too-frequent sites prohibited in the query.
-Release 0.4
-. Http class refactored.  (Kevin Smith via Tom Pierce)
-. Add Finnish translation. (Sampo Syreeni via Doug Cutting)
-. Added Japanese translation. (Yukio Andoh via Doug Cutting)
-. Updated Dutch translation. (Ype Kingma via Doug Cutting)
-. Initial version of Distributed DB code.  (Mike Cafarella)
-. Make things more tolerant of crashed fetcher output files.
-    (Doug Cutting)
-. New skin for website. (Frank Henze via Doug Cutting)
-. Added Spanish translation. (Diego Basch via Doug Cutting)
-. Add FTP support to fetcher.  (John Xing via Doug Cutting)
-. Added Thai translation. (Pichai Ongvasith via Doug Cutting)
-. Added Robots.txt & throttling support to Fetcher.java.  (Mike
-    Cafarella)
-. Added nightly build. (Doug Cutting)
-. Default all link scores to 1.0. (Doug Cutting)
-. Permit one to keep internal links. (Doug Cutting)
-. Fixed dedup to select shortest URL. (Doug Cutting)
-. Changed index merger so that merged index is written to named
-    directory, rather than to a generated name in that directory.
-    (Doug Cutting)
-. Disable coordination weighting of query clauses and other minor
-    scoring improvements. (Doug Cutting)
-. Added a new command, crawl, that constructs a database, injects a
-    url file and performs a few rounds of generate/fetch/updatedb.
-    This simplifies use for intranet sites.  Changed some defaults to
-    be more intranet friendly.  (Doug Cutting)
-. Fixed a bug where Fetcher.java didn't construct correct relative
-    links when a page was redirected.  (Doug Cutting)
-. Fixed a query parser problem with lookahead over plusses and minuses.
-    (Doug Cutting)
-. Add support for HTTP proxy servers.  (Sami Siren via Doug Cutting)
-. Permit searching while fetching and/or indexing.
-    (Sami Siren via Doug Cutting)
-. Fix a bug when throttling is disabled.  (Sami Siren via Doug Cutting)
-. Updated Bahasa Malaysia translation.  (Michael Lim via Doug Cutting)
-. Added Catalan translation.  (Xavier Guardiola via Doug Cutting)
-. Added Brazilian Portuguese translation.
-    (A. Moreir via Doug Cutting)
-. Added a French translation.  (Julien Nioche via Doug Cutting)
-. Updated to Lucene 1.4RC3.  (Doug Cutting)
-. Add capability to boost by link count & use it in crawl tool.
-    (Doug Cutting)
-. Added plugin system.  (Stefan Groschupf via Doug Cutting)
-. Add this change log file, for recording significant changes to
-    Nutch.  Populate it with changes from the last few months.

nutchez-0.1/debian/control

r66	r73
1		Source: nutch
	1	Source: nutchez
2	2	Section:devel
3	3	Priority: extra

nutchez-0.1/debian/files

r66	r73
1		nutch_1.0-1_i386.deb devel extra
	1	nutchez_0.1-1_i386.deb devel extra

nutchez-0.1/debian/nutchez.install

r67	r73
6	6	tomcat opt/nutch
7	7	plugins opt/nutch
8		urls opt/nutch
9	8	*.jar opt/nutch
10	9	*.job opt/nutch
