Merging 18 segments to /user/crawler/newcrawl_3/segments/20110725175250
SegmentMerger: adding /user/crawler/crawl1/segments/20110722163308/content
SegmentMerger: adding /user/crawler/crawl1/segments/20110722163308/crawl_fetch
SegmentMerger: adding /user/crawler/crawl1/segments/20110722163308/crawl_generate
SegmentMerger: adding /user/crawler/crawl1/segments/20110722163308/crawl_parse
SegmentMerger: adding /user/crawler/crawl1/segments/20110722163308/parse_data
SegmentMerger: adding /user/crawler/crawl1/segments/20110722163308/parse_text
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151117/content
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151117/crawl_fetch
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151117/crawl_generate
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151117/crawl_parse
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151117/parse_data
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151117/parse_text
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151312/content
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151312/crawl_fetch
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151312/crawl_generate
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151312/crawl_parse
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151312/parse_data
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151312/parse_text
SegmentMerger: using segment data from:
Exception in thread "main" java.io.IOException: No input paths specified in job
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:152)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:638)
        at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:683)
Update segments
LinkDb: starting at 2011-07-25 17:52:55
LinkDb: linkdb: /user/crawler/newcrawl_3/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: java.io.IOException: No input paths specified in job
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:152)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
        at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
Index segments
ls: Cannot access /user/crawler/newcrawl_3/segments/*: No such file or directory.
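Every failure above traces back to the first one: the "SegmentMerger: using segment data from:" line is empty, and the merge job aborts with "No input paths specified in job", so the merged segment under /user/crawler/newcrawl_3/segments/20110725175250 is never written. The LinkDb inversion and the ls over newcrawl_3/segments/* then fail for the same reason. A minimal diagnostic sketch, assuming the stock Nutch 1.x CLI shipped with Crawlzilla and the paths from the log (the mergesegs invocation is illustrative, not taken from this log):

    # Confirm the source segments are actually visible on HDFS
    hadoop fs -ls /user/crawler/crawl1/segments/20110722163308
    hadoop fs -ls /user/crawler/crawl2/segments/20110531151117
    hadoop fs -ls /user/crawler/crawl2/segments/20110531151312

    # Re-run the merge by hand, naming the segments explicitly
    # (output directory first, then the segments to merge)
    /opt/crawlzilla/nutch/bin/nutch mergesegs /user/crawler/newcrawl_3/segments \
        /user/crawler/crawl1/segments/20110722163308 \
        /user/crawler/crawl2/segments/20110531151117 \
        /user/crawler/crawl2/segments/20110531151312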
[check] /opt/crawlzilla/nutch/bin/nutch index /user/crawler/newcrawl_3/newindexes /user/crawler/newcrawl_3/crawldb /user/crawler/newcrawl_3/linkdb
Usage: Indexer <index> <crawldb> <linkdb> <segment> ...
De-duplicate indexes
Dedup: starting at 2011-07-25 17:53:02
Dedup: adding indexes in: /user/crawler/newcrawl_3/newindexes
DeleteDuplicates: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://crawlweb1:9000/user/crawler/newcrawl_3/newindexes
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
        at org.apache.nutch.indexer.DeleteDuplicates$InputFormat.getSplits(DeleteDuplicates.java:149)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:451)
        at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:519)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:503)
Merge indexes
IndexMerger: starting at 2011-07-25 17:53:07
IndexMerger: merging indexes to: /user/crawler/newcrawl_3/index
IndexMerger: finished at 2011-07-25 17:53:07, elapsed: 00:00:00
Some stats
CrawlDb statistics start: /user/crawler/newcrawl_3/crawldb
Statistics for CrawlDb: /user/crawler/newcrawl_3/crawldb
TOTAL urls: 514
retry 0: 514
min score: 0.0
avg score: 0.010715953
max score: 1.076
status 1 (db_unfetched): 454
status 2 (db_fetched): 52
status 3 (db_gone): 2
status 5 (db_redir_perm): 6
CrawlDb statistics: done
finish on : /home/crawler/newcrawl_3
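The Indexer step prints only its usage line because it was invoked with no segment arguments (the segments glob expanded to nothing after the failed merge), so /user/crawler/newcrawl_3/newindexes is never created; Dedup then fails on that missing path, and IndexMerger "finishes" in zero seconds with nothing to merge. The CrawlDb statistics still succeed because the crawldb exists independently of the segments. A sketch of the call the usage line asks for, once a merged segment exists (the segment timestamp is the one announced at the top of the log and is only illustrative):

    /opt/crawlzilla/nutch/bin/nutch index \
        /user/crawler/newcrawl_3/newindexes \
        /user/crawler/newcrawl_3/crawldb \
        /user/crawler/newcrawl_3/linkdb \
        /user/crawler/newcrawl_3/segments/20110725175250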