Merging 18 segments to /user/crawler/newcrawl_3/segments/20110725175250
SegmentMerger: adding /user/crawler/crawl1/segments/20110722163308/content
SegmentMerger: adding /user/crawler/crawl1/segments/20110722163308/crawl_fetch
SegmentMerger: adding /user/crawler/crawl1/segments/20110722163308/crawl_generate
SegmentMerger: adding /user/crawler/crawl1/segments/20110722163308/crawl_parse
SegmentMerger: adding /user/crawler/crawl1/segments/20110722163308/parse_data
SegmentMerger: adding /user/crawler/crawl1/segments/20110722163308/parse_text
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151117/content
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151117/crawl_fetch
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151117/crawl_generate
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151117/crawl_parse
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151117/parse_data
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151117/parse_text
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151312/content
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151312/crawl_fetch
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151312/crawl_generate
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151312/crawl_parse
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151312/parse_data
SegmentMerger: adding /user/crawler/crawl2/segments/20110531151312/parse_text
SegmentMerger: using segment data from:
Exception in thread "main" java.io.IOException: No input paths specified in job
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:152)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:638)
at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:683)
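[note] The merge dies before any map task runs. Look at the "adding" lines above: each one names a part directory (content, crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text) rather than the segment directory that contains it, and "Merging 18 segments" counts exactly those 18 parts. A plausible cause is the wrapper script globbing one level too deep (segments/*/* instead of segments/*); SegmentMerger then probes each supposed segment for its parts, finds none, logs the empty "using segment data from:" list seen above, and submits a job with no input paths. A minimal sketch of the merge at the correct depth, assuming the output argument is the parent segments directory (SegmentMerger appends its own timestamped segment name):

    # Merge the three source segments into a new segment under newcrawl_3/segments
    /opt/crawlzilla/nutch/bin/nutch mergesegs /user/crawler/newcrawl_3/segments \
        /user/crawler/crawl1/segments/20110722163308 \
        /user/crawler/crawl2/segments/20110531151117 \
        /user/crawler/crawl2/segments/20110531151312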
Update segments
LinkDb: starting at 2011-07-25 17:52:55
LinkDb: linkdb: /user/crawler/newcrawl_3/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: java.io.IOException: No input paths specified in job
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:152)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
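[note] This is a cascade of the merge failure rather than a separate bug: no merged segment was written, so /user/crawler/newcrawl_3/segments is empty (the ls error below confirms it) and LinkDb has no input. Once a merged segment exists, the standalone equivalent of this step would be, as a sketch:

    # Build the linkdb from every segment under newcrawl_3/segments
    /opt/crawlzilla/nutch/bin/nutch invertlinks /user/crawler/newcrawl_3/linkdb \
        -dir /user/crawler/newcrawl_3/segments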
Index segments
ls: Cannot access /user/crawler/newcrawl_3/segments/*: No such file or directory.
[check] /opt/crawlzilla/nutch/bin/nutch index /user/crawler/newcrawl_3/newindexes /user/crawler/newcrawl_3/crawldb /user/crawler/newcrawl_3/linkdb
Usage: Indexer <index> <crawldb> <linkdb> <segment> ...
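[note] Indexer prints its usage line because it was invoked without any segment argument: the segment glob matched nothing (see the ls error above), so the [check] command carried only <index> <crawldb> <linkdb>. With a merged segment in place, a complete invocation would look like this sketch (the segment name is illustrative, taken from the first line of this log; a rerun of the merge would produce a fresh timestamp):

    # Indexer needs at least one segment after the index, crawldb and linkdb paths
    /opt/crawlzilla/nutch/bin/nutch index /user/crawler/newcrawl_3/newindexes \
        /user/crawler/newcrawl_3/crawldb /user/crawler/newcrawl_3/linkdb \
        /user/crawler/newcrawl_3/segments/20110725175250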
De-duplicate indexes
Dedup: starting at 2011-07-25 17:53:02
Dedup: adding indexes in: /user/crawler/newcrawl_3/newindexes
DeleteDuplicates: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://crawlweb1:9000/user/crawler/newcrawl_3/newindexes
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.nutch.indexer.DeleteDuplicates$InputFormat.getSplits(DeleteDuplicates.java:149)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:451)
at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:519)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:503)
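[note] Same cascade: newindexes is only created by a successful index step, which aborted above, so DeleteDuplicates has no input path. A quick existence check before running dedup, as a sketch:

    # Confirm the indexes actually exist on HDFS before deduplicating
    hadoop fs -ls hdfs://crawlweb1:9000/user/crawler/newcrawl_3/newindexes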
Merge indexes
IndexMerger: starting at 2011-07-25 17:53:07
IndexMerger: merging indexes to: /user/crawler/newcrawl_3/index
IndexMerger: finished at 2011-07-25 17:53:07, elapsed: 00:00:00
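[note] IndexMerger reports success, but an elapsed time of 00:00:00 suggests it had nothing to merge; the resulting index is empty at best. A sanity check on the output, as a sketch:

    # An empty or missing listing here means the index merge was a no-op
    hadoop fs -ls /user/crawler/newcrawl_3/index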
Some stats
CrawlDb statistics start: /user/crawler/newcrawl_3/crawldb
Statistics for CrawlDb: /user/crawler/newcrawl_3/crawldb
TOTAL urls: 514
retry 0: 514
min score: 0.0
avg score: 0.010715953
max score: 1.076
status 1 (db_unfetched): 454
status 2 (db_fetched): 52
status 3 (db_gone): 2
status 5 (db_redir_perm): 6
CrawlDb statistics: done
finish on : /home/crawler/newcrawl_3
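[note] The crawldb itself is healthy: the status counts add up to the total (454 unfetched + 52 fetched + 2 gone + 6 permanent redirects = 514 URLs). Only the segment merge needs fixing; once it succeeds, the linkdb, index, dedup and index-merge steps should be rerun in that order. The statistics block can be reproduced on its own with:

    # Dump crawldb statistics (same output as the block above)
    /opt/crawlzilla/nutch/bin/nutch readdb /user/crawler/newcrawl_3/crawldb -stats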