wiki:waue/2011/0725

Merging 18 segments to /user/crawler/newcrawl_3/segments/20110725175250
SegmentMerger:   adding /user/crawler/crawl1/segments/20110722163308/content
SegmentMerger:   adding /user/crawler/crawl1/segments/20110722163308/crawl_fetch
SegmentMerger:   adding /user/crawler/crawl1/segments/20110722163308/crawl_generate
SegmentMerger:   adding /user/crawler/crawl1/segments/20110722163308/crawl_parse
SegmentMerger:   adding /user/crawler/crawl1/segments/20110722163308/parse_data
SegmentMerger:   adding /user/crawler/crawl1/segments/20110722163308/parse_text
SegmentMerger:   adding /user/crawler/crawl2/segments/20110531151117/content
SegmentMerger:   adding /user/crawler/crawl2/segments/20110531151117/crawl_fetch
SegmentMerger:   adding /user/crawler/crawl2/segments/20110531151117/crawl_generate
SegmentMerger:   adding /user/crawler/crawl2/segments/20110531151117/crawl_parse
SegmentMerger:   adding /user/crawler/crawl2/segments/20110531151117/parse_data
SegmentMerger:   adding /user/crawler/crawl2/segments/20110531151117/parse_text
SegmentMerger:   adding /user/crawler/crawl2/segments/20110531151312/content
SegmentMerger:   adding /user/crawler/crawl2/segments/20110531151312/crawl_fetch
SegmentMerger:   adding /user/crawler/crawl2/segments/20110531151312/crawl_generate
SegmentMerger:   adding /user/crawler/crawl2/segments/20110531151312/crawl_parse
SegmentMerger:   adding /user/crawler/crawl2/segments/20110531151312/parse_data
SegmentMerger:   adding /user/crawler/crawl2/segments/20110531151312/parse_text
SegmentMerger: using segment data from:
Exception in thread "main" java.io.IOException: No input paths specified in job
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:152)
  at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
  at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
  at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
  at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:638)
  at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:683)
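
The 18 "segments" listed above are the content, crawl_fetch, crawl_generate, crawl_parse, parse_data and parse_text part directories of three real segments (20110722163308, 20110531151117 and 20110531151312), which indicates the segment list handed to the merger was globbed one level too deep (presumably something like crawl*/segments/*/* instead of crawl*/segments/*). As a result "using segment data from:" stays empty and the MapReduce job is submitted with no input paths. What this step presumably means to run is something along these lines (paths taken from the log; the exact invocation inside the Crawlzilla recrawl script is an assumption):

  # hypothetical reconstruction of the merge step: pass the segment
  # directories themselves, not their part subdirectories
  /opt/crawlzilla/nutch/bin/nutch mergesegs /user/crawler/newcrawl_3/segments \
      /user/crawler/crawl1/segments/20110722163308 \
      /user/crawler/crawl2/segments/20110531151117 \
      /user/crawler/crawl2/segments/20110531151312
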
Update segments
LinkDb: starting at 2011-07-25 17:52:55
LinkDb: linkdb: /user/crawler/newcrawl_3/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: java.io.IOException: No input paths specified in job
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:152)
  at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
  at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
  at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
  at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
  at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
  at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
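
Since the merge never produced a segment under /user/crawler/newcrawl_3/segments, the link inversion that follows has nothing to read and dies with the same "No input paths specified" error. The step corresponds to an invertlinks call roughly like the one below (linkdb path from the log; normalization and filtering are left at their defaults, matching the "URL normalize: true" and "URL filter: true" lines; the exact script invocation is an assumption):

  # hypothetical sketch of the link-inversion step over the merged segments
  /opt/crawlzilla/nutch/bin/nutch invertlinks /user/crawler/newcrawl_3/linkdb \
      -dir /user/crawler/newcrawl_3/segments
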

Index segments
ls: Cannot access /user/crawler/newcrawl_3/segments/*: No such file or directory.
[check] /opt/crawlzilla/nutch/bin/nutch index /user/crawler/newcrawl_3/newindexes /user/crawler/newcrawl_3/crawldb /user/crawler/newcrawl_3/linkdb 
Usage: Indexer <index> <crawldb> <linkdb> <segment> ...
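
The ls on /user/crawler/newcrawl_3/segments/* fails for the same reason, so the script ends up invoking the indexer without any segment argument (visible in the [check] echo above, where the command stops after the linkdb path) and Indexer just prints its usage line. With a valid merged segment the call would look roughly like this (segment name taken from the merge step above; passing a single merged segment is an assumption):

  # hypothetical sketch of the indexing step once a merged segment exists
  /opt/crawlzilla/nutch/bin/nutch index /user/crawler/newcrawl_3/newindexes \
      /user/crawler/newcrawl_3/crawldb /user/crawler/newcrawl_3/linkdb \
      /user/crawler/newcrawl_3/segments/20110725175250
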
De-duplicate indexes
Dedup: starting at 2011-07-25 17:53:02
Dedup: adding indexes in: /user/crawler/newcrawl_3/newindexes
DeleteDuplicates: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://crawlweb1:9000/user/crawler/newcrawl_3/newindexes
  at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
  at org.apache.nutch.indexer.DeleteDuplicates$InputFormat.getSplits(DeleteDuplicates.java:149)
  at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
  at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
  at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
  at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
  at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:451)
  at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:519)
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
  at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:503)
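
Deduplication fails in turn because the indexer never created /user/crawler/newcrawl_3/newindexes on HDFS. The step itself is the standard Nutch duplicate-deletion pass over the new indexes, roughly as follows (exact invocation is an assumption):

  # hypothetical sketch of the de-duplication step
  /opt/crawlzilla/nutch/bin/nutch dedup /user/crawler/newcrawl_3/newindexes
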

Merge indexes
IndexMerger: starting at 2011-07-25 17:53:07
IndexMerger: merging indexes to: /user/crawler/newcrawl_3/index
IndexMerger: finished at 2011-07-25 17:53:07, elapsed: 00:00:00
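
IndexMerger reports success after zero seconds only because there is nothing to merge, so /user/crawler/newcrawl_3/index comes out empty as well. The step is presumably an index merge along these lines (assumption):

  # hypothetical sketch of the index-merge step
  /opt/crawlzilla/nutch/bin/nutch merge /user/crawler/newcrawl_3/index \
      /user/crawler/newcrawl_3/newindexes
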
Some stats
CrawlDb statistics start: /user/crawler/newcrawl_3/crawldb
Statistics for CrawlDb: /user/crawler/newcrawl_3/crawldb
TOTAL urls: 514
retry 0:  514
min score:  0.0
avg score:  0.010715953
max score:  1.076
status 1 (db_unfetched):  454
status 2 (db_fetched):  52
status 3 (db_gone): 2
status 5 (db_redir_perm): 6
CrawlDb statistics: done
finish on : /home/crawler/newcrawl_3
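
The statistics block above is ordinary readdb -stats output and shows that the crawldb itself is intact (514 URLs, 52 of them fetched), so only the segment merge and everything downstream of it went wrong. The stats step corresponds to roughly the following (exact invocation is an assumption):

  # hypothetical sketch of the statistics step
  /opt/crawlzilla/nutch/bin/nutch readdb /user/crawler/newcrawl_3/crawldb -stats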