hadoop streaming + perl
Noted from reading only; not yet tested by me.
Note:
At the moment streaming does not support Linux pipes (that is, pipelines like cat | wc -l) as the mapper/reducer command, but that does not stop us from using Perl or Python one-line commands!! The original wording is:
Can I use UNIX pipes? For example, will -mapper "cut -f1 | sed s/foo/bar/g" work?
Currently this does not work and gives an "java.io.IOException: Broken pipe" error. This is probably a bug that needs to be investigated.
But if you are a die-hard Linux shell pipe fan, see the following:
$> perl -e 'open( my $fh, "grep -v null tt |sed -n 1,5p |");while ( <$fh> ) {print;} '   # but this did not pass my testing !!
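Another workaround that is often suggested (a sketch only, not tested here; the script name pipeline.sh and the output path streaming_pipe are my own illustration) is to hide the pipeline inside a small shell script and ship it with -file, so streaming only ever launches a single command:

$ cat pipeline.sh
#!/bin/sh
# drop rows containing "null" and keep only the first 5 lines;
# the pipe runs entirely inside this child shell, where streaming never sees it
grep -v null | sed -n 1,5p

$ hadoop jar hadoop-0.18.3-streaming.jar -input file:///data/hadoop/lky/jar/tt -mapper "sh pipeline.sh" -file pipeline.sh -reducer /bin/cat -output file:///tmp/lky/streaming_pipe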
Environment: hadoop-0.18.3
$> find . -type f -name "*streaming*.jar"
./contrib/streaming/hadoop-0.18.3-streaming.jar
Test data (tab-separated columns):
$ head tt
null false 3702 208100
6005100 false 70 13220
6005127 false 24 4640
6005160 false 25 4820
6005161 false 20 3620
6005164 false 14 1280
6005165 false 37 7080
6005168 false 104 20140
6005169 false 35 6680
6005240 false 169 32140
......
Run:
c1=" perl -ne 'if(/.*\t(.*)/){\$sum+=\$1;}END{print \"\$sum\";}' "   # note: inside the double quotes, $ must be written as \$ and " as \"
echo $c1
hadoop jar hadoop-0.18.3-streaming.jar -input file:///data/hadoop/lky/jar/tt -mapper "/bin/cat" -reducer "$c1" -output file:///tmp/lky/streamingx8
Result:
cat /tmp/lky/streamingx8/*
1166480
Local run output:
perl -ne 'if(/.*\t(.*)/){$sum+=$1;}END{print $sum;}' < tt
1166480
The result is correct!!!!
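For a bigger input it would be more map/reduce-like to let each mapper emit a partial sum for its split and have a single reducer add the partial sums, instead of pushing every line through /bin/cat. A sketch under the same escaping rules (not run here; the variable m1 and the output path streamingx9 are my own):

m1=" perl -ne 'if(/.*\t(.*)/){\$sum+=\$1;}END{print \"partial\t\$sum\n\";}' "   # each mapper prints one line: key "partial", value = sum of its split
r1=" perl -ne 'if(/.*\t(.*)/){\$sum+=\$1;}END{print \"\$sum\";}' "              # same reducer as above: adds up the partial sums
hadoop jar hadoop-0.18.3-streaming.jar -input file:///data/hadoop/lky/jar/tt -mapper "$m1" -reducer "$r1" -numReduceTasks 1 -output file:///tmp/lky/streamingx9

With -numReduceTasks 1 the single reducer sees every partial sum, so the final output should still be the one total.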
Built-in documentation of the command:
-bash-3.00$ hadoop jar hadoop-0.18.3-streaming.jar -info
09/09/25 14:50:12 ERROR streaming.StreamJob: Missing required option -input
Usage: $HADOOP_HOME/bin/hadoop [--config dir] jar \
          $HADOOP_HOME/hadoop-streaming.jar [options]
Options:
  -input    <path>                   DFS input file(s) for the Map step
  -output   <path>                   DFS output directory for the Reduce step
  -mapper   <cmd|JavaClassName>      The streaming command to run
  -combiner <JavaClassName>          Combiner has to be a Java class
  -reducer  <cmd|JavaClassName>      The streaming command to run
  -file     <file>                   File/dir to be shipped in the Job jar file
  -dfs      <h:p>|local              Optional. Override DFS configuration
  -jt       <h:p>|local              Optional. Override JobTracker configuration
  -additionalconfspec specfile       Optional.
  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName  Optional.
  -outputformat TextOutputFormat(default)|JavaClassName  Optional.
  -partitioner JavaClassName         Optional.
  -numReduceTasks <num>              Optional.
  -inputreader <spec>                Optional.
  -jobconf  <n>=<v>                  Optional. Add or override a JobConf property
  -cmdenv   <n>=<v>                  Optional. Pass env.var to streaming commands
  -mapdebug <path>                   Optional. To run this script when a map task fails
  -reducedebug <path>                Optional. To run this script when a reduce task fails
  -cacheFile fileNameURI
  -cacheArchive fileNameURI
  -verbose
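If the \$ and \" escaping gets unwieldy, the -file option listed above can also ship a standalone Perl script to the task nodes instead of an inline one-liner. A sketch only (the script name sum.pl and the output path streaming_script are my own, and I have not run this):

$ cat sum.pl
#!/usr/bin/perl
# sum.pl: add up the field after the last tab of every input line
my $sum = 0;
while (<STDIN>) {
    $sum += $1 if /.*\t(.*)/;
}
print "$sum\n";

$ hadoop jar hadoop-0.18.3-streaming.jar -input file:///data/hadoop/lky/jar/tt -mapper /bin/cat -reducer "perl sum.pl" -file sum.pl -output file:///tmp/lky/streaming_script

This should give the same total as the inline version, with no escaping inside the reducer command at all.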