= hadoop streaming + perl =

(Notes only; not fully tested.)

Note: Hadoop Streaming currently does not accept a Linux pipe (i.e. something like `cat | wc -l`) as the mapper/reducer command, but that does not stop us from using perl/python one-liner commands! The original FAQ entry reads:

  Can I use UNIX pipes? For example, will -mapper "cut -f1 | sed s/foo/bar/g" work?
  Currently this does not work and gives an "java.io.IOException: Broken pipe" error. This is probably a bug that needs to be investigated.

If you are nonetheless a die-hard Linux shell pipe fan, you can try something like the following (though I did not get it to pass in my tests):
{{{
$> perl -e 'open( my $fh, "grep -v null tt | sed -n 1,5p |"); while ( <$fh> ) { print; }'
}}}

Environment: hadoop-0.18.3
{{{
$> find . -type f -name "*streaming*.jar"
./contrib/streaming/hadoop-0.18.3-streaming.jar
}}}

Test data (tab-delimited):
{{{
$ head tt
null     false  3702  208100
6005100  false  70    13220
6005127  false  24    4640
6005160  false  25    4820
6005161  false  20    3620
6005164  false  14    1280
6005165  false  37    7080
6005168  false  104   20140
6005169  false  35    6680
6005240  false  169   32140
......
}}}

Run:
{{{
# Note: inside the double quotes, $ must be written as \$ and " as \"
c1="perl -ne 'if(/.*\t(.*)/){\$sum+=\$1;}END{print \"\$sum\";}'"
echo $c1

hadoop jar hadoop-0.18.3-streaming.jar \
    -input   file:///data/hadoop/lky/jar/tt \
    -mapper  "/bin/cat" \
    -reducer "$c1" \
    -output  file:///tmp/lky/streamingx8
}}}

Result:
{{{
$ cat /tmp/lky/streamingx8/*
1166480
}}}

Local run for comparison:
{{{
$ perl -ne 'if(/.*\t(.*)/){$sum+=$1;}END{print $sum;}' < tt
1166480
}}}

The results match!

Built-in usage documentation:
{{{
-bash-3.00$ hadoop jar hadoop-0.18.3-streaming.jar -info
09/09/25 14:50:12 ERROR streaming.StreamJob: Missing required option -input
Usage: $HADOOP_HOME/bin/hadoop [--config dir] jar \
          $HADOOP_HOME/hadoop-streaming.jar [options]
Options:
  -input    <path>                DFS input file(s) for the Map step
  -output   <path>                DFS output directory for the Reduce step
  -mapper   <cmd|JavaClassName>   The streaming command to run
  -combiner <JavaClassName>       Combiner has to be a Java class
  -reducer  <cmd|JavaClassName>   The streaming command to run
  -file     <file>                File/dir to be shipped in the Job jar file
  -dfs      <h:p>|local           Optional. Override DFS configuration
  -jt       <h:p>|local           Optional. Override JobTracker configuration
  -additionalconfspec specfile    Optional.
  -inputformat  TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName  Optional.
  -outputformat TextOutputFormat(default)|JavaClassName  Optional.
  -partitioner  JavaClassName     Optional.
  -numReduceTasks <num>           Optional.
  -inputreader  <spec>            Optional.
  -jobconf  <n>=<v>               Optional. Add or override a JobConf property
  -cmdenv   <n>=<v>               Optional. Pass env.var to streaming commands
  -mapdebug <path>                Optional. To run this script when a map task fails
  -reducedebug <path>             Optional. To run this script when a reduce task fails
  -cacheFile fileNameURI
  -cacheArchive fileNameURI
  -verbose
}}}

References:
* [http://www.blogjava.net/Skynet/archive/2009/09/25/296420.html hadoop streaming + perl]
* [http://hadoop.apache.org/common/docs/r0.18.3/streaming.html hadoop streaming howto]