wiki:waue/2009/0922

hadoop streaming + perl

僅看到,未測

注意

目前 streaming 對 linux pipe #也就是 cat |wc -l 這樣的管道 不支持,但不妨礙我們使用perl,python 行式命令!! 原話是 :

Can I use UNIX pipes? For example, will -mapper "cut -f1 | sed s/foo/bar/g" work?

Currently this does not work and gives an "java.io.IOException: Broken pipe" error. This is probably a bug that needs to be investigated.

但如果你是強烈的 linux shell pipe 發燒友 ! 參考下面

$> perl -e 'open( my $fh, "grep -v null tt |sed -n 1,5p |");while ( <$fh> ) {print;} '
  #不過我沒測試通過 !!

環境 :hadoop-0.18.3

$> find . -type f -name "*streaming*.jar"
./contrib/streaming/hadoop-0.18.3-streaming.jar

測試數據:

$ head tt 
null    false    3702    208100
6005100    false    70    13220
6005127    false    24    4640
6005160    false    25    4820
6005161    false    20    3620
6005164    false    14    1280
6005165    false    37    7080
6005168    false    104    20140
6005169    false    35    6680
6005240    false    169    32140
......

運行:

c1="  perl -ne  'if(/.*\t(.*)/){\$sum+=\$1;}END{print \"\$sum\";}'  "
# 注意 這裡 $ 要寫成 \$    " 寫成 \"

echo $c1; 

hadoop jar hadoop-0.18.3-streaming.jar
   -input file:///data/hadoop/lky/jar/tt 
   -mapper   "/bin/cat" 
   -reducer "$c1" 
   -output file:///tmp/lky/streamingx8

結果:

cat /tmp/lky/streamingx8/*
1166480

本地運行輸出:

perl -ne 'if(/.*"t(.*)/){$sum+=$1;}END{print $sum;}' < tt
1166480

結果正確!!!!

命令自帶文檔:

-bash-3.00$ hadoop jar hadoop-0.18.3-streaming.jar -info
09/09/25 14:50:12 ERROR streaming.StreamJob: Missing required option -input
Usage: $HADOOP_HOME/bin/hadoop [--config dir] jar \
          $HADOOP_HOME/hadoop-streaming.jar [options]
Options:
  -input    <path>     DFS input file(s) for the Map step
  -output   <path>     DFS output directory for the Reduce step
  -mapper   <cmd|JavaClassName>      The streaming command to run
  -combiner <JavaClassName> Combiner has to be a Java class
  -reducer  <cmd|JavaClassName>      The streaming command to run
  -file     <file>     File/dir to be shipped in the Job jar file
  -dfs    <h:p>|local  Optional. Override DFS configuration
  -jt     <h:p>|local  Optional. Override JobTracker configuration
  -additionalconfspec specfile  Optional.
  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
  -outputformat TextOutputFormat(default)|JavaClassName  Optional.
  -partitioner JavaClassName  Optional.
  -numReduceTasks <num>  Optional.
  -inputreader <spec>  Optional.
  -jobconf  <n>=<v>    Optional. Add or override a JobConf property
  -cmdenv   <n>=<v>    Optional. Pass env.var to streaming commands
  -mapdebug <path>  Optional. To run this script when a map task fails 
  -reducedebug <path>  Optional. To run this script when a reduce task fails 
  -cacheFile fileNameURI
  -cacheArchive fileNameURI
  -verbose

參考:

Last modified 15 years ago Last modified on Sep 28, 2009, 9:25:07 AM