Changes between Version 1 and Version 2 of waue/2009/0922


Timestamp: Sep 28, 2009, 9:24:01 AM
Author: waue


= hadoop streaming + perl =

Only read about; not yet tested.

Note:
  Currently streaming does not support Linux pipes (that is, pipelines such as {{{cat | wc -l}}}), but this does not stop us from using perl/python one-line commands!
  The original wording is:

  Can I use UNIX pipes? For example, will -mapper "cut -f1 | sed s/foo/bar/g" work?
    Currently this does not work and gives an "java.io.IOException: Broken pipe" error.
    This is probably a bug that needs to be investigated.
  But if you are a die-hard Linux shell pipe fan, see below:

{{{
$> perl -e 'open( my $fh, "grep -v null tt |sed -n 1,5p |"); while ( <$fh> ) { print; }'
  # However, I could not get this to pass my tests!!
}}}
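Since the inline pipe trick above is untested, a likely more robust workaround (a sketch, not verified on a cluster; {{{pipeline.sh}}} is a hypothetical name) is to wrap the Unix pipeline in a standalone script, so that streaming runs a single executable instead of a pipe:

```shell
# Sketch: wrap the pipeline in one script; the filter mirrors the example above.
cat > pipeline.sh <<'EOF'
#!/bin/sh
grep -v null | sed -n 1,5p
EOF
chmod +x pipeline.sh

# Local smoke test: the null row is dropped, at most 5 rows pass through.
printf 'null\tfalse\t1\t2\n6005100\tfalse\t70\t13220\n' | ./pipeline.sh
```

On the cluster this could then be passed as {{{-mapper pipeline.sh -file pipeline.sh}}} (the -file option appears in the built-in documentation below), though I have not verified that here.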
Environment: hadoop-0.18.3

{{{
$> find . -type f -name "*streaming*.jar"
./contrib/streaming/hadoop-0.18.3-streaming.jar
}}}

Test data:

{{{
$ head tt
null    false    3702    208100
6005100    false    70    13220
6005127    false    24    4640
6005160    false    25    4820
6005161    false    20    3620
6005164    false    14    1280
6005165    false    37    7080
6005168    false    104    20140
6005169    false    35    6680
6005240    false    169    32140
......
}}}

Run:
{{{
#!sh

c1="  perl -ne 'if(/.*\t(.*)/){\$sum+=\$1;}END{print \"\$sum\";}'  "
# Note: here $ must be written as \$ and " as \"

echo $c1  # prints:  perl -ne 'if(/.*\t(.*)/){$sum+=$1;}END{print "$sum";}'
hadoop jar hadoop-0.18.3-streaming.jar \
   -input file:///data/hadoop/lky/jar/tt \
   -mapper "/bin/cat" \
   -reducer "$c1" \
   -output file:///tmp/lky/streamingx8
}}}

Result:
{{{
cat /tmp/lky/streamingx8/*
1166480
}}}

Local run output:

{{{
perl -ne 'if(/.*\t(.*)/){$sum+=$1;}END{print $sum;}' < tt
1166480
}}}

The result is correct!


Built-in command documentation:
{{{
-bash-3.00$ hadoop jar hadoop-0.18.3-streaming.jar -info
09/09/25 14:50:12 ERROR streaming.StreamJob: Missing required option -input
Usage: $HADOOP_HOME/bin/hadoop [--config dir] jar \
          $HADOOP_HOME/hadoop-streaming.jar [options]
Options:
  -input    <path>     DFS input file(s) for the Map step
  -output   <path>     DFS output directory for the Reduce step
  -mapper   <cmd|JavaClassName>      The streaming command to run
  -combiner <JavaClassName> Combiner has to be a Java class
  -reducer  <cmd|JavaClassName>      The streaming command to run
  -file     <file>     File/dir to be shipped in the Job jar file
  -dfs    <h:p>|local  Optional. Override DFS configuration
  -jt     <h:p>|local  Optional. Override JobTracker configuration
  -additionalconfspec specfile  Optional.
  -inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
  -outputformat TextOutputFormat(default)|JavaClassName  Optional.
  -partitioner JavaClassName  Optional.
  -numReduceTasks <num>  Optional.
  -inputreader <spec>  Optional.
  -jobconf  <n>=<v>    Optional. Add or override a JobConf property
  -cmdenv   <n>=<v>    Optional. Pass env.var to streaming commands
  -mapdebug <path>  Optional. To run this script when a map task fails
  -reducedebug <path>  Optional. To run this script when a reduce task fails
  -cacheFile fileNameURI
  -cacheArchive fileNameURI
  -verbose
}}}

References:

 * [http://www.blogjava.net/Skynet/archive/2009/09/25/296420.html hadoop streaming + perl]
 * [http://hadoop.apache.org/common/docs/r0.18.3/streaming.html hadoop stream howto]