Version 2 (modified by jazz, 14 years ago)

Lab 4: Compiling a Hadoop MapReduce Java Program

Practice 1 : Word Count #1 (Basic)

  • Create two sample input files and upload them to HDFS:
$ cd /opt/hadoop
$ mkdir lab4_input
$ echo "I like NCHC Cloud Course." > lab4_input/input1
$ echo "I like nchc Cloud Course, and we enjoy this course." > lab4_input/input2
$ bin/hadoop fs -put lab4_input lab4_input
$ bin/hadoop fs -ls lab4_input
  • Download WordCount.java and save it to /opt/hadoop:
    ~$ cd /opt/hadoop
    /opt/hadoop$ wget http://secuse.nchc.org.tw/class/WordCount.java
    
  • Compile WordCount.java, package it as a jar, and run it with the hadoop jar command:
$ mkdir MyJava
$ javac -classpath hadoop-*-core.jar -d MyJava WordCount.java
$ jar -cvf wordcount.jar -C MyJava .
$ bin/hadoop jar wordcount.jar WordCount lab4_input/ lab4_out1/
$ bin/hadoop fs -cat lab4_out1/part-00000
  • Results in lab4_out1. You should see output like this:
    Cloud    2
    Course,  1
    Course.  1
    I        2
    NCHC     1
    and      1
    course.  1
    enjoy    1
    like     2
    nchc     1
    this     1
    we       1
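
The counts above can be cross-checked locally, without a cluster, using standard Unix tools. This is only a sketch and not what WordCount.java does internally: it assumes the job splits each line on whitespace alone, which the output suggests because `Course,` and `Course.` are counted as separate words. The function name `local_wordcount` and the scratch directory are hypothetical and not part of the lab.

```shell
#!/bin/sh
# Hypothetical helper (not part of the lab): count whitespace-separated
# tokens in the given files and print "word<TAB>count", one per line.
local_wordcount() {
    # awk splits each line on runs of blanks, so punctuation stays attached.
    # LC_ALL=C makes sort compare bytes, putting uppercase before lowercase.
    cat "$@" \
        | awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
               END { for (w in count) print w "\t" count[w] }' \
        | LC_ALL=C sort
}

# Recreate the two lab input files in a scratch directory and count them.
scratch=$(mktemp -d)
echo "I like NCHC Cloud Course." > "$scratch/input1"
echo "I like nchc Cloud Course, and we enjoy this course." > "$scratch/input2"
local_wordcount "$scratch/input1" "$scratch/input2"
rm -r "$scratch"
```

If the whitespace-splitting assumption holds, the local output should list the same twelve words and counts as lab4_out1, in the same byte order.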
    

Practice 2 : Word Count #2 (Advanced)

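  • Create a pattern file listing the patterns to skip (an escaped period and an escaped comma) and upload it to HDFS:
$ echo "\." > pattern.txt && echo "\," >> pattern.txt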
$ bin/hadoop fs -put pattern.txt ./
$ mkdir MyJava2
  • Download WordCount2.java and save it to /opt/hadoop:
    ~$ cd /opt/hadoop
    /opt/hadoop$ wget http://trac.nchc.org.tw/cloud/raw-attachment/wiki/Hadoop_Lab4/WordCount2.java
    
  • Compile WordCount2.java, package it as a jar, and run it with the -skip option:
$ javac -classpath hadoop-*-core.jar -d MyJava2 WordCount2.java
$ jar -cvf wordcount2.jar -C MyJava2 .
$ bin/hadoop jar wordcount2.jar WordCount2 lab4_input lab4_out2 -skip pattern.txt
$ bin/hadoop fs -cat lab4_out2/part-00000
  • Results in lab4_out2. You should see output like this:
    Cloud    2
    Course   2
    I        2
    NCHC     1
    and      1
    course   1
    enjoy    1
    like     2
    nchc     1
    this     1
    we       1
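
The effect of the -skip option can be approximated locally as well. The sketch below assumes that WordCount2 deletes every match of each pattern in pattern.txt from a line before splitting it into words; that assumption is consistent with the output above but has not been checked against the WordCount2.java source. For the two patterns used here, deleting matches is the same as deleting every period and comma, so `tr -d` stands in for the pattern matching. The name `local_wordcount_skip` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: delete periods and commas, then count the
# whitespace-separated tokens and print "word<TAB>count" per line.
local_wordcount_skip() {
    # tr -d '.,' stands in for removing matches of the patterns \. and \,
    # LC_ALL=C makes sort compare bytes, putting uppercase before lowercase.
    cat "$@" \
        | tr -d '.,' \
        | awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
               END { for (w in count) print w "\t" count[w] }' \
        | LC_ALL=C sort
}

# Recreate the two lab input files in a scratch directory and count them.
scratch=$(mktemp -d)
echo "I like NCHC Cloud Course." > "$scratch/input1"
echo "I like nchc Cloud Course, and we enjoy this course." > "$scratch/input2"
local_wordcount_skip "$scratch/input1" "$scratch/input2"
rm -r "$scratch"
```

With the punctuation removed, `Course,` and `Course.` collapse into a single `Course` entry with a count of 2, while `course` stays separate because the comparison is still case sensitive.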
    
  • Now run the same example with case sensitivity turned off and the skip patterns still applied:
    /opt/hadoop$ echo "\," > pattern.txt && echo "\." >> pattern.txt
    /opt/hadoop$ bin/hadoop jar wordcount2.jar WordCount2 -Dwordcount.case.sensitive=false lab4_input lab4_out3 -skip pattern.txt
    /opt/hadoop$ bin/hadoop fs -cat lab4_out3/part-00000
    
  • Results in lab4_out3. You should see output like this:
    and      1
    cloud    2
    course   3
    enjoy    1
    i        2
    like     2
    nchc     2
    this     1
    we       1
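
The case-insensitive run can be approximated locally in the same spirit. The sketch below assumes that setting wordcount.case.sensitive=false makes the job lowercase each line before the skip patterns are applied and the words are counted; the all-lowercase output above is consistent with that, but the assumption has not been checked against the WordCount2.java source. The name `local_wordcount_nocase` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: lowercase the text, delete periods and commas,
# then count the whitespace-separated tokens as "word<TAB>count".
local_wordcount_nocase() {
    # tr '[:upper:]' '[:lower:]' mimics turning case sensitivity off.
    # tr -d '.,' stands in for removing matches of the patterns \. and \,
    cat "$@" \
        | tr '[:upper:]' '[:lower:]' \
        | tr -d '.,' \
        | awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
               END { for (w in count) print w "\t" count[w] }' \
        | LC_ALL=C sort
}

# Recreate the two lab input files in a scratch directory and count them.
scratch=$(mktemp -d)
echo "I like NCHC Cloud Course." > "$scratch/input1"
echo "I like nchc Cloud Course, and we enjoy this course." > "$scratch/input2"
local_wordcount_nocase "$scratch/input1" "$scratch/input2"
rm -r "$scratch"
```

Lowercasing merges `NCHC` with `nchc` and `Course` with `course`, which is why the list shrinks from eleven entries to nine.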