wiki:HadoopWorkshopHandsOn

Version 4 (modified by jazz, 17 years ago)

Hadoop Hands-on Labs (1)
1. Basic DFS commands / Setting up a basic Hadoop DFS test environment
Hadoop Hands-on Labs (2)
1. MapReduce programming exercises
2. Deploying Hadoop at scale

Hadoop Hands-on Labs (1)

Basic DFS commands / Setting up a basic Hadoop DFS test environment

Download hadoop-0.18.2:

$ cd ~
$ wget http://ftp.twaren.net/Unix/Web/apache/hadoop/core/hadoop-0.18.2/hadoop-0.18.2.tar.gz
$ tar zxvf hadoop-0.18.2.tar.gz

Hadoop uses SSH for its internal connections, so an SSH key exchange is needed first:
```
~$ ssh-keygen
~$ cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
```
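The `cp` above replaces whatever `~/.ssh/authorized_keys` already contains, which is fine on a fresh workshop account but would discard keys on an account that already has some. A minimal sketch of an append-only alternative follows; `authorize_key` is a hypothetical helper name (not part of OpenSSH or Hadoop), and both file paths are passed in as arguments.

```shell
# authorize_key is a hypothetical helper, not part of OpenSSH or Hadoop.
# Usage: authorize_key <public-key-file> <authorized_keys-file>
authorize_key() {
    pub="$1"
    auth="$2"
    mkdir -p "$(dirname "$auth")"
    touch "$auth"
    # Append the key only if that exact line is not already present,
    # so existing keys are kept and repeated runs add nothing.
    grep -qxF -- "$(cat "$pub")" "$auth" || cat "$pub" >> "$auth"
    # Restrict permissions; sshd may reject a file that others can write to.
    chmod 600 "$auth"
}
```

It would be called as `authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys`. If the key was generated with an empty passphrase, `ssh localhost` should then log in without prompting for a password.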

The JAVA_HOME environment variable must be set before hadoop namenode can run:

$ echo "export JAVA_HOME=/usr/lib/jvm/java-6-sun" >> ~/.bash_profile
$ cd ~/hadoop-0.18.2

Edit conf/hadoop-env.sh (set HADOOP_HOME to your own Hadoop installation directory):

export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/home/jazz/hadoop-0.18.2/
export HADOOP_CONF_DIR=$HADOOP_HOME/conf

Edit conf/hadoop-site.xml and add the following settings inside the <configuration> element:

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000/</value>
  <description>
    The name of the default file system. Either the literal string
    "local" or a host:port for NDFS.
  </description>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
  <description>
    The host and port that the MapReduce job tracker runs at. If
    "local", then jobs are run in-process as a single map and
    reduce task.
  </description>
</property>

Start Hadoop with these two commands:

~/hadoop-0.18.2$ bin/hadoop namenode -format
~/hadoop-0.18.2$ bin/start-all.sh

Once startup completes, Hadoop's three status web pages can be viewed in a browser.

You can also put something onto HDFS and take a look:

~/hadoop-0.18.2$ bin/hadoop dfs -put conf conf
~/hadoop-0.18.2$ bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x   - jazz supergroup          0 2008-11-04 15:56 /user/jazz/conf
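Beyond `-put` and `-ls`, a few more DFS shell commands are worth trying at this point. The sketch below wraps them in a hypothetical helper, `dfs_tour`; the `HADOOP` variable and the local copy path under /tmp are assumptions for illustration, and the commands expect the `conf` directory uploaded above.

```shell
# HADOOP is an assumed variable naming the launcher script; adjust as needed.
HADOOP="${HADOOP:-bin/hadoop}"

# dfs_tour is a hypothetical helper, not part of Hadoop.
dfs_tour() {
    # List the directory that was just uploaded.
    "$HADOOP" dfs -ls conf
    # Print one file stored in HDFS to the terminal.
    "$HADOOP" dfs -cat conf/hadoop-site.xml
    # Copy a file from HDFS back to the local disk (assumed target path).
    "$HADOOP" dfs -get conf/hadoop-site.xml /tmp/hadoop-site.xml.copy
    # Remove the uploaded directory recursively.
    "$HADOOP" dfs -rmr conf
}
```

Run it from `~/hadoop-0.18.2` as `dfs_tour`. Because the last step deletes `conf` from HDFS, the directory has to be uploaded again before it can be used as job input.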

Hadoop Hands-on Labs (2)

MapReduce programming exercises

Run the WordCount example

~/hadoop-0.18.2$ bin/hadoop fs -put conf conf
~/hadoop-0.18.2$ bin/hadoop fs -ls
Found 1 items
drwxr-xr-x   - jazz supergroup          0 2008-11-05 19:34 /user/jazz/conf
~/hadoop-0.18.2$ bin/hadoop jar /home/jazz/hadoop-0.18.2/hadoop-0.18.2-examples.jar wordcount
ERROR: Wrong number of parameters: 0 instead of 2.
wordcount [-m <maps>] [-r <reduces>] <input> <output>
Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
~/hadoop-0.18.2$ bin/hadoop jar /home/jazz/hadoop-0.18.2/hadoop-0.18.2-examples.jar wordcount conf output
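As the usage message in the transcript shows, wordcount needs both an input and an output directory. The job refuses to start if the output directory already exists, so a second run has to remove it first. The following is a rough sketch of that cycle; `rerun_wordcount`, `HADOOP` and `EXAMPLES_JAR` are hypothetical names introduced here, and `part-00000` is assumed to be the name of the first reducer's output file.

```shell
# Assumed locations; override them through the environment if yours differ.
HADOOP="${HADOOP:-bin/hadoop}"
EXAMPLES_JAR="${EXAMPLES_JAR:-hadoop-0.18.2-examples.jar}"

# rerun_wordcount is a hypothetical helper, not part of Hadoop.
# Usage: rerun_wordcount <input-dir> <output-dir>
rerun_wordcount() {
    input="$1"
    output="$2"
    # Remove output left by a previous run; ignore the error if there is none.
    "$HADOOP" fs -rmr "$output" 2>/dev/null
    # Run the example job and stop here if it fails.
    "$HADOOP" jar "$EXAMPLES_JAR" wordcount "$input" "$output" || return 1
    # Show the first lines of the result (word, tab, count).
    "$HADOOP" fs -cat "$output/part-00000" | head -n 20
}
```

From `~/hadoop-0.18.2` this would be invoked as `rerun_wordcount conf output`.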

WordCount source code

~/hadoop-0.18.2/$ vi src/examples/org/apache/hadoop/examples/WordCount.java

Demonstrating how to debug WordCount.java: deliberately throw an IOException so that the mapper fails
```
      throw new IOException("SUICIDE");
```
Rebuild hadoop-0.18.2-examples.jar with ant
```
~/hadoop-0.18.2/$ ant examples
```
How it works:
- The key is of type Text, so OutputKeyClass has to be set to Text
```
    conf.setOutputKeyClass(Text.class);
```
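- A detailed explanation is in the official documentation: http://hadoop.apache.org/core/docs/r0.18.2/mapred_tutorial.html
- Input and Output Formats
  - Input and output are usually plain text, so the defaults are TextInputFormat and TextOutputFormat
  - If the input and output are binary, SequenceFileInputFormat and SequenceFileOutputFormat have to be used as the Map/Reduce job's input and output formats instead
- Input -> InputSplit -> RecordReader
  - Hadoop cuts the input into many InputSplit chunks, so you may run into the problem that part of the data you need to process sits in another InputSplit
- The recommended number of reducers is 0.95 * num_nodes * mapred.tasktracker.reduce.tasks.maximum. The 0.95 factor keeps about 5% of the capacity in reserve to absorb the impact of failures on other nodes.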
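The reducer-count rule of thumb is simple arithmetic, sketched here with a hypothetical helper, `recommended_reducers`. Its two arguments (node count and per-node task maximum) are illustrative inputs, and the result is rounded down because only whole reducers can be configured.

```shell
# recommended_reducers is a hypothetical helper, not part of Hadoop.
# Usage: recommended_reducers <number-of-nodes> <max-tasks-per-node>
recommended_reducers() {
    nodes="$1"
    slots_per_node="$2"
    # Integer arithmetic: multiply by 95 first, then divide by 100 (rounds down).
    echo $(( nodes * slots_per_node * 95 / 100 ))
}

# Example: 10 nodes with 2 task slots each.
recommended_reducers 10 2   # prints 19
```

For the single-node setup in this lab the formula yields a value below the slot count, so the calculation only becomes interesting on a real cluster.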
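What about developers who cannot write Java?
- Option 1: use Hadoop Streaming
  - Its ability to handle binary data is still limited, so it is best used for plain-text processing.
  - If hadoop-site.xml keeps its original contents (an empty <configuration> with no <property> entries), the local filesystem is used by default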
```
~/hadoop-0.18.2$ cat conf/hadoop-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>

</configuration>

~/hadoop-0.18.2$ echo "sed -e \"s/ /\n/g\" | grep ." > streamingMapper.sh
~/hadoop-0.18.2$ echo "uniq -c | awk '{print \$2 \"\t\" \$1}'" > streamingReducer.sh
~/hadoop-0.18.2$ chmod +x streamingMapper.sh streamingReducer.sh
~/hadoop-0.18.2$ bin/hadoop jar ./contrib/streaming/hadoop-0.18.2-streaming.jar -input conf -output out1 -mapper `pwd`/streamingMapper.sh -reducer `pwd`/streamingReducer.sh
```
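  - When running against DFS, use the -file option to ship the mapper and reducer programs with the job
  - More detailed streaming documentation: http://hadoop.apache.org/core/docs/r0.18.2/streaming.html
- Option 2: use Pipes (C++ native support of Hadoop)
  - Officially, MapReduce programs can currently be written only in Java and C++
  - Because the JobTracker itself is written in Java, the Java program (e.g. in run()) still has to tell the JobTracker how to locate the C++ executable
- Option 3: use Pig
  - Pig is a third way to avoid writing Java: you write queries in an SQL-like syntax, and Pig generates the MapReduce program (Java classes) and runs it for you
PHP + Hadoop Streaming example provided by the Taiwan Hadoop User Group
- Developing Hadoop programs with "a single machine" and "PHP"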
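Whatever language the streaming mapper and reducer are written in, they only read stdin and write stdout, so their logic can be checked on one machine with an ordinary pipeline before a job is submitted. The sketch below mirrors the shell word count from Option 1; the function names are hypothetical, `tr` is swapped in for the `sed` expression as a more portable way to split on spaces, and `sort` stands in for the sorting Hadoop performs between the map and reduce phases.

```shell
# Mapper: emit one word per line and drop empty lines.
streaming_mapper() {
    tr -s ' ' '\n' | grep .
}

# Reducer: expects sorted input and prints "word<TAB>count".
streaming_reducer() {
    uniq -c | awk '{print $2 "\t" $1}'
}

# Local stand-in for the whole job: map, sort by key, reduce.
wordcount_local() {
    streaming_mapper | LC_ALL=C sort | streaming_reducer
}

# Small smoke test on two lines of text.
printf 'a b a\nb c\n' | wordcount_local
```

The smoke test prints three lines: `a` with count 2, `b` with count 2, and `c` with count 1. A PHP mapper or reducer can be exercised the same way by replacing the matching stage of the pipeline with the script.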

Deploying Hadoop at scale

See the official documentation: http://hadoop.apache.org/core/docs/r0.18.2/cluster_setup.html
