[[PageOutline]]

= 2008-11-05 =

 * Devaraj Das visit
   * 10:00-12:00 Hands-on Labs (2): "Distributed Setup of Hadoop" @ 北群多媒體
   * (PM) Local sightseeing
   * 16:00 !BioImage Regular Progress Review @ supervisors' meeting room

== Hadoop Hands-on Labs (2) ==

 * Running the WordCount example (note: running `wordcount` with no arguments prints the usage text below; the `<...>` placeholders are restored here, since the wiki originally swallowed them as HTML tags)
{{{
~/hadoop-0.18.2$ bin/hadoop fs -put conf conf
~/hadoop-0.18.2$ bin/hadoop fs -ls
Found 1 items
drwxr-xr-x   - jazz supergroup          0 2008-11-05 19:34 /user/jazz/conf
~/hadoop-0.18.2$ bin/hadoop jar /home/jazz/hadoop-0.18.2/hadoop-0.18.2-examples.jar wordcount
ERROR: Wrong number of parameters: 0 instead of 2.
wordcount [-m <maps>] [-r <reduces>] <input> <output>
Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

~/hadoop-0.18.2$ bin/hadoop jar /home/jazz/hadoop-0.18.2/hadoop-0.18.2-examples.jar wordcount conf output
}}}
 * The WordCount source code
{{{
jazz@drbl:~/hadoop-0.18.2/$ vi src/examples/org/apache/hadoop/examples/WordCount.java
}}}
 * Demo of how to debug WordCount.java: deliberately throw an IOException to make the mapper fail
{{{
throw new IOException("SUICIDE");
}}}
 * Since the key is of type Text, the !OutputKeyClass must be set to Text
{{{
conf.setOutputKeyClass(Text.class);
}}}
 * Details are in the official documentation: http://hadoop.apache.org/core/docs/r0.18.2/mapred_tutorial.html
   * Input and Output Formats
     * Input and output are usually plain text, so the defaults are !TextInputFormat and !TextOutputFormat
     * If the input and output are binary, however, !SequenceFileInputFormat and !SequenceFileOutputFormat must be used as the Map/Reduce !KeyClass
   * Input -> !InputSplit -> !RecordReader
     * Hadoop splits the input into many !InputSplit chunks, but a record may end up straddling the boundary into the next !InputSplit, which has to be handled
   * The recommended number of reducers is 0.95 * num_nodes * mapred.tasktracker.tasks.maximum; the factor 0.95 reserves 5% headroom to absorb the impact of failures on other nodes
 * hadoop-streaming:
   * Its ability to handle binary data is still limited, so it is best used for plain-text processing
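Because a hadoop-streaming mapper and reducer are just programs that read stdin and write stdout, the whole word-count flow can be tried locally with plain Unix pipes before it is submitted to the cluster. This is a Hadoop-free sketch of the same map/shuffle/reduce flow (`sort` stands in for the framework's shuffle phase):

```shell
# Mapper: split words onto separate lines.
# Shuffle: sort brings identical keys together.
# Reducer: uniq -c counts each run of identical keys,
#          awk rewrites to "word<TAB>count" like the streaming reducer above.
printf 'foo bar foo\nbar foo\n' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c \
  | awk '{print $2 "\t" $1}'
```

Once this pipeline behaves as expected on a sample file, the same two stages can be wrapped in `streamingMapper.sh` / `streamingReducer.sh` and handed to the streaming jar.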
 * If the original hadoop-site.xml configuration is kept as-is (without adding any <property> entries), the local filesystem is used by default
{{{
~/hadoop-0.18.2$ cat conf/hadoop-site.xml
~/hadoop-0.18.2$ echo "sed -e \"s/ /\n/g\" | grep ." > streamingMapper.sh
~/hadoop-0.18.2$ echo "uniq -c | awk '{print $2 \"\t\" $1}'" > streamingReducer.sh
~/hadoop-0.18.2$ bin/hadoop jar ./contrib/streaming/hadoop-0.18.2-streaming.jar -input conf -output out1 -mapper `pwd`/streamingMapper.sh -reducer `pwd`/streamingReducer.sh
}}}
 * When DFS is involved, the mapper and reducer programs must be shipped into DFS with the -file option
{{{
}}}
 * A more in-depth streaming guide: http://hadoop.apache.org/core/docs/r0.18.2/streaming.html
 * Pipes (C++ native support of Hadoop)
   * Officially, only Java and C++ are supported for writing MapReduce programs
   * Since the JobTracker itself is written in Java, the Java program (e.g. in run()) must tell the JobTracker how to link to the C++ executable
 * Pig is a third option: instead of learning to write Java, you use SQL-like statements, and Pig generates the MapReduce program (Java classes) and runs it for you
 * [http://www.hadoop.tw/2008/09/php-hadoop.html Developing Hadoop programs on a "single machine" with "PHP"]
 * How to deploy Hadoop at scale
   * See the official documentation: http://hadoop.apache.org/core/docs/r0.18.2/cluster_setup.html
 * Todo:
   * get "Hadoop Cookbook"

== Deploy Hadoop with DRBL ==

 * Issue 1: The Hadoop namenode must be on the same subnet as the datanodes, so the namenode has to run on a DRBL client.
 * Issue 2: A datanode running on a DRBL client cannot join the namenode
   * namenode: 192.168.100.1 (DRBL Client 1)
{{{
jazz@hadoop101:~/hadoop-0.18.2$ bin/hadoop namenode
08/11/05 09:29:31 INFO dfs.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop101/192.168.100.1
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.18.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 709042; compiled by 'ndaley' on Thu Oct 30 01:07:18 UTC 2008
************************************************************/
08/11/05 09:28:09 INFO metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000
08/11/05 09:28:09 INFO dfs.NameNode: Namenode up at: hadoop101/192.168.100.1:9000
08/11/05 09:28:09 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
08/11/05 09:28:09 INFO dfs.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
08/11/05 09:28:09 INFO fs.FSNamesystem: fsOwner=jazz,jazz,dialout,cdrom,floppy,audio,video,plugdev
08/11/05 09:28:09 INFO fs.FSNamesystem: supergroup=supergroup
08/11/05 09:28:09 INFO fs.FSNamesystem: isPermissionEnabled=true
08/11/05 09:28:09 INFO dfs.FSNamesystemMetrics: Initializing FSNamesystemMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
08/11/05 09:28:09 INFO fs.FSNamesystem: Registered FSNamesystemStatusMBean
08/11/05 09:28:09 INFO dfs.Storage: Number of files = 0
08/11/05 09:28:09 INFO dfs.Storage: Number of files under construction = 0
08/11/05 09:28:09 INFO dfs.Storage: Image file of size 78 loaded in 0 seconds.
08/11/05 09:28:09 INFO dfs.Storage: Edits file edits of size 4 edits # 0 loaded in 0 seconds.
08/11/05 09:28:09 INFO fs.FSNamesystem: Finished loading FSImage in 250 msecs
08/11/05 09:28:09 INFO dfs.StateChange: STATE* Leaving safe mode after 0 secs.
08/11/05 09:28:09 INFO dfs.StateChange: STATE* Network topology has 0 racks and 0 datanodes
08/11/05 09:28:09 INFO dfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks
08/11/05 09:28:10 INFO util.Credential: Checking Resource aliases
08/11/05 09:28:10 INFO http.HttpServer: Version Jetty/5.1.4
08/11/05 09:28:10 INFO util.Container: Started HttpContext[/static,/static]
08/11/05 09:28:10 INFO util.Container: Started HttpContext[/logs,/logs]
08/11/05 09:28:11 INFO util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@1977b9b
08/11/05 09:28:11 INFO util.Container: Started WebApplicationContext[/,/]
08/11/05 09:28:11 INFO http.SocketListener: Started SocketListener on 0.0.0.0:50070
08/11/05 09:28:11 INFO util.Container: Started org.mortbay.jetty.Server@b8deef
08/11/05 09:28:11 INFO fs.FSNamesystem: Web-server up at: 0.0.0.0:50070
08/11/05 09:28:11 INFO ipc.Server: IPC Server Responder: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server listener on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 0 on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 1 on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 2 on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 3 on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 4 on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 5 on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 6 on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 7 on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 8 on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 9 on 9000: starting
}}}
{{{
jazz@hadoop101:~$ sudo netstat -ap
[sudo] password for jazz:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0      0 hadoop101:9000          [::]:*                  LISTEN      2841/java
tcp6       0      0 [::]:47279              [::]:*                  LISTEN      2841/java
tcp6       0      0 [::]:50070              [::]:*                  LISTEN      2841/java
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node   PID/Program name    Path
unix  2      [ ]         STREAM     CONNECTED     8074     2841/java
}}}
   * datanode: 192.168.100.2 (DRBL Client 2)
{{{
jazz@hadoop102:~/hadoop-0.18.2$ bin/hadoop datanode
08/11/05 09:22:05 INFO dfs.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = hadoop102/192.168.100.2
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.18.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 709042; compiled by 'ndaley' on Thu Oct 30 01:07:18 UTC 2008
************************************************************/
08/11/05 09:22:06 INFO dfs.Storage: Storage directory /tmp/hadoop-jazz/dfs/data is not formatted.
08/11/05 09:22:06 INFO dfs.Storage: Formatting ...
08/11/05 09:22:06 INFO dfs.DataNode: Registered FSDatasetStatusMBean
08/11/05 09:22:06 INFO dfs.DataNode: Opened info server at 50010
08/11/05 09:22:06 INFO dfs.DataNode: Balancing bandwith is 1048576 bytes/s
08/11/05 09:22:06 INFO util.Credential: Checking Resource aliases
08/11/05 09:22:06 INFO http.HttpServer: Version Jetty/5.1.4
08/11/05 09:22:06 INFO util.Container: Started HttpContext[/static,/static]
08/11/05 09:22:06 INFO util.Container: Started HttpContext[/logs,/logs]
08/11/05 09:22:07 INFO util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@1bb5c09
08/11/05 09:22:07 INFO util.Container: Started WebApplicationContext[/,/]
08/11/05 09:22:07 INFO http.SocketListener: Started SocketListener on 0.0.0.0:50075
08/11/05 09:22:07 INFO util.Container: Started org.mortbay.jetty.Server@15fadcf
08/11/05 09:22:07 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=DataNode, sessionId=null
08/11/05 09:22:07 INFO metrics.RpcMetrics: Initializing RPC Metrics with hostName=DataNode, port=50020
08/11/05 09:22:07 INFO ipc.Server: IPC Server Responder: starting
08/11/05 09:22:07 INFO ipc.Server: IPC Server listener on 50020: starting
08/11/05 09:22:07 INFO ipc.Server: IPC Server handler 0 on 50020: starting
08/11/05 09:22:07 INFO ipc.Server: IPC Server handler 1 on 50020: starting
08/11/05 09:22:07 INFO dfs.DataNode: dnRegistration = DatanodeRegistration(hadoop102:50010, storageID=, infoPort=50075, ipcPort=50020)
08/11/05 09:22:07 INFO ipc.Server: IPC Server handler 2 on 50020: starting
}}}
{{{
jazz@hadoop102:~/hadoop-0.18.2/conf$ sudo netstat -ap
[sudo] password for jazz:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0      0 [::]:50020              [::]:*                  LISTEN      1935/java
tcp6       0      0 [::]:56743              [::]:*                  LISTEN      1935/java
tcp6       0      0 [::]:50010              [::]:*                  LISTEN      1935/java
tcp6       0      0 [::]:50075              [::]:*                  LISTEN      1935/java
tcp6       0      0 hadoop102:40946         hadoop101:9000          ESTABLISHED 1935/java
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node   PID/Program name    Path
unix  2      [ ]         STREAM     CONNECTED     3945     1935/java
}}}
   * datanode: 192.168.100.254 / 172.21.253.129 (DRBL Server)
{{{
~/hadoop-0.18.2$ bin/hadoop datanode
08/11/05 09:26:23 INFO dfs.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = hadoop/172.21.253.129
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.18.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 709042; compiled by 'ndaley' on Thu Oct 30 01:07:18 UTC 2008
************************************************************/
08/11/05 09:26:24 INFO dfs.Storage: Storage directory /tmp/hadoop-jazz/dfs/data is not formatted.
08/11/05 09:26:24 INFO dfs.Storage: Formatting ...
08/11/05 09:26:24 INFO dfs.DataNode: Registered FSDatasetStatusMBean
08/11/05 09:26:24 INFO dfs.DataNode: Opened info server at 50010
08/11/05 09:26:24 INFO dfs.DataNode: Balancing bandwith is 1048576 bytes/s
08/11/05 09:26:24 INFO util.Credential: Checking Resource aliases
08/11/05 09:26:24 INFO http.HttpServer: Version Jetty/5.1.4
08/11/05 09:26:24 INFO util.Container: Started HttpContext[/static,/static]
08/11/05 09:26:24 INFO util.Container: Started HttpContext[/logs,/logs]
08/11/05 09:26:24 INFO util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@71dc3d
08/11/05 09:26:24 INFO util.Container: Started WebApplicationContext[/,/]
08/11/05 09:26:24 INFO http.SocketListener: Started SocketListener on 0.0.0.0:50075
08/11/05 09:26:24 INFO util.Container: Started org.mortbay.jetty.Server@5e179a
08/11/05 09:26:24 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=DataNode, sessionId=null
08/11/05 09:26:24 INFO metrics.RpcMetrics: Initializing RPC Metrics with hostName=DataNode, port=50020
08/11/05 09:26:24 INFO ipc.Server: IPC Server Responder: starting
08/11/05 09:26:24 INFO ipc.Server: IPC Server listener on 50020: starting
08/11/05 09:26:24 INFO ipc.Server: IPC Server handler 0 on 50020: starting
08/11/05 09:26:24 INFO ipc.Server: IPC Server handler 1 on 50020: starting
08/11/05 09:26:24 INFO dfs.DataNode: dnRegistration = DatanodeRegistration(hadoop:50010, storageID=, infoPort=50075, ipcPort=50020)
08/11/05 09:26:24 INFO ipc.Server: IPC Server handler 2 on 50020: starting
08/11/05 09:26:24 INFO dfs.DataNode: New storage id DS-1165131249-172.21.253.129-50010-1225848384798 is assigned to data-node 192.168.100.254:50010
08/11/05 09:26:24 INFO dfs.DataNode: DatanodeRegistration(192.168.100.254:50010, storageID=DS-1165131249-172.21.253.129-50010-1225848384798, infoPort=50075, ipcPort=50020)In DataNode.run, data = FSDataset{dirpath='/tmp/hadoop-jazz/dfs/data/current'}
08/11/05 09:26:24 INFO dfs.DataNode: using BLOCKREPORT_INTERVAL of 3600000msec Initial delay: 0msec
08/11/05 09:26:24 INFO dfs.DataNode: Starting Periodic block scanner.
08/11/05 09:26:27 INFO dfs.DataNode: BlockReport of 0 blocks got processed in 8 msecs
}}}
{{{
jazz@hadoop:~$ sudo netstat -ap
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0      0 [::]:50020              [::]:*                  LISTEN      8085/java
tcp6       0      0 [::]:1389               [::]:*                  LISTEN      8085/java
tcp6       0      0 [::]:50010              [::]:*                  LISTEN      8085/java
tcp6       0      0 [::]:50075              [::]:*                  LISTEN      8085/java
tcp6       0      0 hadoop-eth1:3590        hadoop101:9000          ESTABLISHED 8085/java
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node   PID/Program name    Path
unix  2      [ ]         STREAM     CONNECTED     16749    8085/java
}}}
   * At this point, two new lines appear in the namenode's console output
{{{
08/11/05 09:42:59 INFO dfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.100.254:50010 storage DS-1165131249-172.21.253.129-50010-1225848384798
08/11/05 09:42:59 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.100.254:50010
}}}
{{{
jazz@hadoop101:~$ sudo netstat -ap
[sudo] password for jazz:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0      0 hadoop101:9000          [::]:*                  LISTEN      2703/java
tcp6       0      0 [::]:50070              [::]:*                  LISTEN      2703/java
tcp6       0      0 [::]:58751              [::]:*                  LISTEN      2703/java
tcp6       0      0 hadoop101:9000          hadoop-eth1:3590        ESTABLISHED 2703/java
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node   PID/Program name    Path
unix  2      [ ]         STREAM     CONNECTED     7492     2703/java
}}}
 * Tracing with strace
 * Tracing with bash -x
{{{
~/hadoop-0.18.2$ bash -x bin/hadoop datanode
}}}
 * Tracing with tcpdump - monitoring the traffic to the namenode
{{{
~/hadoop-0.18.2$ tcpdump -i eth0 dst port 9000
}}}
 * The tcpdump results show that the datanode (DRBL client 2) does talk to the namenode (DRBL client 1), but unlike the DRBL server acting as a datanode, which completes its connection to the namenode, it just keeps retrying the connection; this suggests a permission or authentication problem.
 * I recall that waue once installed Hadoop on two machines with DRBL and ran into the problem of both machines contending for the same NFS space, so as a further experiment we mounted /dev/sda2 on DRBL client 2 and pointed HADOOP_HOME in conf/hadoop-env.sh at the physical disk to see whether that helps.
 * [Conclusion] Hadoop uses df to determine how much space is actually available, which can be seen in the namenode admin interface at http://x.x.x.x:50070. So 0.18.2 may have introduced a new safeguard that keeps the datanode (DRBL Client 2) from finding Storage to join the namenode (DRBL Client 1).

== debian package post-install script ==

 * Q: Since adding packages on the DRBL server sometimes requires a re-deploy, how can apt-get install or dpkg -i automatically run a script that performs the re-deploy afterwards?
 * [Reference] The localepurge package re-checks the locale directories for superfluous locale files after every install
   * [http://packages.debian.org/etch/all/localepurge localepurge] - Automagically remove unnecessary locale data
   * localepurge does this by placing a script under /etc/apt/apt.conf.d/; see /etc/apt/apt.conf.d/99-localepurge [http://packages.debian.org/etch/all/localepurge/filelist 1]
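One way to apply localepurge's trick here: APT's `DPkg::Post-Invoke` option runs a command after every dpkg invocation that apt drives. A sketch under assumptions - the hook file name and the re-deploy wrapper `/usr/local/sbin/drbl-redeploy.sh` are hypothetical, not an existing DRBL script:

```conf
# /etc/apt/apt.conf.d/99drbl-redeploy  (hypothetical hook file)
# DPkg::Post-Invoke commands run after each dpkg call made by apt,
# so the re-deploy wrapper fires after every install/upgrade/remove.
# "|| true" keeps a failed re-deploy from aborting the apt run.
DPkg::Post-Invoke { "test -x /usr/local/sbin/drbl-redeploy.sh && /usr/local/sbin/drbl-redeploy.sh || true"; };
```

Note this only covers apt-driven runs (apt-get, aptitude); a bare `dpkg -i` does not read apt.conf.d hooks, so that case would still need a manual re-deploy or a wrapper around dpkg.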