= 2008-11-05 =

* Devaraj Das visit
 * 10:00-12:00 Hands-on Labs (2): "Distributed Setup of Hadoop" @ 北群多媒體
 * (PM) Local sightseeing
 * 16:00 !BioImage Regular Progress Review @ executive meeting room

== Hadoop Hands-on Labs (2) ==

* Running the Wordcount example
{{{
~/hadoop-0.18.2$ bin/hadoop fs -put conf conf
~/hadoop-0.18.2$ bin/hadoop fs -ls
Found 1 items
drwxr-xr-x   - jazz supergroup          0 2008-11-05 19:34 /user/jazz/conf
~/hadoop-0.18.2$ bin/hadoop jar /home/jazz/hadoop-0.18.2/hadoop-0.18.2-examples.jar wordcount
ERROR: Wrong number of parameters: 0 instead of 2.
wordcount [-m <maps>] [-r <reduces>] <input> <output>
Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|jobtracker:port>    specify a job tracker
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

~/hadoop-0.18.2$ bin/hadoop jar /home/jazz/hadoop-0.18.2/hadoop-0.18.2-examples.jar wordcount conf output
}}}
* The Wordcount source code
{{{
jazz@drbl:~/hadoop-0.18.2$ vi src/examples/org/apache/hadoop/examples/WordCount.java
}}}
* Demonstrating how to debug WordCount.java: deliberately throw an IOException so the mapper fails
{{{
throw new IOException("SUICIDE");
}}}
* Since the key is of type Text, the !OutputKeyClass must be set to Text
{{{
conf.setOutputKeyClass(Text.class);
}}}
* Details are in the official documentation: http://hadoop.apache.org/core/docs/r0.18.2/mapred_tutorial.html
* Input and Output Formats
 * Input and output are usually plain text, so the defaults are !TextInputFormat and !TextOutputFormat.
 * If the input and output are binary, use !SequenceFileInputFormat and !SequenceFileOutputFormat as the job's input/output formats instead.
* Input -> !InputSplit -> !RecordReader
 * Hadoop splits the input into many !InputSplit chunks, so a record being processed may straddle the boundary into the next !InputSplit.
* The recommended number of reducers is 0.95 * num_nodes * mapred.tasktracker.tasks.maximum; the factor 0.95 reserves 5% of capacity to absorb the impact of node failures.
* hadoop-streaming:
 * Its ability to handle binary data is still limited, so it is best used for plain-text processing.
 * {{{
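# Editor's note (annotation, not part of the original transcript):
# streaming simply runs "mapper | sort-by-key | reducer" over plain text,
# so the word-count pipeline used by the two scripts created below can be
# dry-run locally, without a cluster, before submitting the job:
echo "foo bar foo" | sed -e "s/ /\n/g" | grep . | sort | uniq -c | awk '{print $2 "\t" $1}'
# prints each word with its count: "bar<TAB>1" then "foo<TAB>2"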
~/hadoop-0.18.2$ echo "sed -e \"s/ /\n/g\" | grep ." > streamingMapper.sh
~/hadoop-0.18.2$ echo "uniq -c | awk '{print \$2 \"\t\" \$1}'" > streamingReducer.sh
~/hadoop-0.18.2$ bin/hadoop jar ./contrib/streaming/hadoop-0.18.2-streaming.jar -input conf -output out1 -mapper streamingMapper.sh -reducer streamingReducer.sh -file ./streamingMapper.sh -file ./streamingReducer.sh
}}}
* [http://www.hadoop.tw/2008/09/php-hadoop.html Developing Hadoop programs on a "single machine" with "PHP"]
* Todo:
 * get "Hadoop Cookbook"

== Deploy Hadoop with DRBL ==

* Issue 1: the Hadoop namenode must be on the same subnet as the datanodes, so the namenode has to run on a DRBL client.
* Issue 2: a datanode running on a DRBL client could not join the namenode.
* namenode: 192.168.100.1 (DRBL Client 1)
{{{
jazz@hadoop101:~/hadoop-0.18.2$ bin/hadoop namenode
08/11/05 09:29:31 INFO dfs.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop101/192.168.100.1
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.18.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 709042; compiled by 'ndaley' on Thu Oct 30 01:07:18 UTC 2008
************************************************************/
08/11/05 09:28:09 INFO metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000
08/11/05 09:28:09 INFO dfs.NameNode: Namenode up at: hadoop101/192.168.100.1:9000
08/11/05 09:28:09 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
08/11/05 09:28:09 INFO dfs.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
08/11/05 09:28:09 INFO fs.FSNamesystem: fsOwner=jazz,jazz,dialout,cdrom,floppy,audio,video,plugdev
08/11/05 09:28:09 INFO fs.FSNamesystem: supergroup=supergroup
08/11/05 09:28:09 INFO fs.FSNamesystem: isPermissionEnabled=true
08/11/05 09:28:09 INFO dfs.FSNamesystemMetrics: Initializing FSNamesystemMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
08/11/05 09:28:09 INFO fs.FSNamesystem: Registered FSNamesystemStatusMBean
08/11/05 09:28:09 INFO dfs.Storage: Number of files = 0
08/11/05 09:28:09 INFO dfs.Storage: Number of files under construction = 0
08/11/05 09:28:09 INFO dfs.Storage: Image file of size 78 loaded in 0 seconds.
08/11/05 09:28:09 INFO dfs.Storage: Edits file edits of size 4 edits # 0 loaded in 0 seconds.
08/11/05 09:28:09 INFO fs.FSNamesystem: Finished loading FSImage in 250 msecs
08/11/05 09:28:09 INFO dfs.StateChange: STATE* Leaving safe mode after 0 secs.
08/11/05 09:28:09 INFO dfs.StateChange: STATE* Network topology has 0 racks and 0 datanodes
08/11/05 09:28:09 INFO dfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks
08/11/05 09:28:10 INFO util.Credential: Checking Resource aliases
08/11/05 09:28:10 INFO http.HttpServer: Version Jetty/5.1.4
08/11/05 09:28:10 INFO util.Container: Started HttpContext[/static,/static]
08/11/05 09:28:10 INFO util.Container: Started HttpContext[/logs,/logs]
08/11/05 09:28:11 INFO util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@1977b9b
08/11/05 09:28:11 INFO util.Container: Started WebApplicationContext[/,/]
08/11/05 09:28:11 INFO http.SocketListener: Started SocketListener on 0.0.0.0:50070
08/11/05 09:28:11 INFO util.Container: Started org.mortbay.jetty.Server@b8deef
08/11/05 09:28:11 INFO fs.FSNamesystem: Web-server up at: 0.0.0.0:50070
08/11/05 09:28:11 INFO ipc.Server: IPC Server Responder: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server listener on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 0 on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 1 on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 2 on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 3 on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 4 on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 5 on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 6 on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 7 on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 8 on 9000: starting
08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 9 on 9000: starting
}}}
{{{
jazz@hadoop101:~$ sudo netstat -ap
[sudo] password for jazz:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0      0 hadoop101:9000          [::]:*                  LISTEN      2841/java
tcp6       0      0 [::]:47279              [::]:*                  LISTEN      2841/java
tcp6       0      0 [::]:50070              [::]:*                  LISTEN      2841/java
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node PID/Program name    Path
unix  2      [ ]         STREAM     CONNECTED     8074   2841/java
}}}
* datanode: 192.168.100.2 (DRBL Client 2)
{{{
jazz@hadoop102:~/hadoop-0.18.2$ bin/hadoop datanode
08/11/05 09:22:05 INFO dfs.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = hadoop102/192.168.100.2
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.18.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 709042; compiled by 'ndaley' on Thu Oct 30 01:07:18 UTC 2008
************************************************************/
08/11/05 09:22:06 INFO dfs.Storage: Storage directory /tmp/hadoop-jazz/dfs/data is not formatted.
08/11/05 09:22:06 INFO dfs.Storage: Formatting ...
08/11/05 09:22:06 INFO dfs.DataNode: Registered FSDatasetStatusMBean
08/11/05 09:22:06 INFO dfs.DataNode: Opened info server at 50010
08/11/05 09:22:06 INFO dfs.DataNode: Balancing bandwith is 1048576 bytes/s
08/11/05 09:22:06 INFO util.Credential: Checking Resource aliases
08/11/05 09:22:06 INFO http.HttpServer: Version Jetty/5.1.4
08/11/05 09:22:06 INFO util.Container: Started HttpContext[/static,/static]
08/11/05 09:22:06 INFO util.Container: Started HttpContext[/logs,/logs]
08/11/05 09:22:07 INFO util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@1bb5c09
08/11/05 09:22:07 INFO util.Container: Started WebApplicationContext[/,/]
08/11/05 09:22:07 INFO http.SocketListener: Started SocketListener on 0.0.0.0:50075
08/11/05 09:22:07 INFO util.Container: Started org.mortbay.jetty.Server@15fadcf
08/11/05 09:22:07 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=DataNode, sessionId=null
08/11/05 09:22:07 INFO metrics.RpcMetrics: Initializing RPC Metrics with hostName=DataNode, port=50020
08/11/05 09:22:07 INFO ipc.Server: IPC Server Responder: starting
08/11/05 09:22:07 INFO ipc.Server: IPC Server listener on 50020: starting
08/11/05 09:22:07 INFO ipc.Server: IPC Server handler 0 on 50020: starting
08/11/05 09:22:07 INFO ipc.Server: IPC Server handler 1 on 50020: starting
08/11/05 09:22:07 INFO dfs.DataNode: dnRegistration = DatanodeRegistration(hadoop102:50010, storageID=, infoPort=50075, ipcPort=50020)
08/11/05 09:22:07 INFO ipc.Server: IPC Server handler 2 on 50020: starting
}}}
{{{
jazz@hadoop102:~/hadoop-0.18.2/conf$ sudo netstat -ap
[sudo] password for jazz:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0      0 [::]:50020              [::]:*                  LISTEN      1935/java
tcp6       0      0 [::]:56743              [::]:*                  LISTEN      1935/java
tcp6       0      0 [::]:50010              [::]:*                  LISTEN      1935/java
tcp6       0      0 [::]:50075              [::]:*                  LISTEN      1935/java
tcp6       0      0 hadoop102:40946         hadoop101:9000          ESTABLISHED 1935/java
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node PID/Program name    Path
unix  2      [ ]         STREAM     CONNECTED     3945   1935/java
}}}
* datanode: 192.168.100.254 / 172.21.253.129 (DRBL Server)
{{{
jazz@hadoop:~/hadoop-0.18.2$ bin/hadoop datanode
08/11/05 09:26:23 INFO dfs.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = hadoop/172.21.253.129
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.18.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 709042; compiled by 'ndaley' on Thu Oct 30 01:07:18 UTC 2008
************************************************************/
08/11/05 09:26:24 INFO dfs.Storage: Storage directory /tmp/hadoop-jazz/dfs/data is not formatted.
08/11/05 09:26:24 INFO dfs.Storage: Formatting ...
08/11/05 09:26:24 INFO dfs.DataNode: Registered FSDatasetStatusMBean
08/11/05 09:26:24 INFO dfs.DataNode: Opened info server at 50010
08/11/05 09:26:24 INFO dfs.DataNode: Balancing bandwith is 1048576 bytes/s
08/11/05 09:26:24 INFO util.Credential: Checking Resource aliases
08/11/05 09:26:24 INFO http.HttpServer: Version Jetty/5.1.4
08/11/05 09:26:24 INFO util.Container: Started HttpContext[/static,/static]
08/11/05 09:26:24 INFO util.Container: Started HttpContext[/logs,/logs]
08/11/05 09:26:24 INFO util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@71dc3d
08/11/05 09:26:24 INFO util.Container: Started WebApplicationContext[/,/]
08/11/05 09:26:24 INFO http.SocketListener: Started SocketListener on 0.0.0.0:50075
08/11/05 09:26:24 INFO util.Container: Started org.mortbay.jetty.Server@5e179a
08/11/05 09:26:24 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=DataNode, sessionId=null
08/11/05 09:26:24 INFO metrics.RpcMetrics: Initializing RPC Metrics with hostName=DataNode, port=50020
08/11/05 09:26:24 INFO ipc.Server: IPC Server Responder: starting
08/11/05 09:26:24 INFO ipc.Server: IPC Server listener on 50020: starting
08/11/05 09:26:24 INFO ipc.Server: IPC Server handler 0 on 50020: starting
08/11/05 09:26:24 INFO ipc.Server: IPC Server handler 1 on 50020: starting
08/11/05 09:26:24 INFO dfs.DataNode: dnRegistration = DatanodeRegistration(hadoop:50010, storageID=, infoPort=50075, ipcPort=50020)
08/11/05 09:26:24 INFO ipc.Server: IPC Server handler 2 on 50020: starting
08/11/05 09:26:24 INFO dfs.DataNode: New storage id DS-1165131249-172.21.253.129-50010-1225848384798 is assigned to data-node 192.168.100.254:50010
08/11/05 09:26:24 INFO dfs.DataNode: DatanodeRegistration(192.168.100.254:50010, storageID=DS-1165131249-172.21.253.129-50010-1225848384798, infoPort=50075, ipcPort=50020)In DataNode.run, data = FSDataset{dirpath='/tmp/hadoop-jazz/dfs/data/current'}
08/11/05 09:26:24 INFO dfs.DataNode: using BLOCKREPORT_INTERVAL of 3600000msec Initial delay: 0msec
08/11/05 09:26:24 INFO dfs.DataNode: Starting Periodic block scanner.
08/11/05 09:26:27 INFO dfs.DataNode: BlockReport of 0 blocks got processed in 8 msecs
}}}
{{{
jazz@hadoop:~$ sudo netstat -ap
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0      0 [::]:50020              [::]:*                  LISTEN      8085/java
tcp6       0      0 [::]:1389               [::]:*                  LISTEN      8085/java
tcp6       0      0 [::]:50010              [::]:*                  LISTEN      8085/java
tcp6       0      0 [::]:50075              [::]:*                  LISTEN      8085/java
tcp6       0      0 hadoop-eth1:3590        hadoop101:9000          ESTABLISHED 8085/java
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node PID/Program name    Path
unix  2      [ ]         STREAM     CONNECTED     16749  8085/java
}}}
* At this point, two new lines appear on the namenode's console:
{{{
08/11/05 09:42:59 INFO dfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.100.254:50010 storage DS-1165131249-172.21.253.129-50010-1225848384798
08/11/05 09:42:59 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.100.254:50010
}}}
{{{
jazz@hadoop101:~$ sudo netstat -ap
[sudo] password for jazz:
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0      0 hadoop101:9000          [::]:*                  LISTEN      2703/java
tcp6       0      0 [::]:50070              [::]:*                  LISTEN      2703/java
tcp6       0      0 [::]:58751              [::]:*                  LISTEN      2703/java
tcp6       0      0 hadoop101:9000          hadoop-eth1:3590        ESTABLISHED 2703/java
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node PID/Program name    Path
unix  2      [ ]         STREAM     CONNECTED     7492   2703/java
}}}

== debian package post-install script ==

* Q: Installing new packages on the DRBL server sometimes requires a re-deploy, so how can apt-get install or dpkg -i be made to run a re-deploy script automatically afterwards?
* [Reference] The localepurge package re-checks the locale directories for unneeded locale files after every installation.
 * [http://packages.debian.org/etch/all/localepurge localepurge] - Automagically remove unnecessary locale data
 * localepurge's approach is to place a script in /etc/apt/apt.conf.d/; see /etc/apt/apt.conf.d/99-localepurge [http://packages.debian.org/etch/all/localepurge/filelist 1]
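* Following the localepurge approach, a minimal sketch of such a hook is below. The file name 99drbl-redeploy and the re-deploy script path /usr/local/sbin/drbl-redeploy are illustrative assumptions, not from these notes:
{{{
# /etc/apt/apt.conf.d/99drbl-redeploy  (hypothetical file name)
# Run the re-deploy script after every dpkg run triggered through apt;
# "|| true" keeps a failed re-deploy from aborting the apt transaction.
DPkg::Post-Invoke { "/usr/local/sbin/drbl-redeploy || true"; };
}}}
 * Note that apt's DPkg::Post-Invoke hook only fires for apt-driven installs (apt-get, aptitude); a bare `dpkg -i` bypasses apt.conf, so that case would still need a manual re-deploy or a dpkg-level mechanism.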