wiki:jazz/08-11-05

2008-11-05

  • Devaraj Das visit -
    • 10:00-12:00 Hands-on Labs (2): "Distributed Setup of Hadoop" @ 北群多媒體
    • (PM) Local sightseeing
  • 16:00 BioImage Regular Progress Review @ executive meeting room

Hadoop Hands-on Labs (2)

  • Run the Wordcount example
    ~/hadoop-0.18.2$ bin/hadoop fs -put conf conf
    ~/hadoop-0.18.2$ bin/hadoop fs -ls
    Found 1 items
    drwxr-xr-x   - jazz supergroup          0 2008-11-05 19:34 /user/jazz/conf
    ~/hadoop-0.18.2$ bin/hadoop jar /home/jazz/hadoop-0.18.2/hadoop-0.18.2-examples.jar wordcount
    ERROR: Wrong number of parameters: 0 instead of 2.
    wordcount [-m <maps>] [-r <reduces>] <input> <output>
    Generic options supported are
    -conf <configuration file>     specify an application configuration file
    -D <property=value>            use value for given property
    -fs <local|namenode:port>      specify a namenode
    -jt <local|jobtracker:port>    specify a job tracker
    -files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
    -libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
    -archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.
    
    The general command line syntax is
    bin/hadoop command [genericOptions] [commandOptions]
    ~/hadoop-0.18.2$ bin/hadoop jar /home/jazz/hadoop-0.18.2/hadoop-0.18.2-examples.jar wordcount conf output
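The word count that the MapReduce job computes can also be reproduced locally with standard Unix tools, which is a handy sanity check on the job's output without needing the cluster. A minimal sketch (the sample file and path are made up for illustration):

```shell
# Local sanity check (no cluster needed): split on spaces, sort,
# count duplicates, then print "word<TAB>count" like the job does.
printf 'hello world\nhello hadoop\n' > /tmp/wc-sample.txt
tr -s ' ' '\n' < /tmp/wc-sample.txt | sort | uniq -c | awk '{print $2 "\t" $1}'
# prints (tab-separated):
# hadoop  1
# hello   2
# world   1
```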
    
  • Wordcount source code
    jazz@drbl:~/hadoop-0.18.2/$ vi src/examples/org/apache/hadoop/examples/WordCount.java
    
  • Demonstrating how to debug WordCount.java: deliberately throw an IOException so the mapper fails
          throw new IOException("SUICIDE");
    
  • Because the key is of type Text, the OutputKeyClass must be set to Text
        conf.setOutputKeyClass(Text.class);
    
  • Detailed documentation is in the official tutorial: http://hadoop.apache.org/core/docs/r0.18.2/mapred_tutorial.html
  • Input and Output Formats
    • Input and output are usually plain text, so the defaults are TextInputFormat and TextOutputFormat
    • But if the input and output are binary, SequenceFileInputFormat and SequenceFileOutputFormat must be used as the job's input/output formats instead
  • Input -> InputSplit -> RecordReader
    • Hadoop splits the input into many InputSplits, but a record may straddle a boundary, so the data you need can end up in a different InputSplit
  • The recommended number of reducers is 0.95 * num_nodes * mapred.tasktracker.tasks.maximum; the 0.95 leaves 5% headroom to absorb the impact of other nodes failing.
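A worked instance of that rule of thumb, using hypothetical numbers (10 nodes, 2 task slots per tasktracker; neither figure is from this cluster):

```shell
# 0.95 * num_nodes * slots_per_node, truncated to an integer,
# gives the suggested reducer count for this hypothetical cluster.
nodes=10
slots_per_node=2
awk -v n="$nodes" -v s="$slots_per_node" 'BEGIN { print int(0.95 * n * s) }'
# prints 19
```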
  • hadoop-streaming:
    • Its ability to handle binary data is still limited, so it is best used for plain-text processing.
    • If the stock hadoop-site.xml configuration is kept (no <property> entries added), the local filesystem is used by default
      ~/hadoop-0.18.2$ cat conf/hadoop-site.xml
      <?xml version="1.0"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      
      <!-- Put site-specific property overrides in this file. -->
      
      <configuration>
      
      </configuration>
      
      ~/hadoop-0.18.2$ echo "sed -e \"s/ /\n/g\" | grep ." > streamingMapper.sh
      ~/hadoop-0.18.2$ echo "uniq -c | awk '{print \$2 \"\t\" \$1}'" > streamingReducer.sh
      ~/hadoop-0.18.2$ bin/hadoop jar ./contrib/streaming/hadoop-0.18.2-streaming.jar -input conf -output out1 -mapper `pwd`/streamingMapper.sh -reducer `pwd`/streamingReducer.sh
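Streaming jobs can be debugged without a cluster: per partition, Hadoop streaming is equivalent to `mapper | sort | reducer`. A local sanity check of the two scripts above (recreated here under /tmp, with quoting that keeps $1/$2 literal inside the reducer):

```shell
# Recreate the streaming scripts; single quotes stop the outer shell
# from expanding $1/$2 before they reach the file.
echo 'sed -e "s/ /\n/g" | grep .' > /tmp/streamingMapper.sh
echo 'uniq -c | awk '\''{print $2 "\t" $1}'\''' > /tmp/streamingReducer.sh
chmod +x /tmp/streamingMapper.sh /tmp/streamingReducer.sh
# Simulate one map/reduce partition locally: mapper | sort | reducer.
echo 'a b a' | sh /tmp/streamingMapper.sh | sort | sh /tmp/streamingReducer.sh
# prints (tab-separated):
# a  2
# b  1
```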
      
    • If DFS is involved, the mapper and reducer programs must be shipped into the job via the -file option
    • A more in-depth streaming guide is at http://hadoop.apache.org/core/docs/r0.18.2/streaming.html
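For reference, the shape of the streaming invocation when the scripts must be shipped with -file (assembled as a string here rather than executed, since it needs a live cluster; the out2 output directory is made up):

```shell
# -file copies each local script into the job's working directory on
# every node, instead of relying on a shared local filesystem path.
# (Command is only printed here; running it needs a cluster.)
cmd="bin/hadoop jar ./contrib/streaming/hadoop-0.18.2-streaming.jar \
  -input conf -output out2 \
  -mapper streamingMapper.sh -reducer streamingReducer.sh \
  -file streamingMapper.sh -file streamingReducer.sh"
echo "$cmd"
```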
  • Pipes (C++ native support of Hadoop)
    • Officially only Java and C++ are supported for writing MapReduce programs
    • Since the JobTracker is still written in Java, a Java driver (e.g. run()) must tell the JobTracker how to hook up the C++ executable
  • Pig is a third option that avoids writing Java: you use an SQL-like language, and Pig generates the MapReduce program (Java classes) and runs it for you
  • Developing Hadoop programs on a "single machine" with "PHP"
  • Methods for deploying Hadoop at scale
  • Todo:
    • get "Hadoop Cookbook"

Deploy Hadoop with DRBL

  • Issue 1: The Hadoop namenode must be on the same subnet as the datanodes, so the namenode has to run on a DRBL client.
  • Issue 2: A datanode running on a DRBL client cannot join the namenode
  • namenode: 192.168.100.1 (DRBL Client 1)
    jazz@hadoop101:~/hadoop-0.18.2$ bin/hadoop namenode
    08/11/05 09:29:31 INFO dfs.NameNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting NameNode
    STARTUP_MSG:   host = hadoop101/192.168.100.1
    STARTUP_MSG:   args = []
    STARTUP_MSG:   version = 0.18.2
    STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 709042; compiled by 'ndaley' on Thu Oct 30 01:07:18 UTC 2008
    ************************************************************/
    08/11/05 09:28:09 INFO metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000
    08/11/05 09:28:09 INFO dfs.NameNode: Namenode up at: hadoop101/192.168.100.1:9000
    08/11/05 09:28:09 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
    08/11/05 09:28:09 INFO dfs.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
    08/11/05 09:28:09 INFO fs.FSNamesystem: fsOwner=jazz,jazz,dialout,cdrom,floppy,audio,video,plugdev
    08/11/05 09:28:09 INFO fs.FSNamesystem: supergroup=supergroup
    
    08/11/05 09:28:09 INFO fs.FSNamesystem: isPermissionEnabled=true
    08/11/05 09:28:09 INFO dfs.FSNamesystemMetrics: Initializing FSNamesystemMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
    08/11/05 09:28:09 INFO fs.FSNamesystem: Registered FSNamesystemStatusMBean
    08/11/05 09:28:09 INFO dfs.Storage: Number of files = 0
    08/11/05 09:28:09 INFO dfs.Storage: Number of files under construction = 0
    08/11/05 09:28:09 INFO dfs.Storage: Image file of size 78 loaded in 0 seconds.
    08/11/05 09:28:09 INFO dfs.Storage: Edits file edits of size 4 edits # 0 loaded in 0 seconds.
    08/11/05 09:28:09 INFO fs.FSNamesystem: Finished loading FSImage in 250 msecs
    08/11/05 09:28:09 INFO dfs.StateChange: STATE* Leaving safe mode after 0 secs.
    08/11/05 09:28:09 INFO dfs.StateChange: STATE* Network topology has 0 racks and 0 datanodes
    08/11/05 09:28:09 INFO dfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks
    08/11/05 09:28:10 INFO util.Credential: Checking Resource aliases
    08/11/05 09:28:10 INFO http.HttpServer: Version Jetty/5.1.4
    08/11/05 09:28:10 INFO util.Container: Started HttpContext[/static,/static]
    08/11/05 09:28:10 INFO util.Container: Started HttpContext[/logs,/logs]
    08/11/05 09:28:11 INFO util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@1977b9b
    08/11/05 09:28:11 INFO util.Container: Started WebApplicationContext[/,/]
    08/11/05 09:28:11 INFO http.SocketListener: Started SocketListener on 0.0.0.0:50070
    08/11/05 09:28:11 INFO util.Container: Started org.mortbay.jetty.Server@b8deef
    08/11/05 09:28:11 INFO fs.FSNamesystem: Web-server up at: 0.0.0.0:50070
    08/11/05 09:28:11 INFO ipc.Server: IPC Server Responder: starting
    08/11/05 09:28:11 INFO ipc.Server: IPC Server listener on 9000: starting
    08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 0 on 9000: starting
    08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 1 on 9000: starting
    08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 2 on 9000: starting
    08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 3 on 9000: starting
    08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 4 on 9000: starting
    08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 5 on 9000: starting
    08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 6 on 9000: starting
    08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 7 on 9000: starting
    08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 8 on 9000: starting
    08/11/05 09:28:11 INFO ipc.Server: IPC Server handler 9 on 9000: starting
    
    jazz@hadoop101:~$ sudo netstat -ap
    [sudo] password for jazz:
    Active Internet connections (servers and established)
    Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
    tcp6       0      0 hadoop101:9000          [::]:*                  LISTEN      2841/java
    tcp6       0      0 [::]:47279              [::]:*                  LISTEN      2841/java
    tcp6       0      0 [::]:50070              [::]:*                  LISTEN      2841/java
    
    Active UNIX domain sockets (servers and established)
    Proto RefCnt Flags       Type       State         I-Node   PID/Program name    Path
    unix  2      [ ]         STREAM     CONNECTED     8074     2841/java
    
  • datanode: 192.168.100.2 (DRBL Client 2)
    jazz@hadoop102:~/hadoop-0.18.2$ bin/hadoop datanode
    08/11/05 09:22:05 INFO dfs.DataNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting DataNode
    STARTUP_MSG:   host = hadoop102/192.168.100.2
    STARTUP_MSG:   args = []
    STARTUP_MSG:   version = 0.18.2
    STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 709042; compiled by 'ndaley' on Thu Oct 30 01:07:18 UTC 2008
    ************************************************************/
    08/11/05 09:22:06 INFO dfs.Storage: Storage directory /tmp/hadoop-jazz/dfs/data is not formatted.
    08/11/05 09:22:06 INFO dfs.Storage: Formatting ...
    08/11/05 09:22:06 INFO dfs.DataNode: Registered FSDatasetStatusMBean
    08/11/05 09:22:06 INFO dfs.DataNode: Opened info server at 50010
    08/11/05 09:22:06 INFO dfs.DataNode: Balancing bandwith is 1048576 bytes/s
    08/11/05 09:22:06 INFO util.Credential: Checking Resource aliases
    08/11/05 09:22:06 INFO http.HttpServer: Version Jetty/5.1.4
    08/11/05 09:22:06 INFO util.Container: Started HttpContext[/static,/static]
    08/11/05 09:22:06 INFO util.Container: Started HttpContext[/logs,/logs]
    08/11/05 09:22:07 INFO util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@1bb5c09
    08/11/05 09:22:07 INFO util.Container: Started WebApplicationContext[/,/]
    08/11/05 09:22:07 INFO http.SocketListener: Started SocketListener on 0.0.0.0:50075
    08/11/05 09:22:07 INFO util.Container: Started org.mortbay.jetty.Server@15fadcf
    08/11/05 09:22:07 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=DataNode, sessionId=null
    08/11/05 09:22:07 INFO metrics.RpcMetrics: Initializing RPC Metrics with hostName=DataNode, port=50020
    08/11/05 09:22:07 INFO ipc.Server: IPC Server Responder: starting
    08/11/05 09:22:07 INFO ipc.Server: IPC Server listener on 50020: starting
    08/11/05 09:22:07 INFO ipc.Server: IPC Server handler 0 on 50020: starting
    08/11/05 09:22:07 INFO ipc.Server: IPC Server handler 1 on 50020: starting
    08/11/05 09:22:07 INFO dfs.DataNode: dnRegistration = DatanodeRegistration(hadoop102:50010, storageID=, infoPort=50075, ipcPort=50020)
    08/11/05 09:22:07 INFO ipc.Server: IPC Server handler 2 on 50020: starting
    
    jazz@hadoop102:~/hadoop-0.18.2/conf$ sudo netstat -ap
    [sudo] password for jazz:
    Active Internet connections (servers and established)
    Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
    tcp6       0      0 [::]:50020              [::]:*                  LISTEN      1935/java
    tcp6       0      0 [::]:56743              [::]:*                  LISTEN      1935/java
    tcp6       0      0 [::]:50010              [::]:*                  LISTEN      1935/java
    tcp6       0      0 [::]:50075              [::]:*                  LISTEN      1935/java
    tcp6       0      0 hadoop102:40946         hadoop101:9000          ESTABLISHED 1935/java
    Active UNIX domain sockets (servers and established)
    Proto RefCnt Flags       Type       State         I-Node   PID/Program name    Path
    unix  2      [ ]         STREAM     CONNECTED     3945     1935/java
    
  • datanode : 192.168.100.254 / 172.21.253.129 (DRBL Server)
    ~/hadoop-0.18.2$ bin/hadoop datanode
    08/11/05 09:26:23 INFO dfs.DataNode: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting DataNode
    STARTUP_MSG:   host = hadoop/172.21.253.129
    STARTUP_MSG:   args = []
    STARTUP_MSG:   version = 0.18.2
    STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 709042; compiled by 'ndaley' on Thu Oct 30 01:07:18 UTC 2008
    ************************************************************/
    08/11/05 09:26:24 INFO dfs.Storage: Storage directory /tmp/hadoop-jazz/dfs/data is not formatted.
    08/11/05 09:26:24 INFO dfs.Storage: Formatting ...
    08/11/05 09:26:24 INFO dfs.DataNode: Registered FSDatasetStatusMBean
    08/11/05 09:26:24 INFO dfs.DataNode: Opened info server at 50010
    08/11/05 09:26:24 INFO dfs.DataNode: Balancing bandwith is 1048576 bytes/s
    08/11/05 09:26:24 INFO util.Credential: Checking Resource aliases
    08/11/05 09:26:24 INFO http.HttpServer: Version Jetty/5.1.4
    08/11/05 09:26:24 INFO util.Container: Started HttpContext[/static,/static]
    08/11/05 09:26:24 INFO util.Container: Started HttpContext[/logs,/logs]
    08/11/05 09:26:24 INFO util.Container: Started org.mortbay.jetty.servlet.WebApplicationHandler@71dc3d
    08/11/05 09:26:24 INFO util.Container: Started WebApplicationContext[/,/]
    08/11/05 09:26:24 INFO http.SocketListener: Started SocketListener on 0.0.0.0:50075
    08/11/05 09:26:24 INFO util.Container: Started org.mortbay.jetty.Server@5e179a
    08/11/05 09:26:24 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=DataNode, sessionId=null
    08/11/05 09:26:24 INFO metrics.RpcMetrics: Initializing RPC Metrics with hostName=DataNode, port=50020
    08/11/05 09:26:24 INFO ipc.Server: IPC Server Responder: starting
    08/11/05 09:26:24 INFO ipc.Server: IPC Server listener on 50020: starting
    08/11/05 09:26:24 INFO ipc.Server: IPC Server handler 0 on 50020: starting
    08/11/05 09:26:24 INFO ipc.Server: IPC Server handler 1 on 50020: starting
    08/11/05 09:26:24 INFO dfs.DataNode: dnRegistration = DatanodeRegistration(hadoop:50010, storageID=, infoPort=50075, ipcPort=50020)
    08/11/05 09:26:24 INFO ipc.Server: IPC Server handler 2 on 50020: starting
    08/11/05 09:26:24 INFO dfs.DataNode: New storage id DS-1165131249-172.21.253.129-50010-1225848384798 is assigned to data-node 192.168.100.254:50010
    08/11/05 09:26:24 INFO dfs.DataNode: DatanodeRegistration(192.168.100.254:50010, storageID=DS-1165131249-172.21.253.129-50010-1225848384798, infoPort=50075, ipcPort=50020)In DataNode.run, data = FSDataset{dirpath='/tmp/hadoop-jazz/dfs/data/current'}
    08/11/05 09:26:24 INFO dfs.DataNode: using BLOCKREPORT_INTERVAL of 3600000msec Initial delay: 0msec
    08/11/05 09:26:24 INFO dfs.DataNode: Starting Periodic block scanner.
    08/11/05 09:26:27 INFO dfs.DataNode: BlockReport of 0 blocks got processed in 8 msecs
    
    jazz@hadoop:~$ sudo netstat -ap
    Active Internet connections (servers and established)
    Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
    tcp6       0      0 [::]:50020              [::]:*                  LISTEN      8085/java
    tcp6       0      0 [::]:1389               [::]:*                  LISTEN      8085/java
    tcp6       0      0 [::]:50010              [::]:*                  LISTEN      8085/java
    tcp6       0      0 [::]:50075              [::]:*                  LISTEN      8085/java
    tcp6       0      0 hadoop-eth1:3590        hadoop101:9000          ESTABLISHED 8085/java
    Active UNIX domain sockets (servers and established)
    Proto RefCnt Flags       Type       State         I-Node   PID/Program name    Path
    unix  2      [ ]         STREAM     CONNECTED     16749    8085/java
    
  • At this point, two new lines appear in the namenode's console output
    08/11/05 09:42:59 INFO dfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.100.254:50010 storage DS-1165131249-172.21.253.129-50010-1225848384798
    08/11/05 09:42:59 INFO net.NetworkTopology: Adding a new node: /default-rack/192.168.100.254:50010
    
    jazz@hadoop101:~$ sudo netstat -ap
    [sudo] password for jazz:
    Active Internet connections (servers and established)
    Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
    tcp6       0      0 hadoop101:9000          [::]:*                  LISTEN      2703/java
    tcp6       0      0 [::]:50070              [::]:*                  LISTEN      2703/java
    tcp6       0      0 [::]:58751              [::]:*                  LISTEN      2703/java
    tcp6       0      0 hadoop101:9000          hadoop-eth1:3590        ESTABLISHED 2703/java
    Active UNIX domain sockets (servers and established)
    Proto RefCnt Flags       Type       State         I-Node   PID/Program name    Path
    unix  2      [ ]         STREAM     CONNECTED     7492     2703/java
    
  • Tracing with strace
  • Tracing with bash -x
    ~/hadoop-0.18.2$ bash -x bin/hadoop datanode
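bash -x prints every command (prefixed with "+") to stderr before running it, which is what makes the hadoop wrapper script traceable. A self-contained illustration on a throwaway script (file names are made up):

```shell
# Write a tiny script, run it under bash -x, and capture the trace.
cat > /tmp/trace-demo.sh <<'EOF'
JAVA_HEAP_MAX=-Xmx1000m
echo "heap: $JAVA_HEAP_MAX"
EOF
bash -x /tmp/trace-demo.sh 2> /tmp/trace-demo.log
# The trace shows each command after variable expansion:
grep '^+' /tmp/trace-demo.log
# prints:
# + JAVA_HEAP_MAX=-Xmx1000m
# + echo 'heap: -Xmx1000m'
```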
    
  • Tracing with tcpdump - monitoring the traffic to the namenode
    ~/hadoop-0.18.2$ tcpdump -i eth0 dst port 9000
    
  • The tcpdump results show that the datanode (DRBL client 2) does communicate with the namenode (DRBL client 1), but unlike the DRBL server acting as a datanode it never completes the connection - overall it looks stuck in a retry loop - so a permission/authentication problem was suspected.
  • As I recall, waue once installed Hadoop on two machines with DRBL and ran into both of them fighting over the same NFS space, so the follow-up experiment was to mount /dev/sda2 on DRBL client 2 and point HADOOP_HOME in conf/hadoop-env.sh at the physical disk to see whether that works.
  • [Conclusion] Hadoop uses df to determine how much space is actually available, which can be seen in the namenode management UI at http://x.x.x.x:50070. Perhaps 0.18.2 added new safeguards, which would explain why the datanode (DRBL Client 2) cannot find Storage to join the namenode (DRBL Client 1).
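Since Hadoop's capacity figures come from the same information df reports, you can check what a datanode storing blocks under /tmp would see (the values printed are of course machine-specific):

```shell
# The datanode here keeps blocks under /tmp/hadoop-jazz/dfs/data, so
# the capacity shown on the namenode UI corresponds to the filesystem
# holding /tmp. -P forces POSIX one-line-per-filesystem output.
df -P /tmp | awk 'NR==2 {print "filesystem:", $1, "available-KB:", $4}'
```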

debian package post-install script

  • Q: Adding packages on the DRBL Server sometimes requires a re-deploy, so how can apt-get install or dpkg -i automatically run a script afterwards to do the re-deploy?
  • [Reference] The localepurge package re-checks the locale directories for surplus locale files after every install
    • localepurge - Automagically remove unnecessary locale data
    • localepurge does this by dropping a script into /etc/apt/apt.conf.d/; see /etc/apt/apt.conf.d/99-localepurge
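localepurge's trick generalizes: APT executes every hook registered under DPkg::Post-Invoke after each dpkg run. A minimal sketch of such a hook (the file name and the re-deploy script path are hypothetical):

```
# /etc/apt/apt.conf.d/99-drbl-redeploy  (hypothetical file name)
# APT runs each Post-Invoke entry after every dpkg invocation, so the
# DRBL re-deploy script fires after apt-get install or dpkg -i.
DPkg::Post-Invoke { "/usr/local/sbin/drbl-redeploy.sh || true"; };
```

The `|| true` keeps a failed re-deploy from aborting the package operation itself.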
Last modified on Nov 10, 2008, 11:55:13 AM