
Version 9 (modified by jazz, 14 years ago)

--

Lab 1: Hadoop 0.20 Installation (single node, pseudo-distributed)

Preface

  • This lab is based on the Ubuntu 9.04 Desktop environment provided in the classroom.
  • Some commands on this page are shortcuts designed for users who are not familiar with Linux text editors. You can also make the same changes with whichever editor you prefer (vi, nano, joe, etc.).
  • Text shown in white on a black background is a command or console output. Copy the command that follows the prompt "$" (normal user) or "#" (root, the superuser) and paste it into your console. For example:
    /home/DIR$ Copy_Command From To ...
    
    Copy the command Copy_Command From To ... and paste it into your console for execution. ("/home/DIR" stands for the current working directory.)
  • Text shown in black on a white background is file content, to be pasted into an editor (such as the GNOME editor on the GNOME desktop). For example:
    I am context.
    
    If you are familiar with editors such as vi, nano, or joe, you can paste this content at the appropriate position in the configuration file (although the commands on this page have already been simplified so that this is not required).
  • Login Information
    User Name: hadooper
    Group: hadooper
    Password:
  • The hadooper user has sudo permission and can therefore run commands as the superuser.

Step 1: Set up SSH key exchange

  • Hadoop uses SSH to communicate between machines, and the start-all.sh and stop-all.sh scripts both rely on it, so first set up key-based SSH login that does not ask for a password.
~$ ssh-keygen -t rsa -f ~/.ssh/id_rsa -P ""
~$ cp ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys

To confirm that the configuration is correct, log in with the following commands. The first login asks you to confirm the host key (answer yes); the second should log in directly without prompting for a password.

~$ ssh localhost
~$ exit
~$ ssh localhost
~$ exit
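
The cp command in this step replaces any existing ~/.ssh/authorized_keys file. The following is a minimal sketch of a non-destructive alternative, assuming you want to keep keys that are already authorized. The function name install_pubkey is hypothetical, the key text is a placeholder, and the demonstration runs in a scratch directory rather than your real ~/.ssh.

```shell
# Hypothetical helper: append a public key to an authorized_keys file only
# if it is not already listed, then tighten permissions as sshd expects.
install_pubkey() {
  pubkey_file=$1
  auth_file=$2
  mkdir -p "$(dirname "$auth_file")"
  touch "$auth_file"
  # -x: whole-line match, -F: fixed strings, -f: read patterns from the key file
  if ! grep -qxFf "$pubkey_file" "$auth_file"; then
    cat "$pubkey_file" >> "$auth_file"
  fi
  chmod 700 "$(dirname "$auth_file")"
  chmod 600 "$auth_file"
}

# Demonstration in a scratch directory (the key below is not a real key).
demo_dir=$(mktemp -d)
echo "ssh-rsa AAAAplaceholder hadooper@localhost" > "$demo_dir/id_rsa.pub"
install_pubkey "$demo_dir/id_rsa.pub" "$demo_dir/.ssh/authorized_keys"
install_pubkey "$demo_dir/id_rsa.pub" "$demo_dir/.ssh/authorized_keys"  # second call adds nothing
wc -l < "$demo_dir/.ssh/authorized_keys"
```

On the lab machine the equivalent call would be install_pubkey ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys.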

Step 2. Install Java

  • Hadoop is written in Java, so a Java runtime is required to run it. We install the JDK rather than only the JRE, because the compiler it provides is needed later to build Java programs. Ubuntu 9.04 installs gcj (java-gcj-compat) by default; we recommend removing that package and installing Sun Java(TM) Development Kit (JDK) 6.06 (package name sun-java6-jdk), the newest JDK package that Ubuntu 9.04 provides.
~$ sudo apt-get purge java-gcj-compat
~$ sudo apt-get install sun-java6-bin  sun-java6-jdk sun-java6-jre

Step 3: Download and install Hadoop

  • Download the Hadoop 0.20.2 package from the NCHC TWAREN mirror site and extract the archive into the /opt directory.
~$ cd /opt
/opt$ sudo wget http://ftp.twaren.net/Unix/Web/apache/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
/opt$ sudo tar zxvf hadoop-0.20.2.tar.gz
/opt$ sudo mv hadoop-0.20.2/ hadoop
/opt$ sudo chown -R hadooper:hadooper hadoop
/opt$ sudo mkdir /var/hadoop
/opt$ sudo chown -R hadooper:hadooper /var/hadoop
  • Now you are ready to start your Hadoop cluster in one of the three supported modes:
    • Local (Standalone) Mode
    • Pseudo-Distributed Mode
    • Fully-Distributed Mode
  • By default, Hadoop is configured to run in non-distributed mode, as a single Java process. This is useful for debugging. The following example copies the XML files from the unpacked conf directory to use as input, then finds and displays every match of the given regular expression. Output is written to the given output directory.
/opt$ cd hadoop
/opt/hadoop$ mkdir input
/opt/hadoop$ cp conf/*.xml input
/opt/hadoop$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
/opt/hadoop$ cat output/*
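
To get a feel for what the grep example computes, the same matching and counting can be sketched with ordinary shell tools. This is only a local approximation for comparison, not how Hadoop runs the job. The function name count_matches is hypothetical and the sample file content is made up for illustration.

```shell
# Approximate the example locally: extract every match of the regular
# expression, count identical matches, and sort by count (highest first).
count_matches() {
  pattern=$1
  shift
  # -o: print only the matched text, -h: no file names, -E: extended regex
  grep -ohE "$pattern" "$@" | sort | uniq -c | sort -k1,1nr -k2
}

# Made-up sample input standing in for the conf/*.xml files.
sample_dir=$(mktemp -d)
cat > "$sample_dir/sample.xml" << 'EOF'
<name>dfs.replication</name>
<name>dfs.name.dir</name>
<description>see dfs.replication</description>
EOF

count_matches 'dfs[a-z.]+' "$sample_dir"/*.xml
```

On the lab machine, count_matches 'dfs[a-z.]+' input/*.xml can be compared against the contents of the output directory.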

Step 4: Configure hadoop-env.sh

  • Change into the hadoop directory for further configuration. Four configuration files need to be modified. The first is hadoop-env.sh, in which we set three environment variables: JAVA_HOME, HADOOP_HOME, and HADOOP_CONF_DIR. Run the following commands:
/opt$ cd hadoop/
/opt/hadoop$ cat >> conf/hadoop-env.sh << EOF

Then paste the following settings after the last command.

export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_HOME=/opt/hadoop
export HADOOP_CONF_DIR=/opt/hadoop/conf
EOF
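
Before moving on, it can help to confirm that the JAVA_HOME path exported above really contains a Java launcher. The following is a small sketch; the function name check_java_home is hypothetical, and the demonstration uses a scratch directory with a stand-in launcher so that it runs anywhere.

```shell
# Hypothetical helper: succeed only if <dir>/bin/java exists and is executable.
check_java_home() {
  if [ -x "$1/bin/java" ]; then
    echo "ok: $1/bin/java"
  else
    echo "missing: $1/bin/java" >&2
    return 1
  fi
}

# Demonstration against a scratch directory with a stand-in launcher.
fake_home=$(mktemp -d)
mkdir -p "$fake_home/bin"
printf '#!/bin/sh\n' > "$fake_home/bin/java"
chmod +x "$fake_home/bin/java"

check_java_home "$fake_home"
check_java_home "$fake_home/nowhere" || echo "second path rejected as expected"
```

On the lab machine the call would be check_java_home /usr/lib/jvm/java-6-sun.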

Step 5: Configure *-site.xml

  • The remaining three configuration files are core-site.xml, hdfs-site.xml, and mapred-site.xml. Copy and paste the following command:
/opt/hadoop$ cat > conf/core-site.xml << EOF

Then paste the following settings after the last command.

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/hadoop/hadoop-\${user.name}</value>
  </property>
</configuration>
EOF
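
The backslash in \${user.name} matters because the here-document delimiter EOF is unquoted, so the shell would otherwise try to expand the dollar expression itself. The sketch below, written to a scratch directory, shows that escaping the dollar sign and quoting the delimiter give the same file content. The file names are chosen for illustration only.

```shell
work_dir=$(mktemp -d)

# Unquoted delimiter: the shell processes the body, so the dollar sign must
# be escaped to reach the file literally (this is the form used above).
cat > "$work_dir/escaped.xml" << EOF
<value>/var/hadoop/hadoop-\${user.name}</value>
EOF

# Quoted delimiter: the body is copied verbatim, so no backslash is needed.
cat > "$work_dir/quoted.xml" << 'EOF'
<value>/var/hadoop/hadoop-${user.name}</value>
EOF

# Both files should be byte-for-byte identical.
cmp "$work_dir/escaped.xml" "$work_dir/quoted.xml" && echo "identical"
cat "$work_dir/quoted.xml"
```

In both cases Hadoop, not the shell, sees ${user.name} and substitutes the name of the user running the daemon.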

/opt/hadoop$ cat > conf/hdfs-site.xml  << EOF

Paste the following settings after the last command.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
/opt/hadoop$ cat > conf/mapred-site.xml  << EOF

Paste the following settings after the last command.

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
EOF
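
After writing the three files, you may want to read the values back to catch paste errors. The following is a minimal sketch, assuming the one-tag-per-line layout used on this page; it is not a general XML parser. The function name get_prop is hypothetical, and the demonstration uses a scratch copy of the core-site.xml content.

```shell
# Hypothetical helper: print the <value> that follows a given <name>.
# Assumes <name> and <value> each sit on their own line, as in this lab.
get_prop() {
  awk -v want="$2" '
    /<name>/  { name = $0; sub(/.*<name>/, "", name); sub(/<\/name>.*/, "", name) }
    /<value>/ { val = $0; sub(/.*<value>/, "", val); sub(/<\/value>.*/, "", val)
                if (name == want) print val }
  ' "$1"
}

# Scratch copy of the core-site.xml content from this step.
conf_dir=$(mktemp -d)
cat > "$conf_dir/core-site.xml" << 'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/hadoop/hadoop-${user.name}</value>
  </property>
</configuration>
EOF

get_prop "$conf_dir/core-site.xml" fs.default.name
get_prop "$conf_dir/core-site.xml" hadoop.tmp.dir
```

On the lab machine, get_prop conf/mapred-site.xml mapred.job.tracker and get_prop conf/hdfs-site.xml dfs.replication check the other two files in the same way.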

Step 6: Format HDFS

  • We have now configured a single-node Hadoop in pseudo-distributed mode. Next, we start the Hadoop services. First, the namenode must be formatted.
/opt/hadoop$ bin/hadoop namenode -format

You should see output like this:

09/03/23 20:19:47 INFO dfs.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = /localhost
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20 -r 736250; compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
************************************************************/
09/03/23 20:19:47 INFO fs.FSNamesystem: fsOwner=hadooper,hadooper
09/03/23 20:19:47 INFO fs.FSNamesystem: supergroup=supergroup
09/03/23 20:19:47 INFO fs.FSNamesystem: isPermissionEnabled=true
09/03/23 20:19:47 INFO dfs.Storage: Image file of size 82 saved in 0 seconds.
09/03/23 20:19:47 INFO dfs.Storage: Storage directory /var/hadoop/hadoop-root/dfs/name has been successfully formatted.
09/03/23 20:19:47 INFO dfs.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at /localhost
************************************************************/
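
Formatting discards any existing HDFS metadata, so it can be worth checking whether the storage directory has already been formatted before running the command again. The sketch below rests on an assumption: that a formatted name directory contains a current/VERSION file. The function name is_formatted is hypothetical, and the demonstration uses a scratch directory instead of the real storage directory under /var/hadoop.

```shell
# Hypothetical helper. Assumption: a formatted name directory holds a
# current/VERSION file; an unformatted or empty one does not.
is_formatted() {
  [ -f "$1/current/VERSION" ]
}

# Demonstration with a scratch directory standing in for .../dfs/name.
name_dir=$(mktemp -d)
if is_formatted "$name_dir"; then
  echo "already formatted: skip namenode -format"
else
  echo "not formatted: safe to run namenode -format"
fi
```

On the lab machine the directory to check would be the one named in the "Storage directory ... has been successfully formatted" line of your own output.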

Step 7: Start Hadoop

  • After formatting the namenode, use start-all.sh to start all services: namenode, secondary namenode, datanode, jobtracker, and tasktracker.
/opt/hadoop$ bin/start-all.sh

You should see output like this:

starting namenode, logging to /opt/hadoop/logs/hadoop-hadooper-namenode-pc218.out
localhost: starting datanode, logging to /opt/hadoop/logs/hadoop-hadooper-datanode-pc218.out
localhost: starting secondarynamenode, logging to /opt/hadoop/logs/hadoop-hadooper-secondarynamenode-pc218.out
starting jobtracker, logging to /opt/hadoop/logs/hadoop-hadooper-jobtracker-pc218.out
localhost: starting tasktracker, logging to /opt/hadoop/logs/hadoop-hadooper-tasktracker-pc218.out

Step 8: Complete! Check the status of Hadoop



DEBUG: Use jps to check running Java processes

  • It is sometimes useful to run the jps command to see which Java processes are currently running on the system.
    /opt/hadoop$ jps
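
The jps output can also be checked mechanically against the five services started in Step 7. The following is a sketch, assuming jps prints one "PID ClassName" pair per line. The function name missing_daemons is hypothetical, and the sample output (including the process IDs) is made up so that the example runs without a live cluster.

```shell
# Hypothetical helper: read jps output on stdin and print the name of
# every expected Hadoop daemon that does not appear in it.
missing_daemons() {
  jps_output=$(cat)
  for daemon in NameNode SecondaryNameNode DataNode JobTracker TaskTracker; do
    # Compare the second column exactly, so NameNode does not match
    # SecondaryNameNode.
    if ! printf '%s\n' "$jps_output" |
        awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
      echo "$daemon"
    fi
  done
}

# Made-up sample in which the tasktracker failed to start.
sample='12001 NameNode
12102 DataNode
12230 SecondaryNameNode
12311 JobTracker
12555 Jps'

printf '%s\n' "$sample" | missing_daemons
```

On the lab machine, run jps | missing_daemons; no output means all five daemons are running, and any name printed points to the log file to inspect under /opt/hadoop/logs.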