Configure Hadoop for a multi-node cluster


In my previous post, I described how to configure and use Hadoop in pseudo-distributed mode.
I hope that helped you with general testing of how Hadoop works on a single node.
Since Hadoop’s main advantage lies in its ability to distribute datasets across a cluster via HDFS (the Hadoop Distributed File System), in this post we’ll configure and use a multi-node Hadoop cluster.
Pre-requisites:
1. Linux machines (three for our example: 1 master and 2 slaves); the count of machines can be increased according to your requirements.
2. A Hadoop instance (stable release, version 1.2.1)
3. JDK version 1.6 or higher installed on each individual Linux machine, with the JAVA_HOME path set in the .bash_profile (we’ll cover this one by one).
Now, let’s get practical.
If you don’t have multiple Linux machines, don’t worry: create your own.
For this, you need to download:

  • VMWare Player-5.0.1
  • an image file for Ubuntu-13.04-desktop

  • Now, create 3 virtual machines using VMware Player and the Ubuntu-13.04 image file. So that each machine is easily recognizable, name the virtual machines Master, Slave01, and Slave02.
    Once the installation is complete, restart the virtual machines.
    Now, you are ready to configure.
    Check out these steps:
    Open a terminal on each machine and do the following:
    xyz@ubuntu:~$

  • Check whether ssh is installed

  • xyz@ubuntu:~$ ssh localhost
    If it is not installed, install it by running the command:
    xyz@ubuntu:~$ sudo apt-get install ssh
    Now, you can ssh to localhost (this is not password-less yet)
    [we'll see later how to make ssh password-less]

  • Now, check whether JDK is installed

  • Run the command:
    xyz@ubuntu:~$ javac
    If installed:
    Usage: javac <options> <source files>
    where possible options include:
      -g                         Generate all debugging info
      -g:none                    Generate no debugging info
      -g:{lines,vars,source}     Generate only some debugging info
      -nowarn                    Generate no warnings
      -verbose                   Output messages about what the compiler is doing
      -deprecation               Output source locations where deprecated APIs are used
    If not installed:
    The program 'javac' can be found in the following packages:
     * default-jdk
     * ecj
     * gcj-4.6-jdk
     * gcj-4.7-jdk
     * openjdk-7-jdk
     * openjdk-6-jdk
    Try: sudo apt-get install <selected package>
    To install the JDK, run the command:
    xyz@ubuntu:~$ sudo apt-get install openjdk-7-jdk
    After this, OpenJDK 7 will be installed on your system.

  • Now, you need to set the JAVA_HOME

  • Open the .bash_profile as:
    xyz@ubuntu:~$ vi ~/.bash_profile
    Now, add these two lines (if you have installed openjdk-7-jdk)
    [else, your Java may be installed at a different path; use that path instead]
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386/jre/
    export PATH=$PATH:/usr/lib/jvm/java-7-openjdk-i386/jre/bin

  • Now, change the hostname of each machine (so that they are easily recognizable) as follows:

  • On the master machine:
    nabin@ubuntu:~$ sudo vi /etc/hostname
    Change ubuntu to master
    On the 1st slave machine:
    nabin@ubuntu:~$ sudo vi /etc/hostname
    Change ubuntu to slave01
    On the 2nd slave machine:
    nabin@ubuntu:~$ sudo vi /etc/hostname
    Change ubuntu to slave02
    (/etc/hostname is only read at boot, so reboot each machine, or run sudo hostname <new-name>, for the change to take effect.)

  • Now, check the IP configuration of each machine

  • nabin@ubuntu:~$ ifconfig
    Add these IP addresses to /etc/hosts on each machine, so that the machines can recognize each other over the network.
    For example:
    xxx.xxx.xxx.xxx master
    xxx.xxx.xxx.xxx slave01
    xxx.xxx.xxx.xxx slave02
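To make that concrete, suppose ifconfig reported addresses in a private 192.168.1.x range (the three addresses below are purely illustrative; use whatever your machines actually report). /etc/hosts on every machine would then contain:

```
192.168.1.10 master
192.168.1.11 slave01
192.168.1.12 slave02
```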

  • Now, ssh to the slave machines from the master machine using the command:

  • xyz@master:~$ ssh slave0x
    The master should be able to ssh to slave01 and slave02.
    To make ssh password-less, use the following commands:
    xyz@master:~$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
    Generating public/private rsa key pair.
    Your identification has been saved in /home/nabin/.ssh/id_rsa.
    Your public key has been saved in /home/nabin/.ssh/id_rsa.pub.
    The key fingerprint is:
    23:b2:09:4c:7e:7d:2b:b0:cc:a4:0c:1c:ef:77:76:d9 nabin@master
    The key's randomart image is:
    +--[ RSA 2048]----+
    |                 |
    |                 |
    | ..              |
    |.+o  .           |
    |..+.= o S        |
    | o.B * o oo      |
    |  o.*..o.o E     |
    |    . o..        |
    |                 |
    +-----------------+
    After the public and private keys are generated, run the following command:
    xyz@master:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    Now, run the following command for each slave from the master:
    xyz@master:~$ cat ~/.ssh/id_rsa.pub | ssh xyz@slave0x "cat >> .ssh/authorized_keys"
    Now, you are done with all the system configuration.
    Let's configure a few files of Hadoop-1.2.1 and you are done:
    You can edit them in vi, or use WinSCP to open and edit each file. Your choice..!
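If you'd like to see what the key-generation and append steps above actually do before touching your real ~/.ssh, here is a throwaway dry-run that keeps everything under /tmp (the directory name is arbitrary):

```shell
# Generate a demo key pair and append the public half to a demo
# authorized_keys file, exactly as the real commands do inside ~/.ssh
rm -rf /tmp/ssh-demo
mkdir -p /tmp/ssh-demo
ssh-keygen -t rsa -P '' -f /tmp/ssh-demo/id_rsa -q
cat /tmp/ssh-demo/id_rsa.pub >> /tmp/ssh-demo/authorized_keys
# Each appended line starts with the key type:
grep -c '^ssh-rsa' /tmp/ssh-demo/authorized_keys
```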

  • Put the hadoop-1.2.1 folder (the Hadoop instance) into the /home/xyz/ folder of each virtual machine.

    1. Open hadoop-1.2.1/conf/ on each machine:
    In conf/, you need to change:

  • core-site.xml
  • hdfs-site.xml
  • mapred-site.xml
  • hadoop-env.sh
  • masters
  • slaves

  • In core-site.xml:
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://master:54310</value>
      </property>
    </configuration>
    In hdfs-site.xml:
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>2</value>
      </property>
    </configuration>
    In mapred-site.xml:
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>hdfs://master:54311</value>
      </property>
    </configuration>
    In hadoop-env.sh set JAVA_HOME:
    export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386/jre
    In masters, add:
    xyz@master
    In slaves, add:
    xyz@slave01
    xyz@slave02
    Now you are done with the configuration of the multi-node Hadoop cluster.
    To start it, you first need to format the namenode of the HDFS location.
    Run the following commands to format HDFS and start the namenode, jobtracker, and secondary namenode on the master, and the datanode and tasktracker on the slaves.
    xyz@master:~$ cd hadoop-1.2.1
    xyz@master:~/hadoop-1.2.1$ bin/hadoop namenode -format
    If permission is denied, run
    xyz@master:~/hadoop-1.2.1$ chmod 755 bin/*
    and then re-run the above command.
    Now start the Hadoop daemons as:
    xyz@master:~/hadoop-1.2.1$ bin/start-all.sh
    Check the running processes using the jps command on each machine: the master should list NameNode, SecondaryNameNode, and JobTracker, and each slave should list DataNode and TaskTracker.
    Your configuration for the multi-node cluster is now complete. You can run your example..!!
    For any queries, please post a comment. I would love to reply to your queries.
