Configure Hadoop for multi-node clusters

December 18, 2014

Configure Hadoop for multi-node clusters

In my previous post, I described how to configure and use Hadoop on pseudo-distributed mode.
Hope that would help you for general testing of “How Hadoop works?” on single node.
Since, Hadoop’s main advantage lies with its capability to distribute datasets across clusters known asHDFS(Hadoop Distributed File System), so in this post we’ll use and configure multi-node clusters for Hadoop.

Pre-requisites:
1. Linux machine (for our example 3(three), that is: 1 master and 2 slaves); count of machines can be increased according to the requirements.
2. Hadoop instance (stable release, version 1.2.1)
3. JDK version 1.6 or higher must be installed on the individual linux machine, and set the JAVA_HOME path in the bash_profile(we’ll cover this one by one).

Now, lets do some practical
If you don’t have mutiple linux machines(don’t worry), create your own.
For this, you need to download:

VMWare Player-5.0.1

an image file for Ubuntu-13.04-desktop

Now, create 3 virtual machines using VMWare Player and Ubuntu-13.04 image file. For identical recognition of each machine, name the virtual machine as Master, Slave01, Slave02.

Once the installation is completed, restart the virtual machines.

Now, you are ready to configure.

Check out these steps:
Open terminal on each machine and do the followings:

xyz@ubuntu:~$

check whether ssh is installed

xyz@ubuntu:~$ssh localhost

If it is not installed, install it by running the command:

xyz@ubuntu:~$sudo apt-get install ssh

Now, you can ssh to localhost (this is not password-less)
[we'll see later, how to make it password-less ssh]

Now, check whether JDK is installed

Run the command as:

xyz@ubuntu:~$javac

If installed:

Usage: javac <options> <source files>

where possible options include:

  -g                         Generate all debugging info

  -g:none                    Generate no debugging info

  -g:{lines,vars,source}     Generate only some debugging info

  -nowarn                    Generate no warnings

  -verbose                   Output messages about what the compiler is doing

  -deprecation               Output source locations where deprecated APIs are used

If not installed:

The program 'javac' can be found in the following packages:

 * default-jdk

 * ecj

 * gcj-4.6-jdk

 * gcj-4.7-jdk

 * openjdk-7-jdk

 * openjdk-6-jdk

Try: sudo apt-get install <selected package>

For installing JDK run the command:

xyz@ubuntu:~$sudo apt-get install openjdk-7-jdk

After this, open JDK-7 will be installed on your system.

Now, you need to set the JAVA_HOME

Open the bash_profile as:

xyz@ubuntu:~$sudo vi ~/.bash-profile

Now, write these two lines (if you have installed openjdk-7-jdk)
[Else, you may have different path for java, define those path]

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386/jre/

export PATH=$PATH:/usr/lib/jvm/java-7-openjdk-i386/jre/bin

Now, change the hostname of each machine (so that they are easily recognizable) as following:

On master machine:

nabin@ubuntu:~$ sudo vi /etc/hostname

Change ubuntu to master

On 1st slave machine:

nabin@ubuntu:~$ sudo vi /etc/hostname

Change ubuntu to slave01

On 2nd slave machine:

nabin@ubuntu:~$ sudo vi /etc/hostname

Change ubuntu to slave02

Now, check the ip configuration of each machine

nabin@ubuntu:~$ ifconfig

Add this ip addresses into /etc/hosts of each machine, so that each machine can recognize each other over network.
for example:
xxx.xxx.xxx.xxx master
xxx.xxx.xxx.xxx slave01
xxx.xxx.xxx.xxx slave02

Now ssh to slave machines from master machine using command:

xyz@master:~$ ssh slave0x

Master should be able to ssh to slave01, slave02
To make it password-less ssh, use following commands:

xyz@master:~$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

Generating public/private rsa key pair.

Your identification has been saved in /home/nabin/.ssh/id_rsa.

Your public key has been saved in /home/nabin/.ssh/id_rsa.pub.

The key fingerprint is:

23:b2:09:4c:7e:7d:2b:b0:cc:a4:0c:1c:ef:77:76:d9 nabin@master

The key's randomart image is:

+--[ RSA 2048]----+

|                 |

|                 |

| ..              |

|.+o  .           |

|..+.= o S        |

| o.B * o oo      |

|  o.*..o.o E     |

|    . o..        |

|                 |

+-----------------+

After generation of public and private keys, run the following command:

xyz@master:~$cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Now, run the following command for each slaves from master:

xyz@master:~$cat ~/.ssh/id_rsa.pub | ssh xyz@slave0x "cat >> .ssh/authorized_keys"

Now, you are done with all your system configurations.

Lets configure some files of Hadoop-1.2.1 and you are done:
You can edit this using vi mode or you can use WinSCP to open and edit each file. Your choice..!

Put the Hadoop-1.2.1 folder (Hadoop instance) into the home/xyz/ folder of each virtual machines.

hadoop-1.2.1/conf/ on each machine:

In conf/, you need to change:

core-site.xml

hdfs-site.xml

mapred-site.xml

hadoop-env.sh

masters

slaves

In core-site.xml:

1

2

3

4

5

6

<configuration>

<property>

<name>fs.default.name</name>

<value>hdfs://master:54310</value>

</property>

</configuration>

In hdfs-site.xml:

<configuration>

<property>

<name>dfs.replication</name>

<value>2</value>

</property>

</configuration>

In mapred-site.xml:

1

2

3

4

5

6

<configuration>

<property>

<name>mapred.job.tracker</name>

<value>hdfs://master:54311</value>

</property>

</configuration>

In hadoop-env.sh set JAVA_HOME:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-i386/jre
In masters, add:
xyz@master
In slaves, add:
xyz@slave01
xyz@slave02

Now you are done with the configuration of multi-node clusters of hadoop.
To start, you need to format the namenode of the HDFS location.
Run the following commands to foramt and start namenode,jobtracker,secondary namenode on master anddatanode,tasktracker on slaves.

xyz@master:~$ cd hadoop-1.2.1

xyz@master:~/hadoop-1.2.1$ bin/hadoop namenode -format

If permission is denied, then run

xyz@master:~/hadoop-1.2.1$ chmod 755 bin/*

and then run the above command.

Now start hadoop daemons as:

xyz@master:~/hadoop-1.2.1$ bin/start-all.sh

Check the running processes using command jps on each machine.

Your configuration for multi-node cluster is complete now. You can run your example..!!

For any queries, please post. I would love to reply to your queries.

Search This Blog

Solutions For All

Configure Hadoop for multi-node clusters

Comments

Post a Comment

Popular Posts

TNSPING is not recognised as an internal or external command

Oracle Temp Tablespace Becomes Too Large – Resize the Temp Tablespace