Configure and use Map-Reduce and HDFS in pseudo-distributed mode


This note describes how to set up and configure Hadoop to run as a single pseudo-distributed node.
Supported Platforms:
1. Linux is supported as a development and production platform.
2. Win32 is supported as a development platform.
Here, I am using Linux for the demonstration.
Required Software:
1. Java 1.6.x, preferably from Sun, must be installed.
2. ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons.
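You can quickly verify both prerequisites from a shell; java -version prints the installed Java version, and pgrep reports the PID of a running sshd (this assumes a typical Linux install where the OpenSSH daemon process is named sshd):
$ java -version
$ pgrep sshd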
Download
To get a Hadoop distribution, download a recent stable release from one of the Apache Download Mirrors.
Prepare to Start the Hadoop Cluster
Unpack the downloaded Hadoop distribution. In the distribution, edit the file conf/hadoop-env.sh to define at least JAVA_HOME to be the root of your Java installation.
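For example, if the Sun JDK is installed under /usr/lib/jvm/java-6-sun (a path that varies by system; adjust it to your own installation), the line in conf/hadoop-env.sh would be:
export JAVA_HOME=/usr/lib/jvm/java-6-sun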
Try the following command:
$ bin/hadoop
This will display the usage documentation for the hadoop script.
Pseudo-Distributed Operation
Hadoop can be run on a single node in a pseudo-distributed mode where each Hadoop daemon (the NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker) runs in a separate Java process.
For this we have to configure a few files:
1. conf/core-site.xml
2. conf/hdfs-site.xml
3. conf/mapred-site.xml
In the file conf/core-site.xml add the following; fs.default.name is the URI of the default filesystem, here an HDFS NameNode on localhost:
<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:9000</value>
     </property>
</configuration>
In the file conf/hdfs-site.xml add the following; the replication factor is set to 1 because a single-node cluster has only one DataNode:
<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
</configuration>
In the file conf/mapred-site.xml add the following; mapred.job.tracker is the host and port of the JobTracker:
<configuration>
     <property>
         <name>mapred.job.tracker</name>
         <value>localhost:9001</value>
     </property>
</configuration>
Now check that you can ssh to localhost without a passphrase:
$ ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Now format the distributed filesystem:
$ bin/hadoop namenode -format
Now start the Hadoop daemons:
$ bin/start-all.sh
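To verify that all five daemons came up, you can use the jps tool that ships with the JDK; the process IDs below are only illustrative:
$ jps
4825 NameNode
4934 DataNode
5043 SecondaryNameNode
5127 JobTracker
5238 TaskTracker
5301 Jps
You can also browse the NameNode web interface at http://localhost:50070/ and the JobTracker web interface at http://localhost:50030/.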
To see MapReduce in action, run the bundled pi-estimation example:
$ bin/hadoop jar hadoop-examples-*.jar pi 10 100000
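To use HDFS directly, copy some files into the distributed filesystem and read them back; here is a minimal sketch (the input directory name is arbitrary):
$ bin/hadoop fs -mkdir input
$ bin/hadoop fs -put conf/*.xml input
$ bin/hadoop fs -ls input
$ bin/hadoop fs -cat input/core-site.xml
When you are done, stop the daemons with:
$ bin/stop-all.sh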
Post your comments and recommendations.
