Taming the Hadoop

I am presently working on the Edge Node File System, a massively scalable distributed file system developed by Prof. Janakiram and Mr. Kovendhan Ponnavaiko of IIT Madras. My part is a Java implementation of the file system.

We are trying to compare the performance of our file system with other popular distributed file systems. Currently the hottest DFS is Hadoop ( http://hadoop.apache.org/core/ ), so I had to get a Hadoop installation up and running. It took me a long while to sort out all the problems and get a fully running cluster, so I am recording the steps I followed; it might spare someone a few hours of headache. However, this is not a complete guide. Use it along with:

  1. http://hadoop.apache.org/core/docs/r0.20.0/cluster_setup.html
  2. http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
  3. http://public.yahoo.com/gogate/hadoop-tutorial/start-tutorial.html

1. After downloading the latest release of Hadoop, unpack it to a directory. Hadoop needs to be installed at the same path on every machine in the cluster, so choose a directory like /home/HadoopUser.
2. Hadoop has a namenode, a master which handles all the metadata. There is another master, the jobtracker, which handles computations. The two may or may not be on the same machine; I used the same machine for both.
3. Go to the conf directory.
a. Edit the JAVA_HOME attribute in hadoop-env.sh to point to your installed JDK (version 1.5 or above). Again, ensure that this path is valid across all machines, i.e. every machine should have the JDK installed at the same path. Also specify the absolute path; a relative path didn't parse properly for me.
b. Write the IP of the master(s) in the masters file. The IP alone is enough; don't specify the username or the namenode port.
c. In the slaves file, write the IPs of all slave machines (again, IPs only).
d. For version 0.20.0 and above, the conf directory contains the following three files: core-site.xml, hdfs-site.xml and mapred-site.xml. Open HADOOP_HOME/src/core/core-default.xml, HADOOP_HOME/src/hdfs/hdfs-default.xml and HADOOP_HOME/src/mapred/mapred-default.xml. These are the default configuration files, which should be overridden by the three files in the conf directory.

  • Copy the property fs.default.name from core-default.xml and paste it within the configuration tags in core-site.xml. Now edit the value to hdfs://<your IP>, e.g. hdfs://10.*.*.*. hdfs is the default URI scheme used by the DFS. (Setting this to file:/// makes Hadoop read the machine's local file system instead.)
  • Copy the property mapred.job.tracker to mapred-site.xml and set the value to hdfs://<your IP>:<port>, e.g. hdfs://10.*.*.*:50040. The Hadoop site says this property should contain a host:port pair, but including the URI scheme was essential for me.

Optional: You can set dfs.replication to any value you want in hdfs-site.xml

For versions before 0.20.0, the conf directory has two files, hadoop-default.xml and hadoop-site.xml. Make the changes mentioned above in hadoop-site.xml.
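Putting step 3(d) together, the three site files might look something like this. This is a sketch: the IP 10.0.0.1, the port 50040 and the replication factor 2 are placeholders, not values from my cluster.

```xml
<!-- core-site.xml : the default file system URI (placeholder master IP) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://10.0.0.1</value>
  </property>
</configuration>

<!-- mapred-site.xml : the jobtracker address, with the URI scheme included -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hdfs://10.0.0.1:50040</value>
  </property>
</configuration>

<!-- hdfs-site.xml : optional replication factor -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```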

4. The master uses ssh to communicate with the slaves, so it needs passphrase-less login. To set this up, generate a passphrase-less ssh key:

ssh-keygen -t rsa -P ""

You will get a prompt asking whether to save it to .ssh/id_rsa; accept the default. Now append the public key to .ssh/authorized_keys:

cat .ssh/id_rsa.pub >> .ssh/authorized_keys
Now try ssh HadoopUser@localhost. You should be able to log in without a password prompt.
Now this ssh key should be copied to the authorized_keys file of every slave. I did this:
for entry in `cat ./hosts`; do
    ssh $entry;
done

The file hosts should contain the username@ip of every slave, one per line, e.g. HadoopUser@<slave ip>. Executing the above script opens a shell on each slave in turn. Do this in each shell:
scp HadoopUser@master:~/.ssh/id_rsa.pub ./
mkdir -p .ssh
cat id_rsa.pub >> .ssh/authorized_keys
Now try the ssh login script from the master again. You should be logging in to each slave without any password.
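The per-slave shell dance above works, but where ssh-copy-id is available the whole key distribution can be done from the master in one pass. A sketch, assuming ./hosts lists one username@ip per line; the COPY override is only there so the loop can be dry-run:

```shell
# Push the master's public key to every slave listed in a hosts file.
# ssh-copy-id appends the key to the slave's ~/.ssh/authorized_keys,
# creating the file and directory if needed.
push_key() {
  hosts_file=$1
  while read -r host; do
    # COPY defaults to ssh-copy-id; override (e.g. COPY=echo) to dry-run.
    ${COPY:-ssh-copy-id} -i "$HOME/.ssh/id_rsa.pub" "$host"
  done < "$hosts_file"
}
```

Run `push_key ./hosts`; you are asked for each slave's password once, and never again afterwards.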
Keep a multicopy.sh ready with you always for copying files from the master to the slaves:
for entry in `cat ./hosts`; do
    scp ./file $entry:~/ ;
done

Note: HADOOP_HOME/bin contains a script slaves.sh which runs the shell command given as its argument on all slaves. Though this works with other commands, I had problems with scp.
5. On the namenode, go to the Hadoop home directory and format the namenode by running bin/hadoop namenode -format. You should get a message with the format details. Now run bin/start-dfs.sh. Go to the jobtracker node and run bin/start-mapred.sh; if both are on the same machine you can just run bin/start-all.sh. Your installation is now up. Check this by visiting localhost:50070, the web interface of your Hadoop cluster, and look at the number of live nodes. If it is zero, you have a problem with the configuration. If only the local datanode is up, connectivity is the problem; check for firewalls and other common causes. It is common for some datanodes to be dead; check their logs to identify the problem.
If everything is fine, you can go ahead and run Hadoop commands on different nodes and check the consistency of the file system:
bin/hadoop fs -mkdir test
bin/hadoop fs -ls
bin/hadoop fs -put <src> <dst>

Run a few MapReduce examples to test the jobtracker:
bin/hadoop jar hadoop-*-examples.jar pi 10 1000
Check the Hadoop site for more examples and running details.
When done, run bin/stop-all.sh.
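The checks above can be wrapped into a small script and rerun whenever the cluster is restarted. A sketch; `smoke_test` is a name I made up, and the HADOOP override exists only so the function can be dry-run outside a cluster:

```shell
# Minimal consistency check: make a DFS directory, list it, run the pi example.
# HADOOP defaults to bin/hadoop; override (e.g. HADOOP=echo) to dry-run.
smoke_test() {
  h=${HADOOP:-bin/hadoop}
  "$h" fs -mkdir test &&
  "$h" fs -ls &&
  "$h" jar hadoop-*-examples.jar pi 10 1000
}
```

Run `smoke_test` from HADOOP_HOME once the cluster is up; a non-zero exit from any step means something is wrong.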

6. You can add more slaves to an existing installation just by updating the slaves file on the master. Copy the Hadoop installation to the slave using the script mentioned above, then stop and start the DFS again. A new datanode should now be visible at localhost:50070.
If you decide that the system is inconsistent and needs to be started all over again, you can stop it and do a namenode format, but this will erase all DFS data. Also remember to erase the previous DFS data, stored by default as /tmp/hadoopUser*: do rm -rf /tmp/hadoopUser* on all slaves.
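The cleanup on all slaves can be scripted the same way as multicopy.sh. A sketch, assuming the hosts file from step 4 and the default /tmp/hadoopUser* data location (check your dfs.data.dir setting if you changed it):

```shell
# Wipe stale DFS data on every slave listed in a hosts file, before a
# fresh namenode format. RUN defaults to ssh so each slave is cleaned
# remotely; override (e.g. RUN=echo) to dry-run.
clean_slaves() {
  hosts_file=$1
  while read -r host; do
    ${RUN:-ssh} "$host" "rm -rf /tmp/hadoopUser*"
  done < "$hosts_file"
}
```

Run `clean_slaves ./hosts` after stopping the cluster and before bin/hadoop namenode -format.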

ECLIPSE MapReduce Plugin:

Eclipse has a plugin for Hadoop which simplifies its management. Copy the plugin HADOOP_HOME/contrib/eclipse-plugin/hadoop-0.20.0-eclipse-plugin.jar into the plugins directory of your Eclipse installation. Now go to Window/Open Perspective/Other and select Map/Reduce. You will see a new MapReduce console at the bottom. Click on the elephant symbol at the right end to create a new DFS location, and in the box that opens specify the configuration parameters you set in the conf files. Now go to Window/Show View/Other and click on MapReduce/MapReduce Locations. You get a DFS Locations sidebar: this is the DFS directory tree of the Hadoop installation. Right-click in it for options to create a new directory or file and to upload or download files and directories.

Simulating Load:

Hadoop also bundles an excellent load simulator, but for some reason only the source is provided, inside src/test. The files are under org/apache/hadoop/fs/loadGenerator, which is also the package path. So create a new Eclipse project with a package org.apache.hadoop.fs.loadGenerator and copy in the three files StructureGenerator.java, DataGenerator.java and LoadGenerator.java. (TestLoadGenerator uses a class MiniDFSCluster which is found nowhere.) Create another driver class, TestDriver; this should be the main class of the project. To write a driver class, refer to ExampleDriver.java inside the src/examples folder. Now create a new run configuration with this TestDriver, and then create a runnable jar with it as the run configuration. Now we are ready. Copy this jar to your Hadoop home as, say, loadtest.jar, and call bin/hadoop jar loadtest.jar structuregenerator [args] (if structuregenerator was the string you mapped to StructureGenerator.class in the driver). This generates a directory and file structure. Then call DataGenerator and finally LoadGenerator. For the various configuration parameters, refer to http://hadoop.apache.org/core/docs/r0.20.0/SLG_user_guide.html
