Hadoop Installation Tutorial
Hadoop mainly consists of two parts: Hadoop MapReduce and HDFS. Hadoop MapReduce is a programming model and software framework for writing applications, which is an open-source variant of MapReduce that is initially designed and implemented by Google for processing and generating large data sets [1]. HDFS is Hadoop’s underlying data persistency layer, which is loosely modelled after Google file system GFS [2]. Hadoop has seen active development activities and increasing adoption. Many cloud computing services, such as Amazon EC2, provide MapReduce functions, and the research community uses MapReduce and Hadoop to solve data-intensive problems in bioinformatics, computational finance, chemistry, and environmental science [3]. Although MapReduce has its limitations [3], it is an important framework to process large data sets.
How to set up a Hadoop environment in a cluster is introduced in this tutorial. In this tutorial, we set up a Hadoop cluster, one node runs as the NameNode, one node runs as the JobTracker and many nodes runs as the TaskTracker (slaves).
First we assume we have created a Linux user “hadoop” on each nodes that we use and the “hadoop” user’s home directory is “/home/hadoop/”.
Enable “hadoop” user to password-less SSH login to slaves
Just for our convenience, make sure the “hadoop” user from NameNode and JobTracker can ssh to the slaves without password so that we need not to input the password every time.
Details about password-less SSH login can be found Enabling Password-less ssh Login.
Install softwared needed by Hadoop
Java JDK:
Java JDK can be downloaded form: http://java.sun.com/. Then we can install (actually copy the jdk directory) Java JDK on all nodes of the Hadoop cluster.
As an example in this tutorial, the JDK is installed into
/home/hadoop/jdk1.6.0_24
I provide a simple bash script to duplicate the JDK directory to all nodes:
$ for i in `cat nodes`; do scp -rq /home/hadoop/jdk1.6.0_24 hadoop@$i:/home/hadoop/; done;
‘nodes’ is a file that contains all the nodes IPs or host names, one in one line.
Hadoop
Hadoop softwar can be downloaded from here. In this tutorial, we use Hadoop 0.20.203.0.
Then we can install Hadoop on all nodes of the Hadoop cluster.
We can directly unpack it to a directory. In this example, we store it in
/home/hadoop/hadoop/
which is a directory under the hadoop Linux user’s home directory.
The hadoop directory can also be duplicated to all nodes using the script above.
Configure environment variables of “hadoop” user
We assume the “hadoop” user use bash as its shell.
Add these two lines at the bottom of ~/.bashrc on all nodes:
export HADOOP_COMMON_HOME="/home/hadoop/hadoop/" export PATH=$HADOOP_COMMON_HOME/bin/:$PATH
The HADOOP_COMMON_HOME environment variable is used by Hadoop’s utility scripts, and it must be set, otherwise the scripts may report an error message “Hadoop common not found”.
The second line adds hadoop’s bin directory to the PATH sothat we can directly run hadoop’s commands without specifying the full path to it.
Configure Hadoop
conf/hadoop-env.h
Add or change these lines to specify the JAVA_HOME and directory to store the logs:
export JAVA_HOME=/home/hadoop/jdk1.6.0_24 export HADOOP_LOG_DIR=/home/hadoop/data/logs
conf/core-site.xml
Here the NameNode runs on 10.1.1.30.
<configuration> <property> <name>fs.default.name</name> <value>hdfs://10.1.1.30:9000</value> </property> </configuration>
conf/hdfs-site.xml
<configuration> <property> <name>dfs.replication</name> <value>3</value> </property> <property> <name>dfs.name.dir</name> <value>/lhome/hadoop/data/dfs/name/</value> </property> <property> <name>dfs.data.dir</name> <value>/lhome/hadoop/data/dfs/data/</value> </property> </configuration>
dfs.replication is the number of replicas of each block. dfs.name.dir is the path on the local filesystem where the NameNode stores the namespace and transactions logs persistently. dfs.data.dir is comma-separated list of paths on the local filesystem of a DataNode where it stores its blocks.
conf/mapred-site.xml
Here the JobTracker runs on 10.1.1.2.
<configuration> <property> <name>mapred.job.tracker</name> <value>10.1.1.2:9001</value> </property> <property> <name>mapred.system.dir</name> <value>/hadoop/data/mapred/system/</value> </property> <property> <name>mapred.local.dir</name> <value>/lhome/hadoop/data/mapred/local/</value> </property> </configuration>
mapreduce.jobtracker.address is host or IP and port of JobTracker. mapreduce.jobtracker.system.dir is the path on the HDFS where where the Map/Reduce framework stores system files. mapreduce.cluster.local.dir is comma-separated list of paths on the local filesystem where temporary MapReduce data is written.
conf/slaves
Delete localhost and add all the names of the TaskTrackers, each in on line. For example:
jobtrackname1 jobtrackname2 jobtrackname3 jobtrackname4 jobtrackname5 jobtrackname6
Duplicate Hadoop configuration files to all nodes
We may duplicate the configuration files under conf diretory to all nodes. The script mentioned above can be used.
By now, we have finished copying Hadoop softwares and configuring the Hadoop. Now let’s have some fun with Hadoop.
Start Hadoop
We need to start both the HDFS and MapReduce to start Hadoop.
Format a new HDFS
On NameNode (10.1.1.30):
$ hadoop namenode -format
Remember to delete HDFS’s local files on all nodes before re-formating it:
$ rm /home/hadoop/data /tmp/hadoop-hadoop -rf
Start HDFS
On NameNode (10.1.1.30):
$ start-dfs.sh
Check the HDFS status:
On NameNode (10.1.1.30):
$ hadoop dfsadmin -report
There may be less nodes listed in the report than we actually have. We can try it again.
Start mapred:
On JobTracker (10.1.1.2):
$ start-mapred.sh
Check job status:
$ hadoop job -list
Run Hadoop jobs
A simple example
We run a simple example built in Hadoop’s distribution. For easy-to-run and more larger tests, please consider the Simple Sort Benchmark on Hadoop.
Copy the input files into the distributed filesystem:
$ hadoop fs -put /home/hadoop/hadoop/conf input
Run some of the examples:
$ hadoop jar /home/hadoop/hadoop/hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
Examine the output files:
Copy the output files from the distributed filesystem to the local
filesytem and examine them:
$ hadoop fs -get output output $ cat output/*
or
View the output files on the distributed filesystem:
$ hadoop fs -cat output/*
Shut down Hadoop cluster
We can stop Hadoop when we no long use it.
Stop HDFS on NameNode (10.1.1.30):
$ stop-dfs.sh
Stop JobTracker and TaskTrackers on JobTracker (10.1.1.2):
$ stop-mapred.sh
Some possible problems
Firewall blocks connections
Configure iptables: We can configure iptables to allow all connections, if these nodes are in a secure local area network which is most of the situation, by this command on all nodes:
# iptables -F # service iptables save
For a list of the default ports used by Hadoop, please refer to: Hadoop Default Ports.
Pitfalls and Lessons
References
[1] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters.” in the 6th Conference on Symposium on Operating Systems Design & Implementation, vol. 6, San Francisco, CA, 2004, pp. 137–150.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google filesystem,” in Proc. of the 9th ACM Symposium on Operating Systems Principles (SOSP’03), 2003, pp. 29–43.
[3] Z. Ma and L. Gu. The limitation of MapReduce: A probing case and a lightweight solution. In CLOUD COMPUTING 2010: Proc. of the 1st Intl. Conf. on Cloud Computing, GRIDs, and Virtualization, pages 68–73, 2010.
Other Hadoop tutorials
Cluster Setup from Apache.
Managing a Hadoop Cluster from Yahoo.
Tags: Cloud computing, Hadoop, java, MapReduce, Programming, Tutorial
Update history: Apr. 13, 2011. Rewrite the whole tutorial using 0.21.0 as the example platform. Jan. 9, 2011. Change to a more accurate name Feb. 23, 2010. Format this post. May. 1, 2010. Change the part that copy .ssh directory to all machines. Jun. 30, 2010. Format the post. Jul. 13, 2010. Revise the article. The jobtracker and namenode can be same master machine. Jul. 14, 2010. Add iptables configuration commands. Dec. 14, 2010. Format text. Sep. 2, 2011. Add pitfalls and lessons. Jan. 4, 2012: for Hadoop 0.20.203.0. Jan. 15, 2012. Add links to Sort benchmark for Hadoop and the Hadoop default ports.
Hey Thanks a lot!!!!!
There is another article on Hadoop present at http://www.thecloudtutorial.com/hadoop-tutorial.html .