Wednesday, July 28, 2010

Hadoop Cluster Deployment + Step-By-Step Process

I've successfully deployed a small 3-node cluster on the Hadoop platform. I mark this as the first success on the long road towards the Research Engine. It took me a while to understand the basics (since this was my first time), but it was a wonderful experience.

The cluster specs are:
  1. Core - Ubuntu 10.04 - 2 GB RAM
  2. Core - Ubuntu 9.04 - 1 GB RAM
  3. Virtual PC (VBox 3.2.4) - Ubuntu 10.04 - 512 MB RAM (hosted on machine 1)

Tests: the Grep example for Map/Reduce, and HDFS replication with a 500 MB copy replicated across 2 nodes.

Info: 1 NameNode, 1 DataNode, and 1 JobTracker



Below is the step-by-step procedure to deploy a Hadoop cluster (for learning purposes only; it can't be used as-is in a production environment. Please refer to the official docs and the latest release for that). See the disclaimer at the bottom before you read any further.
  1. In these steps, I shall assume you have 3-4 systems on a network, each running Ubuntu 9.04+ with the sun-java6-jdk and ssh packages installed. It's preferable to use a fresh system installation, though it's not mandatory.
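
    On a fresh Ubuntu install, a rough sketch of getting the prerequisites (assuming sun-java6-jdk is available in your configured repositories):

    # Install the Sun JDK and the SSH client/server on every node
    $ sudo apt-get install sun-java6-jdk ssh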
  2. Due to some issues with Hadoop 0.20.* (the latest stable release as of writing this post), we shall use Hadoop 0.19.2 (stable) for now. You can get a copy from: http://apache.imghat.com/hadoop/core/hadoop-0.19.2/hadoop-0.19.2.tar.gz (53 MB).
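
    For example, you can fetch it on each node with wget (the mirror URL above is just the one I used; any Apache mirror works):

    # Download the Hadoop 0.19.2 tarball
    $ wget http://apache.imghat.com/hadoop/core/hadoop-0.19.2/hadoop-0.19.2.tar.gz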
  3. Create a new user for the Hadoop work. This step is optional, but recommended, as it keeps the HADOOP_HOME path the same across the cluster.
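
    A minimal sketch, assuming you name the user "hadoop":

    # Create a dedicated user for Hadoop (the name is just an example)
    $ sudo adduser hadoop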
  4. Extract the Hadoop distribution into your home folder (you can extract it anywhere, though). Your HADOOP_HOME will then be something like: /home/yourname/hadoop-0.19.2
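
    For example:

    # Extract the tarball into your home folder
    $ tar -xzf hadoop-0.19.2.tar.gz -C ~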
  5. Now, repeat these steps on all the nodes (make sure HADOOP_HOME is the same on all of them).
  6. We need the IPs of all 3 nodes. Let them be 192.168.1.5, 192.168.1.6 and 192.168.1.7, where *.1.5 is the NameNode and *.1.6 is the JobTracker; these 2 are the main exclusive servers. You can find more info regarding them here (http://hadoop.apache.org/common/docs/r0.19.2/cluster_setup.html#Installation). Node *.1.7 is the DataNode, used for both task tracking and storing data.
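
    If you're not sure of a node's IP, you can check it with ifconfig (the output format may differ between releases):

    # Show this node's IP addresses
    $ ifconfig | grep "inet addr"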
  7. You'll find a file called "hadoop-site.xml" under the conf directory of the Hadoop distribution. Copy and paste the following contents between <configuration> and </configuration>:
    <property>
      <name>fs.default.name</name>
      <!-- IP of the NameNode -->
      <value>hdfs://192.168.1.5:9090</value>
      <description></description>
    </property>

    <property>
      <name>mapred.job.tracker</name>
      <!-- IP of the JobTracker -->
      <value>192.168.1.6:9050</value>
      <description></description>
    </property>
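
    Depending on your setup, you may also have to point Hadoop at your JDK via conf/hadoop-env.sh; the path below is an assumption for the Ubuntu sun-java6-jdk package:

    # In conf/hadoop-env.sh on every node (adjust the path to your JDK)
    export JAVA_HOME=/usr/lib/jvm/java-6-sun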
  8. Make sure the same is done on all the nodes in the cluster.
  9. Now, to set up the slaves for the NameNode to replicate data to: go to the HADOOP_HOME directory on the NameNode. Under the "conf" folder you should see a file called slaves.
  10. Upon opening slaves, you should see a line with "localhost". Add the IPs of all the DataNodes you wish to connect to the cluster, one per line. A sample slaves file will be as follows:

    localhost
    192.168.1.7

  11. Now, it's time to kick-start our cluster.
  12. Open a terminal on the NameNode and go to HADOOP_HOME.
  13. Execute the following commands:

    # Format the HDFS in the namenode
    $ bin/hadoop namenode -format

    # Start the Distributed File System service on the NameNode. It will ask for the passwords of the NameNode itself and all the slaves, to connect via SSH
    $ bin/start-dfs.sh
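
    If you'd rather not type passwords every time, you can set up passwordless SSH from the NameNode to itself and to each slave beforehand; a sketch (192.168.1.7 being the DataNode from step 6):

    # Generate an SSH key pair with an empty passphrase (skip if you already have one)
    $ ssh-keygen -t rsa -P ""

    # Copy the public key to the node itself and to every slave
    $ ssh-copy-id -i ~/.ssh/id_rsa.pub localhost
    $ ssh-copy-id -i ~/.ssh/id_rsa.pub 192.168.1.7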
  14. Your NameNode should now be up and running. To check the nodes connected to your cluster, jump to step 19 and come back.
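
    You can also check from the command line; the dfsadmin report lists the DataNodes known to the NameNode along with their capacity:

    # Report the DataNodes currently attached to HDFS
    $ bin/hadoop dfsadmin -report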
  15. Now, it's the JobTracker node's turn. Execute the following command there:

    # Start the Map/Reduce service on the JobTracker
    $ bin/start-mapred.sh
  16. The same process follows for the JobTracker: it asks for the passwords of itself and all its slaves to start the Map/Reduce service (did I tell you that you can also add slaves to the JobTracker? It's the same process as for the NameNode: just add the IPs to the slaves file of the JobTracker node's Hadoop distribution).
  17. Now that we're done starting the cluster, it's time to check it out!
  18. On the NameNode, execute the following command:

    # Copy a folder (conf) into HDFS as sample input
    $ bin/hadoop fs -put conf input
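
    You can verify the copy went through with a quick listing, for example:

    # List the files now stored under "input" on HDFS
    $ bin/hadoop fs -ls input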
  19. If you go to http://192.168.1.5:50070 in your browser, you should see the Hadoop HDFS admin interface. It's a simple interface that serves the purpose: it shows the cluster summary, live and dead nodes, etc.
  20. You can browse the HDFS using the "Browse the filesystem" link in the top-left corner.
  21. Go to http://192.168.1.6:50030 to view the Hadoop Map/Reduce admin interface. It displays the running jobs, finished jobs, etc.
  22. Now, it's time to check the Map/Reduce process. Execute the following:

    # Run the grep example that comes along with the distribution
    $ bin/hadoop jar hadoop-*-examples.jar grep conf output 'dfs[a-z.]+'
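
    Once the job finishes, you can read the results straight off HDFS, or pull them to the local filesystem, along these lines:

    # Print the job output directly from HDFS
    $ bin/hadoop fs -cat output/*

    # ...or copy it locally first and read it there
    $ bin/hadoop fs -get output output
    $ cat output/*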


Disclaimer: This is for my future reference. I don't take any responsibility for physical/mental/any other type of damage that may arise from following the above process.
