Wednesday, July 28, 2010

Hadoop Cluster Deployment + Step-By-Step Process

I've successfully deployed a small 3-node cluster on the Hadoop platform. I mark this as the first success on the long road towards the Research Engine. It took me a while to understand the basics (since this is my first time), but it was such a wonderful experience.

The cluster specs are:
  1. Core - Ubuntu 10.04 - 2 GB RAM
  2. Core - Ubuntu 9.04 - 1 GB RAM
  3. Virtual machine (VirtualBox 3.2.4) - Ubuntu 10.04 - 512 MB RAM (hosted on machine 1)

Tests: the Grep example for Map/Reduce, and content duplication with a 500 MB copy replicated over 2 nodes.

Info: 1 NameNode, 1 DataNode, and 1 JobTracker



Below is the Step-by-Step procedure to deploy a Hadoop Cluster (for Learning purposes only. This can't be used as such in production environment. Please refer to Official Docs, and latest release for that). See the disclaimer on the bottom before even you start reading beyond this.
  1. In these steps, I assume you have 3-4 systems on a network, each running Ubuntu 9.04+ with the sun-java6-jdk and ssh packages installed (a rough install sketch follows below). It's preferable to use a fresh system installation, though it's not mandatory.
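
    Just a rough install sketch for the prerequisites on Ubuntu (depending on your release, you may first need to enable the partner/multiverse repositories for the Sun JDK):

    # Install the Sun JDK and SSH on every node
    $ sudo apt-get update
    $ sudo apt-get install sun-java6-jdk ssh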
  2. Due to some issues with Hadoop 0.20.* (the latest stable release as of writing this post), we shall use Hadoop 0.19.2 (stable) for now. You can get a copy from: http://apache.imghat.com/hadoop/core/hadoop-0.19.2/hadoop-0.19.2.tar.gz (53 MB).
  3. Create a new user for the Hadoop work. This step is optional, but it's recommended, since it keeps the HADOOP_HOME path the same across the cluster.
  4. Extract the Hadoop distribution into your home folder (you can extract it anywhere, though). So your HADOOP_HOME will be something like: /home/yourname/hadoop-0.19.2
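
    If it helps, here is a rough sketch of fetching and unpacking the tarball (using the mirror URL from step 2; adjust the path and version to your setup):

    # Download and extract Hadoop 0.19.2 into the home folder
    $ cd ~
    $ wget http://apache.imghat.com/hadoop/core/hadoop-0.19.2/hadoop-0.19.2.tar.gz
    $ tar -xzf hadoop-0.19.2.tar.gz

    # HADOOP_HOME then points to the extracted folder
    $ export HADOOP_HOME=$HOME/hadoop-0.19.2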
  5. Now repeat these steps on all the nodes (make sure HADOOP_HOME is the same on all of them).
  6. We need the IPs of all 3 nodes. Let them be 192.168.1.5, 192.168.1.6, and 192.168.1.7, where *.1.5 is the NameNode and *.1.6 is the JobTracker; these two are the main exclusive servers. You can find more info about them here (http://hadoop.apache.org/common/docs/r0.19.2/cluster_setup.html#Installation). Node *.1.7 is the DataNode, which handles both task tracking and data storage.
  7. You'll find a file called "hadoop-site.xml" under the conf directory of the Hadoop distribution. Copy and paste the following contents between <configuration> and </configuration>:
    <property>
      <name>fs.default.name</name>
      <!-- IP of the NameNode -->
      <value>hdfs://192.168.1.5:9090</value>
      <description></description>
    </property>

    <property>
      <name>mapred.job.tracker</name>
      <!-- IP of the JobTracker -->
      <value>192.168.1.6:9050</value>
      <description></description>
    </property>
  8. Make sure the same is done on all the nodes in the cluster.
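
    One way to avoid editing hadoop-site.xml by hand on every box (just a sketch, assuming the same user name and HADOOP_HOME on each node) is to push the edited file over with scp:

    # From HADOOP_HOME on the node where you edited the file
    $ scp conf/hadoop-site.xml 192.168.1.6:~/hadoop-0.19.2/conf/
    $ scp conf/hadoop-site.xml 192.168.1.7:~/hadoop-0.19.2/conf/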
  9. Now, to create the slaves list for the NameNode to replicate the data: go to the HADOOP_HOME directory on the NameNode. Under the "conf" folder you should see a file called slaves.
  10. Upon opening slaves, you should see a line with "localhost". Add the IPs of all the DataNodes you wish to connect to the cluster, one per line. A sample slaves file looks as follows:

    localhost
    192.168.1.7

  11. Now it's time to kick-start our cluster.
  12. Open a terminal on the NameNode and go to HADOOP_HOME.
  13. Execute the following commands:

    # Format the HDFS in the namenode
    $ bin/hadoop namenode -format

    # Start the Distributed File System service from the NameNode; it will ask for the passwords of the NameNode and all the slaves, to connect via SSH
    $ bin/start-dfs.sh
  14. Your NameNode should now be up and running. To check which nodes are connected to your cluster, jump ahead to step 19 and come back.
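
    For a quick check on a single machine (assuming the Sun JDK's jps tool is on your PATH), you can also list the Hadoop daemons running on a node:

    # List the Java processes on this node
    $ jps
    # On the NameNode you should see something like NameNode and SecondaryNameNode;
    # on the slave, a DataNode.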
  15. Now for the JobTracker node. Execute the following command:

    # Start the Map/Reduce service on the JobTracker
    $ bin/start-mapred.sh
  16. The same process follows for the JobTracker. It asks for the passwords of itself and all its slaves (did I mention you can also add slaves to the JobTracker? It's the same process as for the NameNode; just add the IPs to the slaves file of the JobTracker node's Hadoop distribution) in order to start the Map/Reduce service.
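
    If typing passwords for every node gets tedious, a common optional shortcut is passwordless SSH from the master nodes to their slaves. A minimal sketch, assuming the same user name on every machine:

    # On the NameNode (and the JobTracker), generate a key pair once
    # (accept the default file location when prompted)
    $ ssh-keygen -t rsa -P ""

    # Copy the public key to each slave (repeat for every slave IP)
    $ ssh-copy-id 192.168.1.7

    After this, start-dfs.sh and start-mapred.sh should reach the slaves without prompting for passwords.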
  17. Now that we're done starting the cluster, it's time to check it out!
  18. On the NameNode, execute the following command:

    # Copy a folder (conf) to HDFS, as sample input
    $ bin/hadoop fs -put conf input
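
    To double-check that the copy worked (just a sanity check), list the uploaded files in HDFS:

    # List the files now sitting under "input" in HDFS
    $ bin/hadoop fs -ls input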
  19. If you go to http://192.168.1.5:50070 in your browser, you should see the Hadoop HDFS admin interface. It's a simple interface that serves its purpose; it shows you the cluster summary, live and dead nodes, etc.
  20. You can browse the HDFS using the "Browse the filesystem" link in the top-left corner.
  21. Go to http://192.168.1.6:50030 to view the Hadoop Map/Reduce admin interface. It displays the running jobs, finished jobs, etc.
  22. Now it's time to check the Map/Reduce process. Execute the following:

    # Run the Grep example that ships with the distribution
    $ bin/hadoop jar hadoop-*-examples.jar grep conf output 'dfs[a-z.]+'
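
    Once the job finishes, its results land in the "output" directory on HDFS. A small sketch of how you might read them back:

    # Print the results straight from HDFS
    $ bin/hadoop fs -cat output/*

    # ...or copy them to the local filesystem first and read them there
    $ bin/hadoop fs -get output output
    $ cat output/*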


Disclaimer: This is for my future reference. I don't take any responsibility for physical, mental, or any other type of damage that may arise from following the above process.

Tuesday, July 27, 2010

Opensource Project Hosting @ SRC - Part II

Hello,

This is just an update blog post. The "Opensource Project Hosting" project at SRC has been approved by our system admin. So, if anyone is interested in joining this project, please let me know!

Contact Details: ashwanth.kumar@gmail.com

Requirements: You should be a student of SRC.

Sunday, July 25, 2010

Project Gym Buddy

My first idea for a robotics project. Inspired by Ramaprasanna Chellamuthu's presentation at Microsoft Community Tech Days, I've actually started thinking robotically ;) Now, let's start with the project description.

A short history
Today we had the usual power cut at 0800 hours, and I was doing my workout on my exercise cycle when this struck me.

Goal
Use my exercise cycle for playing games and do (almost) all my gym activities right from it. Why would anyone want to play games with their exercise cycle when you have high-end Wii remotes and such?? Well, it's because I want to multitask! ;)


Description
---------------------------

Scrap it, Ash. Take a look here: http://blogs.pcworld.com/tipsandtweaks/archives/003120.html

Saturday, July 24, 2010

User is the King!

While browsing through Yahoo! Labs, I came upon two projects which actually inspired me a lot. So this is just a brief note on what I've taken from them, and how a similar feature could be used in my Research Engine project.
The thing that really caught my eye is this: "The biggest scientific challenge in contextual advertising is that compared to sponsored search or Web search, user intent is not very clear".

My Question: How on earth are we to find the context of a user whom we don't know and can't see?
Their Answer: The Keystone system works by first extracting "essence" from opportunity - understanding what the content is about and who is viewing it.

More Info: A key difference between Keystone and other contextual advertising systems is that Keystone tries to predict and model user response based on all user context, including page content, user attributes like behavioral and geographical data, referrals to the page (how the user got there), and information about the publisher page.

Read the rest here.

Another project is Motif, from the Search Technologies group at Yahoo! Research.
Project Motif is very similar to Keystone (which is focused on advertising) in its usage of context. The thing is, Motif is more concerned with query context than user context. Try out the demo here and you'll know what I mean. This is relatively easy to implement and maintain.

I really like Motif for its search relevance, and I'd like to add a similar feature to my Research Engine search module. Also, the Keystone methodology helps me understand user context, based on which I can search the query context to further refine my results.

Got anything similar? Please share!

Open Source Project Management in SRC

The following post is meant for all my SRC, SASTRA friends. Never mind whether you're a former or present student; if you have any connection to SASTRA SRC, Kumbakonam, then this is for you!

I would like to propose a new project to implement using the hardware power of our college server: open source project hosting, like SourceForge.net, Google Code Project Hosting, Git, etc.

Features of this proposal
  • Students are given an account and allowed to create multiple projects on the server.
  • The server supports Subversion, Mercurial, and, if possible, Git as well.
  • They're given an issue tracking system (like Trac).
  • Others can also join them to code an application.

The general focus is on final-year students, who can use this repository for their final-year projects, so that once they're out, juniors can work on the same topic and improve the existing system. The actual scope of this project is small, but the main reason behind it is to make it more effective alongside SRC-FOSS and get students contributing to OSS, not only by using it or sitting in class, but by actually working on what they use.

Please post whether you like or dislike this proposal in the comments.

PS: This is just my own proposal; I haven't yet talked about this to any FOSS (past/present) member(s). Will they approve it? OMG! That's a $1,000,000,000 question!

Update 1: This is part of a series of project proposals for SRC. If you have any ideas or proposals, leave them as a comment.

Friday, July 23, 2010

SRC Office Automation - On 2nd (3rd) Year

It's been two years now since I first set foot on my SASTRA SRC campus as a student. It's been so great that I never realized the days flying by. I would like to add a feather to this beautiful day with a short bit of news about something that happened just yesterday.

"Our Office, u know what? Maintains all the student info in a Excel Workbook!! OMG! Is SASTRA that bad?"

"Every time, i need to pay the fees and get a receipt, that guy takes eternity!!"

"SRC needs to grow a lot more!"

These are some of the usual comments we students make on seeing the condition of our management (at least on our campus). Has all that now vanished into thin air? (Why??) I've been called in by our AO (or is it EO, or just an office staff member? I'm still not sure, actually, but all I know is that it's official) and have been asked to build an office automation application for our campus.

I saw the Excel sheets with my own eyes and couldn't believe the complexity involved! That poor guy maintains 5 *.xls files with 6-10 sheets each (most of them reports). The speed at which he moves around in Excel is outstanding. To be really honest, I never thought our campus office guy could work at this speed.

Who would've thought? Me getting the very work I've been commenting about for the past two years. I feel very proud (don't know why, though; probably because it's for our office) and happy :)

Today I completed the data model for their application and got approval from the client (:wink:).

To my fellow SRC dudes & dudettes: if you're interested, please let me know, because I'm working on this project and many others (for our campus) alone. If you would like to join hands with me, please do so!

Thursday, July 22, 2010

Which Data Warehouse Infrastructure to use?

OMG! The problem of selecting components has started all over again. The question is: which data warehouse infrastructure should the Research Engine use?

Available choices are:
  1. Apache Hive - http://hadoop.apache.org/hive/
  2. IBM InfoSphere Warehouse - http://www-01.ibm.com/software/data/infosphere/warehouse/
  3. Mike 2 - http://mike2.openmethodology.org/
  4. MySQL (Really ?) - http://opensourceanalytics.com/2005/11/03/data-warehousing-with-mysql/
If you've got any experience implementing one (or more) of the above, please do let me know; I can surely use your help. Got anything else? Better suggestions? Please do comment.

Update: Since it's TGMC, I'm sticking with InfoSphere.

Research Engine - Work Flow

A proposal for the event workflow of the Research Engine. If you have suggestions or improvements, please post them as a comment.

  1. Get the user's input search terms, or query.
  2. Find the model (or domain) to which the query belongs. This step finds the model of the user query using keywords. The purpose is to identify as many related models of the query as possible, for query processing based on Language Processing (LP) techniques.
  3. Once the list of related models is identified, the query goes through Language Processing (LP). This step ensures the evolution of the Research Engine over time. It does the following work: understand the query, and identify the exact related models (if any) or create new models (if none).
  4. Once we have identified the models related to the query, query the Model Data Store (DS) to fetch the related information about the model (a subset of the model).
  5. The output of the previous step gives all the related information the user wants. Now all that is left is to output the processed info in any format of choice (depending upon the application).
Well, this is the initial design of the RE, so this event flow is subject to change at any time without notice. If you have suggestions or improvements on the existing design, please do let us know.

Wednesday, July 21, 2010

Research Engine - Interactive Search Engine

At last our mentor has accepted our proposal, "Research Engine". It's basically an enhanced version of a semantic search engine, planned to be built from the bottom up in a purely interactive way. Since too much interaction makes users lazy or puts them off the concept, we're planning to extract metadata from users' social networking profiles (like Facebook, Orkut, Twitter, MySpace, etc.) to automate the interaction process and improve it dynamically.

Also, we're planning to build this project on top of Nutch. Many modifications are required to make it a semantic search engine. DB2 9.5 Enterprise, Jena, WASCE, and Hadoop are some of the major components to be included in the project.

PS: I'll try to make updates like this regularly, but I'm not sure about that either.

Tuesday, July 20, 2010

My TGMC teammates this year

At last, after a long struggle to find team members with vibrant interest and enthusiasm to match my frequency, I got myself the best of SRC. Below are the names in alphabetical order:
  • Ashwanth Kumar - III CSE
  • Kirubaharan A - III CSE
  • Saravana Kumar - II CSE
  • Swetha S - II CSE
Now that the team is set, we're all ready for the launch of TGMC 2010.

Tuesday, July 6, 2010

Hello Blog

Hello, I'm Ashwanth. I never actually sit down and blog, but I just thought I'd give it a try! Let's see how well this actually goes.