Friday, March 2, 2012

Patching Hadoop to support RMR 1.2

At work we have been using R with Hadoop through RHadoop (RMR). The latest release, RMR v1.2 (download), brings quite a few interesting updates; see here for a complete overview.

One of our test Hadoop clusters has 10+ nodes and runs the Hadoop 0.20-append version, built specifically for HBase. When we upgraded the RMR package on this cluster to v1.2, we ran into multiple issues. This post is a summary of my experience patching Hadoop 0.20.x to support RMR v1.2, in the hope that it helps others in the community who hit the same problems.


  1. First and foremost: if you upgrade RMR without patching your existing Hadoop distribution, you are likely to run into an error referring to org.apache.hadoop.contrib.streaming.AutoInputFormat. 
  2. To fix this, follow the instructions in Dumbo's wiki (https://github.com/klbostee/dumbo/wiki/Building-and-installing) to download and apply the required patches (a command-line sketch follows this list). 
  3. This should let you run your latest R code on the Hadoop cluster. However, you still can't use the "combine" parameter on versions earlier than hadoop-0.20.203; in that case you also need the HADOOP-4842 patch. 
These, broadly, are the steps involved in building a patched version of Hadoop. 
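To make step 2 concrete, here is a minimal command-line sketch of downloading and applying the patches from inside the Hadoop source tree. The patch file names and JIRA attachment URLs below are placeholders; use the exact patches listed in the Dumbo wiki for your Hadoop version.

    # run from the root of your Hadoop source tree
    cd hadoop-0.20.2

    # fetch the patches listed in the Dumbo wiki (URLs/names are placeholders;
    # HADOOP-4842 is only needed for the "combine" parameter on older versions)
    wget https://issues.apache.org/jira/secure/attachment/.../HADOOP-1722.patch
    wget https://issues.apache.org/jira/secure/attachment/.../HADOOP-4842.patch

    # dry-run first, then apply each patch from the source root
    patch -p0 --dry-run < HADOOP-1722.patch
    patch -p0 < HADOOP-1722.patch
    patch -p0 < HADOOP-4842.patch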

When you manage a large cluster (not my case) of 40 or 50 machines, building on every node is a waste of time, and it generally requires bringing your Hadoop stack down. So my suggestion is: download a local copy of the Hadoop version you are running right now. 

You can either download it from the Hadoop releases page or check it out from source control, assuming you are not already running a custom-built version of Hadoop. 
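As a sketch, assuming the stock 0.20.2 release (adjust the version, URL, or branch to match your own build, e.g. the 0.20-append branch used for HBase):

    # option 1: grab a release tarball from the Apache archive
    wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
    tar -xzf hadoop-0.20.2.tar.gz

    # option 2: check out a source branch (e.g. 0.20-append for HBase clusters)
    svn checkout \
        http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-append \
        hadoop-0.20-append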

Apply the patches to the local copy (you don't need to edit any configuration or change any parameters) and build the Hadoop source. Once the build completes, you only need to replace one single JAR file on your production cluster: $HADOOP_HOME/contrib/streaming/hadoop*streaming*.jar. All the patches touch Hadoop Streaming only. Do realize, though, that you need to build the JAR for your own Hadoop version. 
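Here is a minimal sketch of that build-and-swap, assuming Ant and a JDK on the build machine. The Ant target and jar name can vary slightly across 0.20.x versions; cluster-nodes.txt (one hostname per line) is a hypothetical helper file, and the loop assumes $HADOOP_HOME is the same path on every node.

    # from the root of the patched source tree
    cd hadoop-0.20.2
    ant package    # builds core plus the contrib modules, including streaming

    # the rebuilt streaming jar lands under build/contrib/streaming/
    ls build/contrib/streaming/hadoop-*streaming*.jar

    # push just that one jar to every node, then restart the affected daemons
    for node in $(cat cluster-nodes.txt); do
        scp build/contrib/streaming/hadoop-*streaming*.jar \
            "$node:$HADOOP_HOME/contrib/streaming/"
    done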

I wrote a small Ant build script that automates the above process end to end. 
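The script itself is not reproduced here, but the invocation would look something like this (the build file name and the property are hypothetical placeholders):

    # hypothetical invocation of such a build script
    ant -f patch-streaming.xml -Dhadoop.version=0.20.2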


PS: Though I have tested the code, try this at your own risk. 
