Thursday, March 14, 2013

HRegionInfo was null or empty in .META.

I was bulk importing data into HBase today and stumbled upon this error (error.log):


I wrote MetaInfo.rb, ran it, and got the offending region names printed on the console along with the same log message. I then had to delete all those entries from '.META.' by hand, and everything was fine after that.
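The check boils down to scanning '.META.' for rows whose info:regioninfo cell is missing or empty, which is exactly what triggers this error. A rough Java equivalent of the script, as a sketch against the HBase 0.94 client API (the class name FindEmptyRegionInfo is just for illustration, not the actual MetaInfo.rb):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class FindEmptyRegionInfo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable meta = new HTable(conf, ".META.");
    try {
      Scan scan = new Scan();
      scan.addFamily(Bytes.toBytes("info"));
      ResultScanner scanner = meta.getScanner(scan);
      for (Result row : scanner) {
        byte[] regionInfo = row.getValue(Bytes.toBytes("info"),
                                         Bytes.toBytes("regioninfo"));
        if (regionInfo == null || regionInfo.length == 0) {
          // This is the row that produces "HRegionInfo was null or empty in .META."
          System.out.println(Bytes.toStringBinary(row.getRow()));
        }
      }
      scanner.close();
    } finally {
      meta.close();
    }
  }
}

Each row key it prints can then be removed from the HBase shell with deleteall '.META.', '<row key>'.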

Monday, March 11, 2013

HFileInputFormat for Bulk repair of HBase

Today something interesting happened to our HBase cluster: of the roughly 1600 regions we currently hold, some 400+ regions ended up with their STARTKEY as '' (empty start keys). We tried to do an Offline Meta Repair, but that kept failing with "Multiple regions have the same startkey:".

Quick (Dirty) Fix
  1. Move those regions out of /hbase/.
  2. Run an offline meta repair on the remaining data. This step is optional: if your .META. table is also screwed up you may need it; otherwise you may have to find and remove the bad entries from .META. manually. I did not try this, so I am not going to dwell on it further.
  3. Write a simple MR job that processes those regions per column family (per-CF), exports them as SequenceFiles, and imports them back using the regular HBase Import (org.apache.hadoop.hbase.mapreduce.Import).

When writing the MR job, one thing you will need that is not available off the shelf is an HFileInputFormat. I found a Scala version of it, which I ported to Java.
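The core of such an input format looks roughly like the sketch below. Treat it as a sketch against the HBase 0.94 / Hadoop 1.x APIs rather than a drop-in class: it opens each HFile directly, reads every KeyValue out of it, and hands it to the mapper keyed by row.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileScanner;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class HFileInputFormat extends FileInputFormat<ImmutableBytesWritable, KeyValue> {

  @Override
  protected boolean isSplitable(JobContext context, Path filename) {
    return false; // an HFile is read whole, one mapper per file
  }

  @Override
  public RecordReader<ImmutableBytesWritable, KeyValue> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new HFileRecordReader();
  }

  private static class HFileRecordReader
      extends RecordReader<ImmutableBytesWritable, KeyValue> {

    private HFile.Reader reader;
    private HFileScanner scanner;
    private long entryCount;
    private long entriesRead;
    private boolean seeked;
    private ImmutableBytesWritable key;
    private KeyValue value;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
      Path path = ((FileSplit) split).getPath();
      Configuration conf = context.getConfiguration();
      FileSystem fs = path.getFileSystem(conf);
      // Open the HFile directly, bypassing the region server entirely.
      reader = HFile.createReader(fs, path, new CacheConfig(conf));
      reader.loadFileInfo();
      scanner = reader.getScanner(false, false); // no block cache, no pread
      entryCount = reader.getEntries();
    }

    @Override
    public boolean nextKeyValue() throws IOException {
      boolean hasNext = seeked ? scanner.next() : scanner.seekTo();
      seeked = true;
      if (!hasNext) {
        return false;
      }
      value = scanner.getKeyValue();
      key = new ImmutableBytesWritable(value.getRow());
      entriesRead++;
      return true;
    }

    @Override
    public ImmutableBytesWritable getCurrentKey() { return key; }

    @Override
    public KeyValue getCurrentValue() { return value; }

    @Override
    public float getProgress() {
      return entryCount == 0 ? 1.0f : (float) entriesRead / entryCount;
    }

    @Override
    public void close() throws IOException {
      if (reader != null) {
        reader.close();
      }
    }
  }
}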


I also made some changes relative to the Scala version so that it works on HBase 0.94.2 (the version we tested against).
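To tie things together, the export side of the job can be sketched as below (again only a sketch, with illustrative class names like HFileToSequenceFile). The stock Import reads SequenceFiles of (ImmutableBytesWritable, Result) pairs, the same format Export writes, so the KeyValues coming out of the HFiles are regrouped per row in a reducer:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class HFileToSequenceFile {

  // Identity mapper: HFileInputFormat already emits (row key, KeyValue) pairs.
  static class KVMapper
      extends Mapper<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(ImmutableBytesWritable row, KeyValue kv, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(row, kv);
    }
  }

  // Regroup all KeyValues of a row into one Result, the record type Import expects.
  static class ResultReducer
      extends Reducer<ImmutableBytesWritable, KeyValue, ImmutableBytesWritable, Result> {
    @Override
    protected void reduce(ImmutableBytesWritable row, Iterable<KeyValue> kvs, Context ctx)
        throws IOException, InterruptedException {
      List<KeyValue> copies = new ArrayList<KeyValue>();
      for (KeyValue kv : kvs) {
        copies.add(kv.clone()); // deep copy: Hadoop reuses the iterator's value object
      }
      ctx.write(row, new Result(copies));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hfile-to-sequencefile");
    job.setJarByClass(HFileToSequenceFile.class);

    job.setInputFormatClass(HFileInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0])); // e.g. a moved-out region's CF directory

    job.setMapperClass(KVMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);

    job.setReducerClass(ResultReducer.class);
    job.setOutputKeyClass(ImmutableBytesWritable.class);
    job.setOutputValueClass(Result.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The output directory can then be loaded back with the regular importer, hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <outputdir>.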


DISCLAIMER: The above solution was tested on an HBase 0.94.2 and Hadoop 1.1.1 setup.

Saturday, February 16, 2013

SQL Processing using Hive - Chennai HUG, Feb '13

Today's Chennai Hadoop User Group meetup was a basic introduction to Apache Hive. It was a full-fledged tutorial session from +Senthil Kumar and +Prasad S, with demos on basic queries and on selecting the top 500 songs by popularity from the 1 Million Song Dataset, respectively.

The version used for the demos was Hive 0.9.0 (http://www.apache.org/dyn/closer.cgi/hive/hive-0.9.0/).

Hive and Why?

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Hive Interfaces

Hive comes with three default interfaces for working with it.

  1. CLI - Command Line Interface, the most widely used way of working with Hive.
    $ bin/hive
  2. HWI - Hive Web Interface, a browser-based interface.
    $ bin/hive --service hwi
  3. Server - Hive Server, to be used as a JDBC backend (a minimal JDBC sketch follows below). 
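
For the server option, here is a minimal JDBC sketch against a Hive 0.9 server started with bin/hive --service hiveserver (the driver class and port 10000 are the Hive 0.9 defaults; sample_table is just a placeholder):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // Hive 0.9 (HiveServer1) JDBC driver; the server listens on port 10000 by default.
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();

    // 'sample_table' is a placeholder; point this at any existing table.
    ResultSet rs = stmt.executeQuery("SELECT * FROM sample_table LIMIT 10");
    while (rs.next()) {
      System.out.println(rs.getString(1));
    }

    rs.close();
    stmt.close();
    con.close();
  }
}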

More Resources on Hive