My Standalone Complexities: March 2012

Some time back there was a discussion on the Hadoop User mail list for the list of Hadoop ecosystem tools. I just thought I can put them together with a short description and links to their git repos or products page. If you find an error or feel I have missed out something let me know, I will update it.

Tools are in ascending order of their names.

Ambari - Ambari is a monitoring, administration and lifecycle management project for Apache Hadoop™ clusters.
Hadoop clusters require many inter-related components that must be installed, configured, and managed across the entire cluster. The set of components that are currently supported by Ambari includes: HBase, HCatalog, Hadoop HDFS, Hive, Hadoop MapReduce, Pig, Zookeeper. Visit their website for more information.
Avro - Apache Avro is a data serialization system
Avro provides:
1. Rich data structures,
2. A compact, fast
3. binary data format,
4. A container file, to store persistent data,
5. Remote procedure call (RPC),
6. Simple integration with dynamic languages.
  On the basis of working. It is similar to tools like Thrift or Protobuf but has its own edge as described on their documentation. Well basically to put it short it can be used for providing API like services that leverages Hadoop stack for performing some task.
Bixo - Bixo is an open source web mining toolkit that runs as a series of Cascading pipes on top of Hadoop
By building a customized Cascading pipe assembly, you can quickly create specialized web mining applications that are optimized for a particular use case. More information on their website.
BookKeeper - BookKeeper is a system to reliably log streams of records
It is designed to store write ahead logs, such as those found in database or database like applications. In fact, the Hadoop NameNode inspired BookKeeper. The NameNode logs changes to the in-memory namespace data structures to the local disk before they are applied in memory. However logging the changes locally means that if the NameNode fails the log will be inaccessible. We found that by using BookKeeper, the NameNode can log to distributed storage devices in a way that yields higher availability and performance. Although it was designed for the NameNode, BookKeeper can be used for any application that needs strong durability guarantees with high performance and has a single writer. More Info on their website.
Cascading - Cascading is a Data Processing API, Process Planner, and Process Scheduler used for defining and executing complex, scale-free, and fault tolerant data processing workflows on an Apache Hadoop cluster. All without having to 'think' in MapReduce.
Cascading is a thin Java library and API that sits on top of Hadoop's MapReduce layer and is executed from the command line like any other Hadoop application. Well more detailed documentation can be found on their website.
Cascalog - Cascalog is a Clojure-based query language for Hadoop inspired by Datalog
Cascalog is a fully-featured data processing and querying library for Clojure. The main use cases for Cascalog are processing "Big Data" on top of Hadoop or doing analysis on your local computer from the Clojure REPL. Cascalog is a replacement for tools like Pig, Hive, and Cascading.
Cascalog operates at a significantly higher level of abstraction than a tool like SQL. More importantly, its tight integration with Clojure gives you the power to use abstraction and composition techniques with your data processing code just like you would with any other code. It's this latter point that sets Cascalog far above any other tool in terms of expressive power. General introduction here, and source code here.
Chukwa - Chukwa is an open source data collection system for monitoring large distributed systems
Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a ?exible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data. More information can be found here.
Crunch - a Java library that aims to make writing, testing, and running MapReduce pipelines easy, efficient, and even fun. Crunch’s design is modeled after Google’s FlumeJava
Crunch is a Java library for writing, testing, and running MapReduce pipelines, based on Google's FlumeJava. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run. Excellent introduction here and Source code is here.
Crux - Reporting tool built for HBase
Crux is a reporting application for HBase, the Hadoop database. Crux helps to query and visualize data saved in HBase. General introduction can be found here and source code ishere.
Elastic Map Reduce - web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data
Amazon Elastic MapReduce utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). More information can be found on the AWS EMR page.
Flume - distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data
Its main goal is to deliver data from applications to Hadoop’s HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications. More information can be found on their wiki and source code is here.
Hadoop common - Hadoop Common is a set of utilities that support the Hadoop subprojects
Hadoop Common is a set of utilities that support the Hadoop subprojects. Hadoop Common includes FileSystem, RPC, and serialization libraries. More info can be gathered here.
Hama - distributed computing framework based on BSP (Bulk Synchronous Parallel) computing techniques for massive scientific computations
It was inspired by Google's Pregel, but different in the sense that it's purely BSP and common model, not just for graph. More information can be found here.
HBase - distributed scalable Big Data store
HBase is the Hadoop database. Think of it as a distributed scalable Big Data store. HBase can be used when you need random, realtime read/write access to your Big Data. There is extensive resource on HBase Book.
HCatalog - table and storage management service for data created using Apache Hadoop
Apache HCatalog includes Providing a shared schema and data type mechanism, Providing a table abstraction so that users need not be concerned with where or how their data is stored, and Providing interoperability across data processing tools such as Pig, Map Reduce, Streaming, and Hive. More information can be found on their website.
HDFS - primary storage system used by Hadoop applications
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data. HDFS was originally built as infrastructure for the Apache Nutch web search engine project. More information can be found on their website.
HIHO: Hadoop In, Hadoop Out - Hadoop Data Integration, deduplication, incremental update and more
Hadoop Data Integration with various databases, ftp servers, salesforce. Incremental update, dedup, append, merge your data on Hadoop. More information can be found on theirwebsite.
Hive - data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems
Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. More information can be found on theirwebsite.
Hoop - provides access to all Hadoop Distributed File System (HDFS) operations (read and write) over HTTP/S
Hoop server is a full rewrite of Hadoop HDFS Proxy. Although it is similar to Hadoop HDFS Proxy (runs in a servlet-container, provides a REST API, pluggable authentication and authorization), Hoop server improves many of Hadoop HDFS Proxy shortcomings. More information can be found on their website.
HUE (Hadoop User Environment) - browser-based desktop interface for interacting with Hadoop
Hue is both a web UI for Hadoop and a framework to create interactive web applications. It features a FileBrowser for accessing HDFS, JobSub and JobBrowser applications for submitting and viewing MapReduce jobs, a Beeswax application for interacting with Hive. On top of that, the web frontend is mostly built from declarative widgets that require no JavaScript and are easy to learn. More information can be found on their git repo.
Jaql - Query Language for JavaScript(r) Object Notation (JSON)
Jaql is a query language designed for Javascript Object Notation (JSON), a data format that has become popular because of its simplicity and modeling flexibility. Jaql is primarily used to analyze large-scale semi-structured data. Core features include user extensibility and parallelism. In addition to modeling semi-structured data, JSON simplifies extensibility. Hadoop's Map-Reduce is used for parallelism. More information can be found on their code base.
Lily - Lily is Smart Data, at Scale, made Easy
It is the first data repository built from the ground up to bring Big Data / NOSQL technology into the hands of the enterprise application architect. More detailed information on theirwebsite.
Mahout - machine learning library's goal is to build scalable machine learning libraries
Scalable to reasonably large data sets. Our core algorithms for clustering, classfication and batch based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm. However we do not restrict contributions to Hadoop based implementations: Contributions that run on a single node or on a non-Hadoop cluster are welcome as well. The core libraries are highly optimized to allow for good performance also for non-distributed algorithms. More information can be found here.
Map Reduce - MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes
Apache MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks. More detail tutorial with examples to run on top of Hadoop can be found here while the general introduction to Map Reduce can be looked upon in Google University.
Nutch - open source Web crawler written in Java
You have all the features you can expect from a web crawler. Now Nutch can be integrated with Hadoop, this resource can help you setting that up. More details of Nutch can be found on their website.
Oozie - workflow/coordination service to manage data processing jobs for Apache Hadoop
Oozie is an extensible, scalable and data-aware service to orchestrate dependencies between jobs running on Hadoop (including HDFS, Pig and MapReduce). Oozie is a lot of things, but being: A workflow solution for off Hadoop processing and Another query processing API, a la Cascading is not one of them. More useful information can be found on here and also here.
Pangool - low-level MapReduce API that aims to be a replacement for the Hadoop Java MR API
By implementing an intermediate Tuple-based schema and configuring a Job conveniently, many of the accidental complexities that arise from using the Hadoop Java MapReduce API disappear.
Pig - platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets. At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). More information can be found at Pig Home.
PrestoDB - Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
Scalding - Scala API for Cascading
Refere to Cascading for more information. More information of Scalding can be found here. Excellent tutorial can be found here to get face-to-face introduction to Scalding.
Sqoop - tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
Wonderful documentation on Sqoop can be found on Cloudera, its creator. Also official website is here.
Zookeeper - centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them ,which make them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed. More information can be found here.

Back in my work we were working on R and Hadoop using the RHadoop(RMR). The latest release of RMR v.1.2 (download) has quite a few interesting updates to it. See here for a complete overview.

One of our test Hadoop Cluster has 10+ nodes which runs on Hadoop 0.20-append version build specifically for HBase. When we upgraded our RMR package on the cluster with v1.2, we ran into multiple issues. This post is just a summary of my experience on how to patch the Hadoop 0.20.x versions to support RMR v1.2 right away. Hoping it might be helpful for others in the community who might encounter some problems.

First and foremost thing is, when you upgrade the RMR version without patching your earlier Hadoop distribution then you are likely to end up in the error of org.apache.hadoop.contrib.streaming.AutoInputFormat.
So you can follow the instructions as specified in Dumdo's Wiki (https://github.com/klbostee/dumbo/wiki/Building-and-installing) to download the apply the required patches.
This should help you run your latest R codes on the Hadoop cluster. Still you can't use the "combine" parameter if you are using earlier version of hadoop-0.20.203. In which case you might also need HADOOP-4842 patch.

These are the in general very broad steps involved in building the patched version of the Hadoop.

When you try to manage over a large cluster (I am not talking about me), over 40,50 machines building on all the systems is a waste of time and you generally need to have your Hadoop stack brought down. So what I suggest you is, download a local copy of the Hadoop version you are using right now.

You can either download them from the Hadoop releases or checkout from source code, assuming you already dont have a custom built version of Hadoop.

Apply the patches on the local copy (you don't need to edit any configurations or change any parameters). Just apply the patches and build the Hadoop source. Once you apply these patches and build the source code you need to replace only one single JAR file in your production cluster, which is $HADOOP_HOME/contrib/streaming/hadoop*streaming*.jar. All the patches deal only with the Hadoop Streaming only. But do realize you need to build the JAR for your version.

I just wrote a small ant build script that can aid in doing the above process. Which tries to do the above process in an automated way.

PS: Though I have tested the code, still try this at your own risk.

My Standalone Complexities

Monday, March 19, 2012

Fixing RHIPE socketConnection Error

Monday, March 12, 2012

When Psychology meets Web 2.0

Thursday, March 8, 2012

List of Hadoop Ecosystem Tools

Using AWS from Corporate Firewall

Friday, March 2, 2012

Patching Hadoop to support RMR 1.2