Friday, May 15, 2015

[IDEA] Autoscaling in Hadoop

Everybody today uses Hadoop plus some of its ecosystem tools, be it Hive, HBase, etc. I have been writing production code on Hadoop since 2012 and experimenting with it for much longer than that. One thing I don't see anywhere is the ability to autoscale a Hadoop cluster elastically. Unlike with web servers, having all map and reduce slots full doesn't necessarily translate to CPU / IO metrics spiking on the machines.

Figure - Usage graph observed from a production cluster (hand-drawn on a whiteboard)

On this front, only the Qubole guys seem to have done some decent work. You should check out their platform if you haven't; it is really super cool. A lot of the inspiration for this post came from using it.

This is a hobby-project attempt at building just the autoscaling feature for Hadoop1 clusters, the way I might have built it had I been part of, say, the Qubole team back in 2012.

In this blog post I talk about the implementation goals and the hows. I plan to build this as part of InMobi's HackDay 2015 (if I get through the selection), or go ahead and build it anyway on that weekend.

For every cluster, you would need the following configuration settings:

  • minNodes - Minimum # of TTs you would always want in the cluster.
  • maxNodes - Maximum # of TTs that your cluster would like to use at any point in time.
  • checkInterval - Interval in seconds at which to check the cluster for compute demand (default - 60)
  • hustlePeriod - Time in seconds to monitor the demand before we go ahead with upscaling / downscaling the cluster (default - 600)
  • upscaleBurstRate - Rate at which you want to upscale based on the demand (default - 100%)
  • downscaleBurstRate - Rate at which you want to downscale (default - 25%)
  • mapsPerNode - # of map slots per TT  (default - based on the machine type)
  • reducersPerNode - # of reduce slots per TT (default - based on the machine type)
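
To make these settings concrete, here is a minimal Java sketch of how a single scaling decision could be computed from them. None of it comes from an actual implementation. The names (AutoscaleSketch, Sample, targetNodes) are hypothetical, and so is the decision rule: demand is measured as requested map / reduce slots, a change is made only if every check within the hustle period agrees on the direction, and the burst rates are read as the largest fraction of the current cluster size that may be added or removed in one step.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: names and the decision rule are assumptions.
final class AutoscaleSketch {
    /** One demand reading, taken every checkInterval seconds (hypothetical shape). */
    static final class Sample {
        final int mapSlotsDemanded;
        final int reduceSlotsDemanded;

        Sample(int mapSlotsDemanded, int reduceSlotsDemanded) {
            this.mapSlotsDemanded = mapSlotsDemanded;
            this.reduceSlotsDemanded = reduceSlotsDemanded;
        }
    }

    final int minNodes;
    final int maxNodes;
    final double upscaleBurstRate;   // 1.0 means 100%
    final double downscaleBurstRate; // 0.25 means 25%
    final int mapsPerNode;
    final int reducersPerNode;

    AutoscaleSketch(int minNodes, int maxNodes, double upscaleBurstRate,
                    double downscaleBurstRate, int mapsPerNode, int reducersPerNode) {
        this.minNodes = minNodes;
        this.maxNodes = maxNodes;
        this.upscaleBurstRate = upscaleBurstRate;
        this.downscaleBurstRate = downscaleBurstRate;
        this.mapsPerNode = mapsPerNode;
        this.reducersPerNode = reducersPerNode;
    }

    /** TTs needed to serve one sample: the larger of the map and reduce requirements. */
    int nodesNeeded(Sample s) {
        int forMaps = (s.mapSlotsDemanded + mapsPerNode - 1) / mapsPerNode;             // ceiling division
        int forReduces = (s.reduceSlotsDemanded + reducersPerNode - 1) / reducersPerNode;
        return Math.max(forMaps, forReduces);
    }

    /** Cluster size to aim for, given all samples collected during one hustlePeriod. */
    int targetNodes(int currentNodes, List<Sample> window) {
        if (window.isEmpty()) {
            return clamp(currentNodes);
        }
        int lowest = Integer.MAX_VALUE;
        int highest = 0;
        for (Sample s : window) {
            int needed = nodesNeeded(s);
            lowest = Math.min(lowest, needed);
            highest = Math.max(highest, needed);
        }
        if (lowest > currentNodes) {
            // Every sample wanted more nodes: grow, but by at most the upscale burst.
            int burst = Math.max(1, (int) Math.ceil(currentNodes * upscaleBurstRate));
            return clamp(currentNodes + Math.min(lowest - currentNodes, burst));
        }
        if (highest < currentNodes) {
            // Every sample wanted fewer nodes: shrink, but by at most the downscale burst.
            int burst = Math.max(1, (int) Math.ceil(currentNodes * downscaleBurstRate));
            return clamp(currentNodes - Math.min(currentNodes - highest, burst));
        }
        return clamp(currentNodes); // mixed signals within the window: leave the cluster alone
    }

    private int clamp(int nodes) {
        return Math.max(minNodes, Math.min(maxNodes, nodes));
    }

    public static void main(String[] args) {
        AutoscaleSketch policy = new AutoscaleSketch(2, 20, 1.0, 0.25, 4, 2);
        List<Sample> busy = Arrays.asList(new Sample(40, 6), new Sample(48, 4));
        System.out.println(policy.targetNodes(4, busy)); // prints 8
    }
}
```

With a 100% upscale burst, a 4-node cluster facing sustained demand for 10 or more nodes doubles to 8 in one step rather than jumping straight to 10. The slower 25% downscale burst makes the cluster shrink more cautiously than it grows.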

Assumptions
- All the nodes in the cluster are of the same type and imageId, which makes upscaling / downscaling easier.
- Every TT also runs a DataNode alongside it.

Broad Goals
- Less / No manual intervention at all - We're talking about hands-free scaling and not one click scaling.
- Should have few / no changes in the framework - If we start making forks of Hadoop1 / Hadoop2 to support certain features for autoscaling, then most likely we'll have a version lock, which is not a pretty thing 1-2 years down the line.
- Should be configurable - Users willing to dive deeper into configuring their autoscaling should have options to do that. Roughly, this translates to all the configuration settings listed above having sensible defaults.

The larger vision is to see if we can make the entire thing modular enough to support any type of scaling.

Please do share your thoughts if you have any on the subject. 

Sunday, March 15, 2015

Winning #GoPluginChallenge

Winning the #GoPluginChallenge was a very big deal for me. Why? Being acknowledged by a team that rejected me in an interview 3 years ago gives me the utmost satisfaction, a real sense of achievement in life. All thanks and credit go to my mentors, Rajesh and Sriram. Special thanks to Manoj, who gave me all the motivation for building the Github PR plugin. It is also really nice to know that both the plugins I submitted won.

Monday, February 2, 2015

GoCD - Slack Build Notifier

In my last post I wrote about a GoCD plugin that I've been working on. I finally got to complete it this weekend. Check it out at https://github.com/ashwanthkumar/gocd-slack-build-notifier. This is what the final result looks like


There are two features in the plugin that I'm really happy about (apart from pushing messages to Slack):

  1. Pipeline rules set. It is heavily inspired by the current email notification framework available as part of GoCD. Check out the "Pipeline Rules" section in the README.
  2. Notifier is pluggable. Slack Notifier is provided out of the box. With very little change, one can write any type of notification transport using the existing framework.

Overall, it was time well spent, and it helped me write what I guess is the first notification plugin in the GoCD community.

Saturday, January 24, 2015

Slack Java Webhook

GoCD recently added support for a notification extension point. I've started building a Slack notification plugin (it's a WIP here). As part of that, I wrote a Java client for Slack Webhooks. I did find a Java library here which said it was published, but I couldn't find it anywhere on Sonatype / Maven Central. I couldn't publish it myself either, so I took it as inspiration and wrote my own implementation at https://github.com/ashwanthkumar/slack-java-webhook.

Usage


new Slack(webhookUrl)
    .icon(":smiling_imp:") // Ref - http://www.emoji-cheat-sheet.com/
    .sendToUser("slackbot")
    .displayName("slack-java-client")
    .push(new SlackMessage("Text from my ").bold("Slack-Java-Client"));

It gets posted in the Slack channel like below

Dependencies

For Maven,
<dependency>
  <groupId>in.ashwanthkumar</groupId>
  <artifactId>slack-java-webhook</artifactId>
  <version>0.0.3</version>
</dependency>
For SBT,
libraryDependencies += "in.ashwanthkumar" % "slack-java-webhook" % "0.0.3"

Java Utils Library

After a long time, I seem to be writing Java code more often recently: a bunch of GoCD plugins, code kata sessions with friends and things like that. I noticed that for a few things, like List transformations and filter, I automatically start searching for Option / Some / None implementations. The simple solution would be to just write it in Scala, right? I know, but there are places where I wasn't able to; GoCD plugins are one example. The reasons: the Scala standard library is heavy and usually causes OOM on the Agent unless heap sizes are increased, and the final jar is also large.

Check out https://github.com/ashwanthkumar/my-java-utils. If you find that some implementations are inefficient or could be done better, please do let me know.


Features

List

  • Lists#map
  • Lists#filter
  • Lists#foldL
  • Lists#find
  • Lists#isEmpty
  • Lists#nonEmpty
  • Lists#mkString

Set

  • Sets#copy
  • Sets#isEmpty
  • Sets#nonEmpty

Iterable

  • Iterables#exists
  • Iterables#forall

Lang

  • Option / Some / None
  • Tuple2 / Tuple3
  • Function
  • Predicate
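
For readers who haven't used these Scala-style helpers, here is a small self-contained Java sketch of the kind of thing the list above refers to. It is not the library's code, and the signatures are guesses made for illustration: ListsSketch, Fn, Pred and Opt are hypothetical stand-ins for Lists, Function, Predicate and Option. Check the repository for the real API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative stand-ins only; not the actual classes or signatures from my-java-utils.
final class ListsSketch {
    interface Fn<A, B> { B apply(A input); }
    interface Pred<A> { boolean test(A input); }

    /** Minimal Option: holds a value (Some) or nothing (None). A null value counts as None. */
    static final class Opt<A> {
        private final A value;
        private Opt(A value) { this.value = value; }
        static <A> Opt<A> some(A value) { return new Opt<A>(value); }
        static <A> Opt<A> none() { return new Opt<A>(null); }
        boolean isDefined() { return value != null; }
        A getOrElse(A fallback) { return value != null ? value : fallback; }
    }

    /** Applies f to every element and collects the results in order. */
    static <A, B> List<B> map(List<A> in, Fn<A, B> f) {
        List<B> out = new ArrayList<B>(in.size());
        for (A a : in) out.add(f.apply(a));
        return out;
    }

    /** Keeps only the elements that satisfy p. */
    static <A> List<A> filter(List<A> in, Pred<A> p) {
        List<A> out = new ArrayList<A>();
        for (A a : in) if (p.test(a)) out.add(a);
        return out;
    }

    /** Returns the first element that satisfies p, or None if there is no such element. */
    static <A> Opt<A> find(List<A> in, Pred<A> p) {
        for (A a : in) if (p.test(a)) return Opt.some(a);
        return Opt.none();
    }

    /** Joins the elements' string forms with the given separator. */
    static <A> String mkString(List<A> in, String separator) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < in.size(); i++) {
            if (i > 0) sb.append(separator);
            sb.append(in.get(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Lambdas need Java 8+; on older JVMs pass anonymous Fn / Pred classes instead.
        List<Integer> nums = Arrays.asList(1, 2, 3, 4);
        List<Integer> doubledEvens = map(filter(nums, n -> n % 2 == 0), n -> n * 2);
        System.out.println(mkString(doubledEvens, ", "));          // prints 4, 8
        System.out.println(find(nums, n -> n > 10).getOrElse(-1)); // prints -1
    }
}
```

Everything here is plain java.util, so nothing beyond the JDK ends up in the plugin jar, which is the point of avoiding the Scala standard library.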

Dependencies

For Maven,
<dependency>
  <groupId>in.ashwanthkumar</groupId>
  <artifactId>my-java-utils</artifactId>
  <version>0.0.2</version>
</dependency>
For SBT,
libraryDependencies += "in.ashwanthkumar" % "my-java-utils" % "0.0.2"