Sunday, October 18, 2015

Chrome Tamil TTS Engine powered by SSN Speech Lab

This was the outcome of an attempt at a good first impression that went wrong.

I got to know about SSN's Speech Lab yesterday and built a Chrome extension on top of it as a TTS engine. They have built a wonderful system - you should check them out.


Select any Tamil text, right-click, and listen to it in a male or female voice.

You can install the plugin from https://chrome.google.com/webstore/detail/lhalpilfkeekaipkffoocpdfponpojob

The code is available on https://github.com/ashwanthkumar/chrome-tts-tamizh

Details for other extensions that want to use this TTS engine:
- Language - ta-IN
- Gender - male and female
- Voice names - Krishna (male) and Radhae (female)

Monday, October 5, 2015

Introducing scalding-dataflow

For the last 3 days, I've been working on understanding the Google Cloud Dataflow pipeline semantics for batch processing. The result is a ScaldingPipelineRunner for Dataflow pipelines.

NOTICE (You've been warned)
  1. It is still in very early stages.
  2. It doesn't have all translators implemented (as of this writing).
  3. It hasn't been tested on a Hadoop setup yet.
It runs WordCount though :) Do give it a spin. 
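For the curious, a WordCount pipeline against this runner would look roughly like the sketch below. The Dataflow SDK calls are the standard ones; the runner wiring (commented out) is an assumption on my part - check the project's README for the exact class name and registration.

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;

public class WordCount {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    // Assumption: point the pipeline at the scalding-dataflow runner;
    // see the project's README for the exact class and package.
    // options.setRunner(ScaldingPipelineRunner.class);
    Pipeline p = Pipeline.create(options);

    p.apply(TextIO.Read.from("input.txt"))                  // read lines
     .apply(ParDo.of(new DoFn<String, String>() {           // split into words
       @Override
       public void processElement(ProcessContext c) {
         for (String word : c.element().split("\\W+")) {
           if (!word.isEmpty()) c.output(word);
         }
       }
     }))
     .apply(Count.<String>perElement())                     // count each word
     .apply(ParDo.of(new DoFn<KV<String, Long>, String>() { // format output
       @Override
       public void processElement(ProcessContext c) {
         c.output(c.element().getKey() + ": " + c.element().getValue());
       }
     }))
     .apply(TextIO.Write.to("counts"));

    p.run();
  }
}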

It goes well with scalaflow - a Scala DSL for building Dataflow pipelines.

Special thanks to Cloudera's spark-dataflow project. Couldn't have done it without that :)

Friday, September 25, 2015

Winning Amaz-ing Hackathon - Meghadūta

Last weekend was fun at the Amazon office in Chennai. In association with Venture City, the Kindle team at Chennai organized a hackathon on the theme - "Building Scalable Distributed Systems". The theme alone was good enough to get me to register for the event :) I went there with Salaikumar and Vijay Kumar under the team name - "Salaikumar".

You can find the problem statements given at the hackathon here.

We won two awards at the event - "Best Voted Award" and "Ultimate Hack Award".

You can find our code at https://github.com/ashwanthkumar/meghaduta.

A few pictures taken during the event:

Me doing the presentation of our hack

Salai helping me with the screens

Our prizes - a Kindle Paperwhite each and some certificates :)

Saturday, September 12, 2015

Find that missing Host in Hadoop Cluster

I've recently started managing 200-node Hadoop clusters at work. These all run on AWS with the latest CDH5. We moved from a model with separate HDFS and TaskTracker (TT) nodes to co-locating the TT and DataNode (DN) daemons.

These are all Spot instances backed by an ASG (Auto Scaling Group). If any of them die because of spot prices, they come back up in a while. So, to manage these machines better, we attach our own custom-generated DNS names to them.

Once in a while, a machine that comes up doesn't have either the TT or the DN daemon running; they fail at startup for a variety of reasons. The task was to find those missing hosts (generally 1 or 2) out of the lot, so I wrote a script that finds the hosts that aren't running one of the processes.

Gist - https://gist.github.com/ashwanthkumar/3624a4e69ab26236a746
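The gist has the actual script; the rough idea, sketched in Java below (class and file names are made up), is to dump the live TaskTracker and DataNode hostnames - for example from `hadoop job -list-active-trackers` and `hadoop dfsadmin -report` - into two files and print every host that shows up in one list but not the other.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class FindMissingHosts {
  public static void main(String[] args) throws IOException {
    Set<String> taskTrackers = readHosts(args[0]); // hostnames with a live TT
    Set<String> dataNodes = readHosts(args[1]);    // hostnames with a live DN

    // A host present in one set but not the other is missing a daemon.
    for (String host : taskTrackers) {
      if (!dataNodes.contains(host)) System.out.println(host + " is missing a DataNode");
    }
    for (String host : dataNodes) {
      if (!taskTrackers.contains(host)) System.out.println(host + " is missing a TaskTracker");
    }
  }

  private static Set<String> readHosts(String path) throws IOException {
    // one hostname per line
    return new HashSet<>(Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8));
  }
}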


Monday, August 31, 2015

Kitchen Bento Setup

After struggling for a couple of days to create a new VirtualBox image with some packages pre-installed for my kitchen tests, I ended up frustrated, only to later figure out mitchellh/vagrant#5492. So I created kitchen-bento-setup, which helps me create new VirtualBox images from the base opscode box.

For reasons unknown to me, I don't get an "SSH Authentication Failure" when packaging the VBox with `vagrant package` the first time from the base opscode box, but any subsequent `vagrant package` from that pre-built box causes the failure.

Tuesday, June 2, 2015

ClassNotFound inside a Task on Spark >= 1.3.0

Context - Spark 1.3.0, Custom InputFormat and InputSplit.

Problem - At work, I have a custom InputSplit definition that carries an object of another class, A, which I then need to pass to my Key. I have a Spark job that reads data using my custom InputFormat, and things were all fine on Spark 1.2.0. When we upgraded to 1.3.0, things started breaking with the following stack trace.

Caused by: java.lang.ClassNotFoundException: x.y.z.A
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:270)
 at java.io.ObjectInputStream.resolveClass(ObjectInputStream.java:625)
 at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
 at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
 at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at org.apache.spark.util.Utils$.deserialize(Utils.scala:80)
 at org.apache.spark.util.Utils.deserialize(Utils.scala)
 at x.y.z.CustomRecordSplit.readFields(CustomRecordSplit.java:91)

Solution - It took me a while to realize that I had been using Spark's Utils object to serialize and deserialize the object (x.y.z.A). The fix was very simple:

objA = Utils.deserialize(buffer, Utils.getContextOrSparkClassLoader());

It looks like in earlier versions __app__.jar was added to the Executor and Task classloaders, but that's no longer the case in the latest versions. When I passed the context classloader to the deserialization, it worked perfectly fine.
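For context, the split looks roughly like the sketch below - the names CustomRecordSplit and x.y.z.A come from the stack trace above, everything else is a simplified placeholder - with the classloader fix applied in readFields.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.mapred.InputSplit;
import org.apache.spark.util.Utils;

import x.y.z.A; // the class that was failing to load inside the task

public class CustomRecordSplit implements InputSplit {
  private A objA; // the extra object carried along with the split

  @Override
  public void write(DataOutput out) throws IOException {
    byte[] buffer = Utils.serialize(objA); // plain Java serialization via Spark's helper
    out.writeInt(buffer.length);
    out.write(buffer);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    byte[] buffer = new byte[in.readInt()];
    in.readFully(buffer);
    // Pass the context (or Spark) classloader so classes shipped in __app__.jar
    // resolve inside the task; the default classloader no longer sees them on 1.3.0.
    objA = Utils.deserialize(buffer, Utils.getContextOrSparkClassLoader());
  }

  @Override
  public long getLength() { return 0; }                     // omitted for brevity

  @Override
  public String[] getLocations() { return new String[0]; }  // omitted for brevity
}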

Lessons
- Don't use Spark's Utils methods. Even though Utils is a private[spark] object, the Scala package access protection doesn't apply when you call it from a Java class (Scala's access modifiers aren't enforced at the bytecode level). I never knew that until now.
- Always use Utils.getContextOrSparkClassLoader() when doing Java deserialization in Spark.

Wednesday, May 20, 2015

Parser Combinator

After fiddling with Scala's RegexParsers for a while, I realized how much I missed learning automata properly in college. I was migrating a 140-line regex from MySQL to Scala at work and learned a lot of new things in the process. It was during one of those times that one of my mentors - yellowflash - helped me understand some forgotten concepts like left factoring, recursive grammars, etc.

In the process, we discussed how I would go about writing a parser combinator library by hand if I had to. The exercise was to help me understand how parser combinators work, which in turn helps me write better grammars. We did some scribbling on paper and decided to implement it in Scala. You can find it at https://github.com/ashwanthkumar/parser-combinator. It is just a start, still a long way to go. Looking forward to it; it should be fun.
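The project itself is in Scala, but the core idea also fits in a few lines of Java: a parser is just a function from an input string to an optional (value, remaining input) pair, and combinators build bigger parsers out of smaller ones. The sketch below is illustrative only - all the names are made up and it isn't what the repo contains.

import java.util.Optional;
import java.util.function.BiFunction;

// A successful parse: the value produced plus whatever input is left unconsumed.
class Result<T> {
  final T value;
  final String rest;
  Result(T value, String rest) { this.value = value; this.rest = rest; }
}

interface Parser<T> {
  Optional<Result<T>> parse(String input);

  // Sequencing: run this parser, then `next` on the leftover input, combining both values.
  default <U, R> Parser<R> then(Parser<U> next, BiFunction<T, U, R> combine) {
    return input -> parse(input).flatMap(r1 ->
        next.parse(r1.rest).map(r2 -> new Result<R>(combine.apply(r1.value, r2.value), r2.rest)));
  }

  // Alternation: try this parser first, fall back to `other` if it fails.
  default Parser<T> or(Parser<T> other) {
    return input -> {
      Optional<Result<T>> result = parse(input);
      return result.isPresent() ? result : other.parse(input);
    };
  }

  // Primitive: match a literal prefix of the input.
  static Parser<String> literal(String s) {
    return input -> input.startsWith(s)
        ? Optional.of(new Result<String>(s, input.substring(s.length())))
        : Optional.<Result<String>>empty();
  }
}

// e.g. Parser.literal("a").or(Parser.literal("b")).then(Parser.literal("c"), (x, y) -> x + y)
// accepts inputs starting with "ac" or "bc", producing "ac" / "bc".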

Friday, May 15, 2015

[IDEA] Autoscaling in Hadoop

Everybody today uses Hadoop plus some of its ecosystem tools, be it Hive, HBase, etc. I have been using Hadoop for production code since 2012 and for experiments much earlier than that. One thing I don't see anywhere is the ability to autoscale a Hadoop cluster elastically. Unlike scaling web servers, having all map and reduce slots full doesn't necessarily translate to CPU / IO metrics spiking on the machines.

Figure - Usage graph observed from a production cluster (hand-drawn on a whiteboard)

On this front, only the folks at Qubole seem to have done some decent work. You should check out their platform if you haven't - it is really super cool. A lot of the inspiration for this post has come from using them.

This is my hobby-project attempt at building just the autoscaling feature for Hadoop1 clusters, as if I had been part of, say, the Qubole team back in 2012.

In this blog post I talk about the implementation goals and the hows of building this, either as part of InMobi's HackDay 2015 (if I get through the selection) or on that weekend anyway.

For every cluster, you would need the following configuration settings (a sketch of how they might fit together follows the list):

  • minNodes - Minimum # of TTs you would always want in the cluster.
  • maxNodes - Maximum # of TTs your cluster would use at any point in time.
  • checkInterval - Interval in seconds at which to check the cluster for compute demand (default: 60).
  • hustlePeriod - Period in seconds to monitor the demand before we go ahead with upscaling / downscaling the cluster (default: 600).
  • upscaleBurstRate - Rate at which you want to upscale based on the demand (default: 100%).
  • downscaleBurstRate - Rate at which you want to downscale (default: 25%).
  • mapsPerNode - # of map slots per TT (default: based on the machine type).
  • reducersPerNode - # of reduce slots per TT (default: based on the machine type).
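A minimal sketch of how these knobs could turn demand into a node count - nothing here exists yet, and all the names (AutoscalePolicy, desiredNodes, pendingTasks, idleSlots) are hypothetical:

public class AutoscalePolicy {
  private final int minNodes;
  private final int maxNodes;
  private final int mapsPerNode;
  private final double upscaleBurstRate;   // e.g. 1.0 for 100%
  private final double downscaleBurstRate; // e.g. 0.25 for 25%

  public AutoscalePolicy(int minNodes, int maxNodes, int mapsPerNode,
                         double upscaleBurstRate, double downscaleBurstRate) {
    this.minNodes = minNodes;
    this.maxNodes = maxNodes;
    this.mapsPerNode = mapsPerNode;
    this.upscaleBurstRate = upscaleBurstRate;
    this.downscaleBurstRate = downscaleBurstRate;
  }

  // pendingTasks and idleSlots would be sampled every checkInterval and averaged
  // over the hustlePeriod before this decision is taken.
  public int desiredNodes(int currentNodes, int pendingTasks, int idleSlots) {
    if (pendingTasks > 0) {
      // Translate the slot shortfall into machines and burst up by the configured rate.
      int shortfall = (int) Math.ceil((double) pendingTasks / mapsPerNode);
      int toAdd = (int) Math.ceil(shortfall * upscaleBurstRate);
      return Math.min(maxNodes, currentNodes + toAdd);
    }
    if (idleSlots >= mapsPerNode) {
      // Give back only a fraction of the surplus, so downscaling stays conservative.
      int surplus = idleSlots / mapsPerNode;
      int toRemove = (int) Math.floor(surplus * downscaleBurstRate);
      return Math.max(minNodes, currentNodes - toRemove);
    }
    return currentNodes;
  }
}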
Assumptions
- All the nodes in the cluster are of the same instance type and imageId - this makes upscaling / downscaling easier.
- All TTs also have DataNodes running alongside them.

Broad Goals
- Little / no manual intervention - We're talking about hands-free scaling, not one-click scaling.
- Few / no changes to the framework - If we start forking Hadoop1 / Hadoop2 to support certain autoscaling features, we'll most likely end up with a version lock, which is not a pretty thing 1-2 years down the lane.
- Should be configurable - Users willing to dive deeper into configuring their autoscaling should have options to do that. This roughly translates to all the above configurations having sensible defaults.

The larger vision is to see if we can make the entire thing modular enough to support any type of scaling.

Please do share your thoughts if you have any on the subject. 

Sunday, March 15, 2015

Winning #GoPluginChallenge

Winning the #GoPluginChallenge was a very big deal for me. Why? Being acknowledged by a team that rejected me in an interview 3 years ago gives the utmost satisfaction, some sense of achievement in life. All thanks and credit goes to my mentors - Rajesh and Sriram. Special thanks to Manoj, who gave me all the motivation for building the GitHub PR plugin. It is also really nice to know that both the plugins I submitted won together.

Monday, February 2, 2015

GoCD - Slack Build Notifier

In my last post I wrote about a GoCD plugin that I've been working on. I finally got to complete it this weekend. Check it out at https://github.com/ashwanthkumar/gocd-slack-build-notifier. This is how the final result looks:


There are two features in the plugin that I'm really happy about (apart from pushing messages to Slack):

  1. Pipeline rules. They are heavily inspired by the email notification framework available as part of GoCD. Check out the "Pipeline Rules" section in the README.
  2. The notifier is pluggable. The Slack notifier is provided out of the box, and with very little change one can write any other type of notification transport using the existing framework (see the sketch after this list).
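To illustrate the second point, a notification transport could look something like the sketch below. This is purely a hypothetical shape - the plugin's actual interfaces live in the repo and may differ.

// Hypothetical transport interface: the plugin decides *what* to report (via the
// pipeline rules) and hands the message to a transport that decides *where* it goes.
public interface Notifier {
  void notify(String pipeline, String stage, String status, String message);
}

// Slack is just one implementation; anything else is just another class.
class ConsoleNotifier implements Notifier {
  @Override
  public void notify(String pipeline, String stage, String status, String message) {
    System.out.println("[" + pipeline + "/" + stage + "] " + status + " - " + message);
  }
}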
Overall it was time well spent, and it helped me write what I guess is the first notification plugin in the GoCD community.

Saturday, January 24, 2015

Slack Java Webhook

GoCD recently added support for a notification extension point. I've started building a Slack notification plugin (it's a WIP here). As part of that, I wrote a Java client for Slack webhooks. I did find a Java library here that said it was published, but I couldn't find it anywhere on Sonatype / Maven Central, and I can't publish it myself either, so I took it as inspiration and wrote my own implementation at https://github.com/ashwanthkumar/slack-java-webhook.

Usage


new Slack(webhookUrl)
    .icon(":smiling_imp:") // Ref - http://www.emoji-cheat-sheet.com/
    .sendToUser("slackbot")
    .displayName("slack-java-client")
    .push(new SlackMessage("Text from my ").bold("Slack-Java-Client"));

It gets posted in the Slack channel like below:

Dependencies

For Maven,
<dependency>
  <groupId>in.ashwanthkumar</groupId>
  <artifactId>slack-java-webhook</artifactId>
  <version>0.0.3</version>
</dependency>
For SBT,
libraryDependencies += "in.ashwanthkumar" % "slack-java-webhook" % "0.0.3"

Java Utils Library

After a long time, I seem to be writing Java code more often recently - a bunch of GoCD plugins, code kata sessions with friends, and things like that. Whenever I need things like list transformations and filters, I automatically start searching for Option / Some / None implementations. The simple solution would be to just write it in Scala, right? I know, but there are places where I couldn't - GoCD plugins, for example. The reasons: the Scala standard library is heavy and usually causes OOMs on the agent unless heap sizes are increased, and the final jar is also heavy in terms of size.

Check out https://github.com/ashwanthkumar/my-java-utils. If you find some implementations that aren't efficient or could be done better, please do let me know.


Features

List

  • Lists#map
  • Lists#filter
  • Lists#foldL
  • Lists#find
  • Lists#isEmpty
  • Lists#nonEmpty
  • Lists#mkString

Set

  • Sets#copy
  • Sets#isEmpty
  • Sets#nonEmpty

Iterable

  • Iterables#exists
  • Iterables#forall

Lang

  • Option / Some / None
  • Tuple2 / Tuple3
  • Function
  • Predicate

Dependencies

For Maven,
<dependency>
  <groupId>in.ashwanthkumar</groupId>
  <artifactId>my-java-utils</artifactId>
  <version>0.0.2</version>
</dependency>
For SBT,
libraryDependencies += "in.ashwanthkumar" % "my-java-utils" % "0.0.2"