Wednesday, February 29, 2012

Twitter Streaming Limit Workaround


I was working on my final year project (BlueIgnis), which uses the Twitter Streaming API. From their documentation (on the free tier), my understanding is:
  1. One account can open only one streaming connection at any given time.
  2. One IP may be associated with only one account while streaming. Rotating streaming connections across multiple accounts is not allowed and may lead to an IP ban. (All the more reason to use EC2 instances for streaming :P)
  3. One streaming connection may allow up to 400 tracks (different keywords) to filter on.
  4. Respond to 402 error codes with proper HTTP status handling.
  5. Use non-aggressive re-connect policies; give a substantial amount of time between subsequent connection attempts.
  6. Periodically stop the streaming connection, add more tracks (keywords) to the list, and re-start the connection, rather than opening individual connections multiple times.
Based on these understandings, I came up with my own architecture for Twitter streaming. The diagram below represents the overall architecture of my application with respect to the Twitter Streaming component.



Hosting the Twitter streaming component on an EC2 instance, we can handle 400 tracks (keywords) per node, which covers approximately 30-50 customers in my use-case. I periodically (about every 10 minutes) check whether there are any new tracks that need to be added to the node, until it reaches 400. The catch is that I need to know which user requested each track, and the Streaming API in its current form does not give me that directly.
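To make the per-node track management concrete, here is a rough sketch in Java of what that periodic loop looks like. This is not the actual BlueIgnis code; fetchPendingTracks() and restartFilterStream() are placeholder hooks standing in for the datastore lookup and the actual streaming-client call.

import java.util.ArrayList;
import java.util.List;

// Sketch of one streaming node: hold at most 400 tracks and restart the single
// streaming connection only when new tracks have to be added.
public class StreamingNode {
    private static final int MAX_TRACKS = 400;                     // per-connection track limit
    private static final long CHECK_INTERVAL_MS = 10 * 60 * 1000;  // ~10 minutes

    private final List<String> tracks = new ArrayList<String>();

    public void run() throws InterruptedException {
        while (true) {
            boolean changed = false;
            for (String track : fetchPendingTracks()) {             // new keywords requested by customers
                if (tracks.size() >= MAX_TRACKS) break;             // node is full; the rest go to another node
                if (!tracks.contains(track)) {
                    tracks.add(track);
                    changed = true;
                }
            }
            if (changed) {
                // One account, one connection: close the stream and reopen it with the
                // full track list instead of opening additional connections.
                restartFilterStream(tracks);
            }
            Thread.sleep(CHECK_INTERVAL_MS);
        }
    }

    // Placeholder hooks; in BlueIgnis these talk to the datastore and the streaming client.
    private List<String> fetchPendingTracks() { return new ArrayList<String>(); }
    private void restartFilterStream(List<String> tracks) { /* stop old stream, open new filter stream */ }
}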

So I decided to build a local firehose, wherein I stack all the tweets for all the tracks in a single location. Then I use the full-text search feature of MySQL (my datastore) to continuously search for the related tweets, so that I get slightly delayed but still close-to-realtime processing.
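The search side is plain MySQL MATCH ... AGAINST over the stored tweets. Here is a minimal sketch over JDBC, assuming a made-up table and schema (not the actual BlueIgnis schema) with a FULLTEXT index on the tweet text:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Minimal sketch: poll the local "firehose" table for tweets matching a track.
// Assumes a MyISAM table with a FULLTEXT index, for example:
//   CREATE TABLE tweets (id BIGINT PRIMARY KEY, text TEXT, created_at DATETIME,
//                        FULLTEXT KEY ft_text (text)) ENGINE=MyISAM;
public class FirehoseSearch {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/blueignis", "user", "password"); // placeholder credentials
        PreparedStatement ps = conn.prepareStatement(
                "SELECT id, text FROM tweets "
              + "WHERE MATCH(text) AGAINST(?) "
              + "AND created_at > NOW() - INTERVAL 10 MINUTE");
        ps.setString(1, "nokia"); // the customer's track/keyword
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getLong("id") + " : " + rs.getString("text"));
        }
        rs.close();
        ps.close();
        conn.close();
    }
}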

If you have any better ways to get things done, please let me know.

Saturday, February 25, 2012

Hadoop AutoKill Hack

February's hack! As you might already know, I am working on Big Data, and the Hadoop ecosystem is the de facto way we get things done. Along those lines, I was looking into the Hadoop Java API the other day to see what happens under the hood.

More specifically, I was trying to use the JobClient class to see if I could build a custom client or interface to the Hadoop jobs we run on our cluster. While doing that, I wondered: could I add a custom job-timeout feature to Hadoop?

Problem statement: I want to kill any Hadoop job that runs beyond T time units. How do I do it?

So I started writing a custom client that talks to the JobTracker to get the list of running jobs and how long each has been running; if a job exceeds my threshold time limit, I kill it. That is the overall concept, and that is pretty much what I built. The API is simple and straightforward; it took less than an hour of looking into jobdetails.jsp to figure out how to access the jobs from the JobTracker and display their start times.
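The core of it boils down to a handful of calls on JobClient, roughly like the sketch below. This is a trimmed-down illustration of the idea using the old org.apache.hadoop.mapred API, not the exact code in the repo:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobStatus;
import org.apache.hadoop.mapred.RunningJob;

// Sketch: kill any running job that has been alive longer than a given threshold.
public class HadoopAutoKill {
    public static void main(String[] args) throws Exception {
        long thresholdMillis = Long.parseLong(args[0]);    // T, in milliseconds
        JobClient client = new JobClient(new JobConf());   // picks up the cluster config from the classpath
        for (JobStatus status : client.jobsToComplete()) { // jobs that are not yet finished
            long runningFor = System.currentTimeMillis() - status.getStartTime();
            if (runningFor > thresholdMillis) {
                RunningJob job = client.getJob(status.getJobID());
                System.out.println("Killing " + status.getJobID() + " (running for " + runningFor + " ms)");
                job.killJob();
            }
        }
    }
}

Running it through hadoop jar (as shown below) is what puts the cluster configuration on the classpath so the JobConf points at the right JobTracker.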

However, the tricky part was figuring out how to run the damn thing. I kept getting an "IOException: Broken pipe" error. It turned out the way to run it is:

$ hadoop jar JARName.jar

So, yeah, I wrote a small hack for this. You can find it on my GitHub (https://github.com/ashwanthkumar/hadoop-autokill).

Thursday, February 9, 2012

Introducing Scraphp - Web Crawler in PHP


Scraphp (say "Scraph"; the last p is silent) is a web crawling program. It is built as a standalone executable that can crawl websites and extract and store useful content from them. I created this script for a challenge posted by Indix in Jan 2012, in which I was asked to crawl AGMarket (http://agmarknet.nic.in/) to get the prices of all the products and store them. I also had to version the prices so that they persist across dates.
Scraphp was inspired by a similar project called Scrapy, written in Python. This is not an attempt to port it; I just wanted to see how many of its features I could build in less than a day.
One of the major features I would like to call out is that when you crawl a page, you can extract entities out of it based on XPath. Basically, when a page is crawled, I create a bean whose properties are the set of values obtained by applying the given XPath expressions on the page. Each XPath expression is completely independent of the others. Currently Scraphp supports creating only one type of object per page.
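To illustrate the XPath-to-bean idea outside of PHP (Scraphp itself is PHP and reads these mappings from its config file), here is a rough concept sketch in Java using javax.xml.xpath, with made-up field names, XPath expressions and sample markup:

import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Concept sketch: every configured XPath is evaluated independently against the page,
// and the results become the properties of a single "bean" (here just a Map) per page.
public class XPathEntityExtractor {
    public static void main(String[] args) throws Exception {
        String page = "<html><body><h1>Tomato</h1><span id='price'>42</span></body></html>";

        // Field name -> XPath expression (made-up examples; Scraphp keeps these in its config).
        Map<String, String> fieldXPaths = new LinkedHashMap<String, String>();
        fieldXPaths.put("name", "//h1/text()");
        fieldXPaths.put("price", "//span[@id='price']/text()");

        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(page)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        Map<String, String> bean = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> entry : fieldXPaths.entrySet()) {
            // Each expression is independent of the others.
            bean.put(entry.getKey(), xpath.evaluate(entry.getValue(), doc));
        }
        System.out.println(bean); // {name=Tomato, price=42}
    }
}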
Hack into the source code; it is well commented and easy to modify as per your requirements. All the details of the pages to crawl and the XPath queries are provided in configuration.php, or you can supply your own config file (see the Usage section).
Code is available on my Git Repo - https://github.com/ashwanthkumar/scraphp 
I have tried my best to document the entire code well. If you feel any improvements can be made or you have suggestions, please do not hesitate to fork and send me a pull request.

Wednesday, February 8, 2012

Consuming CommonSense Knowledge on MR

ConceptNet5 is one of my all-time favorite datasets out there. I am working with it in more detail for BlueIgnis (more details on this later). Ever since I joined the Big Data team at Mu Sigma, this thought has been lingering in my mind: CN5 is a really large dataset, ~24 GB of exported JSON data, going up to ~111 GB with indices (as explained at the link), so why not use MapReduce to spice things up a bit?

When I joined the company last month, I was told to start with R (the statistical language). I have always wanted to port Divisi2 to Java or PHP so that I can hack into it more. After a day of getting to know R, I wrote a simple wrapper in R to build the CommonSense matrix. Not the entire thing, just a sample of it with made-up data, and I made it work (R code here).

Well, basically it is all doing SVD and operating on its components, U, V and Σ (Sigma), to make predictions. Blah, blah; you could read it on that page in detail if you understand the math (unlike me).
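For reference, the rough shape of it (my own notation, not taken verbatim from the ConceptNet docs): the concept-feature matrix is factored and truncated to rank k, and predictions are read off the low-rank reconstruction,

A \approx \hat{A} = U_k \Sigma_k V_k^{\top}

where \hat{A}_{ij} scores how strongly concept i relates to feature j, even if that pair was never asserted directly.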

What am I trying to do here?
What I was wondering is this: the number of concepts (nodes) in CN5 is way more than I can imagine (I am yet to count them, as I still have a 32-bit system and MongoDB can't hold more than 2 GB of data on Win32 systems. Sigh!), not to mention the relations of each concept, and the same goes for the columns of the matrix. If only I could transfer the data from Mongo to HBase and use Mahout's SVD implementation to build the required matrices and store them back in HBase, I guess that would let me do commonsense-dataset-based processing of data. I need to process realtime tweets and FB posts in Storm for BlueIgnis; would it match the performance on a real-time basis? Is this even possible? I don't have answers to these and many related questions yet. It is just an idea, and I have yet to hack into it more.

Let me know if you have already implemented this or are working along similar lines.

Updates:
Some interesting thoughts on GraphLab usage and performance compared to Mahout's implementation, here (see the comments).

PS: The above idea came up over a cup of tea and some cake, with no work at hand. If you have already got anything like this, I would love to hear from you.

I am Alive!

It has been ages since I wrote a post here. A lot of interesting things have been going on in my life. I will try to keep this space updated in the forthcoming weeks.

I joined Mu Sigma as an intern last month and got into their Innovation and Development (basically R&D) department, working with Big Data. Work so far has been excellent: playing with clusters, learning new languages, and some boring math, but it is all part of the game, isn't it?

PS: This is just an update post, to say I am still alive and not dead.