ConceptNet5 is one of my all-time favorite datasets out there. I am working with it in more detail for BlueIgnis (more details on this later). Ever since I moved into the Big Data practice at Mu Sigma, this thought has been lingering in my mind: CN5 is a really large dataset, ~24 GB of exported JSON that grows to ~111 GB with indices (as explained on the link), so why not use MapReduce to spice things up a bit?
When I joined the company last month, I was told to start with R (the statistical language). I have always wanted to port Divisi2 to Java or PHP so that I could hack on it more. After a day of getting to know R, I wrote a simple wrapper in R to build the CommonSense matrix. Not the entire thing, just a small sample of it with made-up data, and got it working (R code here).
Well, basically it all comes down to doing an SVD and operating on its components - U, V and Σ (Sigma) - to make predictions: cells that were blank in the input but come back with a high score in the truncated reconstruction are the predicted assertions. You can read the page for the details if you understand the math better than I do.
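Just to make that concrete, here is a toy sketch of the trick in Java (using Apache Commons Math rather than Divisi2 or my R wrapper, since Java is where I eventually want this to live). The 4x4 concept/feature matrix and all its values are invented purely for illustration: take the SVD, keep only the top-k singular values, and read predicted scores off the reconstructed matrix.

```java
import org.apache.commons.math3.linear.MatrixUtils;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

public class CommonSenseSketch {
    public static void main(String[] args) {
        // Rows = concepts, columns = features. 1.0 = asserted, 0.0 = unknown.
        // All of this is made-up toy data, just like in my R sample.
        double[][] assertions = {
            {1, 1, 0, 0},   // e.g. "dog"
            {1, 1, 0, 1},   // e.g. "cat"
            {1, 0, 1, 0},   // e.g. "sparrow"
            {0, 0, 1, 0},   // e.g. "airplane"
        };
        RealMatrix a = MatrixUtils.createRealMatrix(assertions);

        // Full SVD: A = U * Sigma * V^T
        SingularValueDecomposition svd = new SingularValueDecomposition(a);

        // Rank-k truncation: keep only the top k singular values/vectors.
        int k = 2;
        RealMatrix uk  = svd.getU().getSubMatrix(0, a.getRowDimension() - 1, 0, k - 1);
        RealMatrix sk  = svd.getS().getSubMatrix(0, k - 1, 0, k - 1);
        RealMatrix vtk = svd.getVT().getSubMatrix(0, k - 1, 0, a.getColumnDimension() - 1);

        // Reconstruct: cells that were 0 but now score high are the
        // "predicted" assertions.
        RealMatrix approx = uk.multiply(sk).multiply(vtk);
        System.out.println("Predicted score for (row 0, feature 3): "
                + approx.getEntry(0, 3));
    }
}
```

Of course this only works because the matrix is tiny and fits in memory; the whole point of this post is that the real CN5 matrix wouldn't.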
What am I trying to do here?
What I was wondering is this: the number of concepts (nodes) in CN5 is way more than I can imagine (I have yet to count them, since I still have a 32-bit system and MongoDB can't hold more than 2 GB of data on Win32 systems. Sigh!), not to mention the relations of each concept, which blow up the columns of the matrix the same way. If only I could move the data from Mongo to HBase, use Mahout's SVD implementation to build the required matrices, and store them back in HBase, that should let me do commonsense-dataset-based processing of data. I need to process real-time tweets and Facebook posts in Storm for BlueIgnis - would this keep up with that in real time? Is it even possible? I don't have answers to these and many related questions yet. It's just an idea; I have yet to hack on it more.
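To make that a little less hand-wavy, here is a purely speculative sketch of the Storm side, assuming the heavy lifting (the SVD itself) has already been done offline by Mahout and the resulting rank-k concept vectors have been written into an HBase table. The table name ("concept_vectors"), column family, qualifier and tuple field names are all invented for illustration; I haven't actually wired any of this up.

```java
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Speculative bolt: looks up a precomputed concept vector in HBase for each
// concept extracted from an incoming tweet or FB post.
public class ConceptLookupBolt extends BaseBasicBolt {
    private transient HTable table;

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        try {
            Configuration conf = HBaseConfiguration.create();
            table = new HTable(conf, "concept_vectors"); // hypothetical table
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // Upstream spout/bolt is assumed to emit one extracted concept per tuple.
        String concept = tuple.getStringByField("concept");
        try {
            Get get = new Get(Bytes.toBytes(concept));
            get.addFamily(Bytes.toBytes("v"));
            Result row = table.get(get);
            if (!row.isEmpty()) {
                // Pass the serialized concept vector downstream for scoring.
                collector.emit(new Values(concept,
                        row.getValue(Bytes.toBytes("v"), Bytes.toBytes("vector"))));
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("concept", "vector"));
    }
}
```

The HTable client isn't serializable, so it is opened in prepare() on each worker rather than in the constructor; a downstream bolt would then score the post's concepts against those vectors. Whether this holds up at tweet-stream rates is exactly the open question above.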
Let me know if you have already implemented this or are working along similar lines.
Updates:
Some interesting thoughts on using GraphLab, and its performance compared to Mahout's implementation, here (see the comments).
PS: The above idea came about over a cup of tea and some cake, with no work to do. If you have already built anything like this, I would love to hear from you.