Saturday, December 25, 2010

Research Engine Update #11

Wow, today was one hell of a day, if I may say so. Early in the day, I was breaking my head over so many concepts, technologies, etc. that I was practically going mad. But now, at the end of it, things are slightly better.

Let me brief you on what I did today.

  1. Installed DB2. Duh, I never wanted to do this.
  2. Created a SPARQL endpoint for the Knowledge Engine. No more dependence on the internet (see the sketch below this list).
  3. Updating (it's on the way) my database with all 28+ GB of my Linked Data.
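
For the curious, here is a minimal sketch of querying a local SPARQL endpoint from Java with Jena; the endpoint URL and the queried resource are placeholders, not our actual setup:

import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.QuerySolution;
import com.hp.hpl.jena.query.ResultSet;

public class LocalSparqlClient {
    public static void main(String[] args) {
        // Placeholder URL; point this at whatever host/port your endpoint runs on.
        String endpoint = "http://localhost:8080/sparql";
        String query =
            "SELECT ?label WHERE { " +
            "  <http://dbpedia.org/resource/Kumbakonam> " +
            "  <http://www.w3.org/2000/01/rdf-schema#label> ?label . }";
        QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query);
        try {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.nextSolution();
                System.out.println(row.getLiteral("label"));
            }
        } finally {
            qe.close();
        }
    }
}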
I'm too damn tired to type any more; I'll brief you on the rest tomorrow.

UPDATE: After 12 hours and 40 minutes, 218.8 MB of data has been successfully uploaded to DB2. At this rate..?!@# OMG! When will my 28 GB get uploaded?!
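
Back-of-the-envelope: 218.8 MB in 12 hours 40 minutes is roughly 17 MB per hour, so 28 GB (about 28,700 MB) works out to something on the order of 1,700 hours, i.e. roughly 70 days at this rate.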

- Ashwanth Kumar and Salaikumar @ Saravanan

Wednesday, December 22, 2010

Research Engine Update #10

Update 9.20 AM:
Wow! We're making some serious progress here. The web search engine on Tapestry works, and it's live and running.

The search index is limited to 5% of the Google Directory URLs.

Link: http://re-iblue.no-ip.org/


Update 10.10 AM:
Added cached-page support to the web search.


Update 10.40 AM:
Added OpenSearch RSS Export.


Update 11.52 AM:
Added general web-history management backed by the Cassandra distributed database.

Research Engine #9

If you have read my SRS, you'll know the Knowledge Engine uses a tool called wikixtractor. It's my project for extracting various kinds of Wikipedia content into RDF (N-Triples) format.
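
For anyone unfamiliar with N-Triples: each line is a single subject-predicate-object statement ending in a dot. An illustrative example in the DBpedia style (the resource name here is made up):

<http://dbpedia.org/resource/Some_Person> <http://dbpedia.org/ontology/birthDate> "1974-06-22"^^<http://www.w3.org/2001/XMLSchema#date> .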

At last I committed the latest version into the repo. It's a fork of the DBpedia Extractor project.

- Ashwanth Kumar

Monday, December 20, 2010

Research Engine #8

Well, this update is pretty late and pretty simple. I've deployed Cassandra on our cluster. With some simple tweaking of the config file, it was a breeze to set up. The requirements are simple (Java 1.6, that's it) and easy to meet.

Next up, tomorrow's update is about using a JPA-compatible library for Cassandra: a beautiful project that helps you get your job done quickly and easily (see the rough sketch after the links below).

Project Website : http://code.google.com/p/kundera/
Documentation (excellent for understanding the Cassandra data model and Kundera usage): http://goo.gl/IXOdB
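
To give a rough feel for it, here's a minimal sketch of a JPA-style entity one might persist through Kundera. The class and fields are hypothetical, and Kundera's exact mapping annotations may differ from plain JPA, so treat this as a sketch rather than its real API:

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.Id;

// Hypothetical entity; Kundera would map something like this onto a
// Cassandra column family.
@Entity
public class WebHistoryEntry {
    @Id
    private String id;          // e.g. userId + ":" + timestamp

    @Column
    private String url;         // the visited page

    @Column
    private String visitedAt;   // when it was visited

    // getters/setters omitted for brevity
}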

PS: This is an update post.

- Ashwanth Kumar

Tuesday, December 14, 2010

Social Networking Platform (IndiKonn)

Project Name - IndiKonn
Project Scenario - Social Networking Platform
Team Members - Lakshmi Narayanan B, Divya K, Jayalakshmi S, and Prasanna Kumaor

You can also download it from here - http://goo.gl/W2c8j

Sunday, December 12, 2010

Knowledge Engine (Developer Preview)

Hello guys,

At last, a working prototype of the Knowledge Engine module. After a lot of trouble finding an implementation technique that is actually feasible, here it is at last. You can find the Knowledge Engine residing here. The implementation was done in less than 2 hours, so no templates were designed; it's a basic HTML page powered by PHP.

Usage Modes for KBEngine

born 1974-06-22
- To get all the people born on the specified date

dead 1974-06-22
- To get all the people who died on the specified date

starring rajinikanth aishwarya
- To find films by the specified actors

birthplace kumbakonam
- To get all the people born in the specified place

deathplace kumbakonam
- To get all the people who died in the specified place

list Company (watch the capitalization)
- To get the list of all the companies available in the system (an illustrative SPARQL mapping follows this list). "Company" can be substituted with any of the following:
  • Company
  • People
  • Actors
  • Airport
  • Country
  • TelevisionShow
  • Artwork
  • FootballEvent
  • Publisher
  • Animal
  • Subject
  • EducationalInstitution
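
Under the hood, each command boils down to a lookup over the DBpedia data. As an illustration only (not necessarily the exact query KBEngine runs), "born 1974-06-22" maps to roughly this SPARQL:

SELECT ?person WHERE {
  ?person <http://dbpedia.org/ontology/birthDate> "1974-06-22"^^<http://www.w3.org/2001/XMLSchema#date> .
}
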
Link: http://ashwanthkumar.in/kbengine/

Please provide your valuable comments for improvement.

PS: This is the Research Engine Update #7

- Ashwanth Kumar

Saturday, December 11, 2010

User Trend Graph - RFC

In my web search module, one of the important features when a user logs in is the ability to filter results based on their likes and activities. In this post, I'm drafting a methodology called the "User Trend Graph", built using data from the Facebook Graph API (Open Graph protocol).

About User Trend Graph
The User Trend Graph (from now on called UGraph) is based on the time, number, and type of activities the user likes on FB. The following graph depicts the UGraph:

UGraph Sample 3D Graph

Here a sample user's likes are being analyzed; his Product/Service category likes outnumber his Application likes. The data used here is random. The graph lines can both increase and decrease, since users can Like and Unlike pages on FB, as depicted above.

Over a period of time, the user's trend can be calculated, projected, and used in any social machine-learning algorithm (is there any such thing?). UGraph can act as a representational medium for the algorithm to act upon.

The time is taken from FB's JSON response (see here for a sample) as "created_time", which denotes the date and time when the user made a connection with that node in the graph (FB's Open Graph).

UGraph depicts the user's activities on the web at a social networking platform. The same could be extended to a broader perspective covering all web activity: Google (or are they probably using something similar already? No idea!), Twitter, FB, etc. could all benefit from understanding their users' context and provide a better service to them.

Since many Web 2.0 services are going social, this could be a profound method for analyzing a user's interests and providing a higher degree of relevance in the context of search engines, user suggestions, etc.

Using UGraph in iBlue
As I said before, we're implementing the UGraph concept in our iBlue as a proof-of-concept application.

Once users log in with their FB account, we cache their Likes JSON in our database (as we're too hard pressed for resources to fetch it dynamically) until they log in again, at which point it is updated. Since the response is in reverse-sorted order, one can use a binary search to identify the previously latest node, delete cached entries that are no longer present (the user would have unliked them), insert the new ones, and update "created_time" for existing entries if they have changed in the meantime. A sketch of this sync step follows.
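
In Java, a minimal sketch of that refresh step (class and field names are made up; the real Graph API response has more fields, and this uses a plain set-diff where the binary-search variant would instead seek the cached head inside the fresh list):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class Like {
    String id;           // the liked node's id
    String category;     // e.g. "Product/service", "Application"
    String createdTime;  // ISO-8601 timestamp from the Graph API
}

class LikeCache {
    // Merge the cached likes with a fresh API response (newest first).
    static List<Like> refresh(List<Like> cached, List<Like> fresh) {
        Map<String, Like> freshById = new LinkedHashMap<String, Like>();
        for (Like l : fresh) freshById.put(l.id, l);

        List<Like> merged = new ArrayList<Like>();
        for (Like l : cached) {
            Like f = freshById.remove(l.id);
            if (f == null) continue;          // gone from fresh list: unliked, drop it
            l.createdTime = f.createdTime;    // refresh timestamp if it changed
            merged.add(l);
        }
        merged.addAll(freshById.values());    // anything left is newly liked
        return merged;
    }
}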

We use the "category" property as the series for each Like of the user. Then dynamically compute the graph, identify the current taste and trend of the user, and sort the results accordingly for the user to view.

Please provide your feedback on this implementation. If such a method already exists, please help me improve it; if not, let's start using it.

Update: This is also Research Engine Update #6.

- Ashwanth Kumar

Friday, December 10, 2010

Research Engine Update #5

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0xb701077a, pid=10042, tid=3018517360
#
# JRE version: 6.0_22-b04
# Java VM: Java HotSpot(TM) Client VM (17.1-b03 mixed mode, sharing linux-x86 )
# Problematic frame:
# V [libjvm.so+0x19f77a]
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
#

--------------- T H R E A D ---------------

Current thread (0x0b878c00): JavaThread "FetcherThread" daemon [_thread_in_vm, id=8797, stack(0xb3e5e000,0xb3eaf000)]

siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0x00000004

Registers:
EAX=0x92e57044, EBX=0x00000000, ECX=0x0b879474, EDX=0x0aae7750
ESP=0xb3eac958, EBP=0xb3eacafc, ESI=0x00000000, EDI=0x0b879470
EIP=0xb701077a, CR2=0x00000004, EFLAGS=0x00010286

Top of Stack: (sp=0xb3eac958)
0xb3eac958: 0b878c00 0b878c00 b732ef90 b3eacae0
0xb3eac968: b3eaca88 b70cbdb2 0b878d18 51071390
0xb3eac978: 00000000 00000000 51070d84 51070dd4
0xb3eac988: 00000044 b3eac9d0 00000002 0b87946c
0xb3eac998: b3eac9c8 0b879468 b7319c08 0b879470
0xb3eac9a8: 0000005b 00000000 0000005b 00000000
0xb3eac9b8: 00000000 b726fc06 0b87946c 00000023
0xb3eac9c8: b3eaca08 b4e02388 00000000 510710d8

Instructions: (pc=0xb701077a)
0xb701076a: 00 00 8b 7d 08 8b 75 0c 8b 07 83 c0 24 8b 1c b0
0xb701077a: 8b 43 04 8d 48 08 8b 40 08 51 ff 90 8c 00 00 00
.........................................................

Full stack trace here: Full Stack trace

I've never seen such JVM runtime errors! Well, the fact is that the Nutch crawler didn't complete properly. I had to start crawling all over again, and I did so.

Man, this is seriously going somewhere; I just can't tell where?!

Tuesday, December 7, 2010

Research Engine Update #4

OMG! Today is the last date for SRS Submission! I'm yet to start mine!

Fine, let me just brief you on what's happening with iBlue.

Updates:
  1. Downloading (46.85% complete as of writing this post) the entire Wikipedia in all languages (though we'll be using only the English (en) and German (de) versions) in N-Triples, around 16 GB! Thanks a lot to DBpedia for its publicly available dataset. It's this dataset that powers the Knowledge Engine component for now, until we think of a better implementation.
  2. Completed the Code Search (like Google Code Search) as a CLI. We're harvesting the Apache, Google Code, and SF.net public SVN repositories (using SVNKit), indexing them, and providing a search layer on top. Lucene is used extensively here for both indexing and searching; see the sketch after this list.
  3. Completed the Web Search over the IBM and SASTRA sites. We're using the industry-standard Nutch crawler for crawling the web, and again Lucene for indexing and searching. Also, clustering plugins like Carrot2 improve result presentation, and the OpenCalais ontology specifications improve query processing.
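
The Lucene piece in a nutshell: a minimal index-then-search sketch using the Lucene 3.x-era API. The file path and content are placeholders, and the SVNKit fetching is omitted:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class CodeSearchSketch {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();  // in-memory index, just for the sketch
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

        // Index one "file" (in reality the content comes from an SVN checkout via SVNKit).
        IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("path", "trunk/src/Foo.java", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("content", "public class Foo { }", Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Search the index for a term and print matching paths.
        IndexSearcher searcher = new IndexSearcher(dir);
        QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
        ScoreDoc[] hits = searcher.search(parser.parse("class"), 10).scoreDocs;
        for (ScoreDoc hit : hits) {
            System.out.println(searcher.doc(hit.doc).get("path"));
        }
        searcher.close();
    }
}
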
- Ashwanth Kumar and Salaikumar @ Saravanan

Friday, December 3, 2010

Research Engine - SRS (From Google Docs)

You can also download the PDF from http://goo.gl/NMbWR

Research Engine Update #3

Wow! At last, Salai and I have settled on the final draft of the Research Engine. Here are its features (trust me, this list is final; it's decided):
  1. Web Search - Combination of WolframAlpha and Google
  2. Code Search - Similar to Google Code Search
Also, users can register using their Facebook account. It also has two main sub-components:
  1. To synchronize bookmarks (Facebook Likes to quantize the search results)
  2. Maintain Web History, similar to Google's Web History.
An alpha version for closed-circle testing will be out soon. Watch this space for more info.

- Ashwanth Kumar and Salaikumar @ Saravanan

Thursday, December 2, 2010

Research Engine - Update #2

Haaaa, a very good morning. My day kicks off with the need to design RE for the final time (re-architect, rather; Salai and I must have redesigned the entire RE about 5-6 times now, we've lost count) before the development work starts (by this afternoon).

Yes, there were earlier prototypes built during our exams (mainly just for testing technologies). Actual product development starts only today.

So, yes, it was decided that RE will not give you search results; instead it helps you find information (after processing unstructured data on the web, like Google Squared {thanks to @Nivas}, Wolfram, etc.). For this we need a way to store the information (knowledge), or in our case (in pure Semantic Web style), the Resource.

Thus, after breaking my head over it (and sleeping for 35 minutes in due course), I've come up with a resource representation technique (a data structure for resources):

Every piece of information found on the web can be said to wander through space (the Web) with no definite path. We pack each such piece of information into an infolet. Since the information is unstructured, the source has no definite domain or context unless it is interpreted from various sources, as in Wikipedia.

Infolet - An entity of information containing subject, context, and data (actual info)
E.g. Ram was born on Jan. 1, 1990.
Subject - Ram
Context - {year 1990, birthday}
Data - "Ram was born on Jan. 1, 1990"

Infomat - Syndicating many infolets based on their subject forms the Infomat. It tells us what something is, but provides no information about where, how, etc.
E.g. General_Info - ({Ram was born on Jan. 1, 1990}, {Ram won his first International Physics Olympiad in 2000}, {Ram joined IIT after securing AIR #1 in JEE}, {Ram fell in love with Sita, and married her in 2017}, {Ram and Sita lived happily ever after});

In the above example, the {infolet}s are combined into a single (infomat) containing a collection or set of infolets based on their subject. Sequence is still a problem unless a year is specified in the text to enhance the context a bit more.

Infonet - A collection of Infomats based on their context (yes, various contextual information about a particular subject). Infonets are always in proper order, as the context is taken into consideration.

E.g. FB_Wall_Sets - ({Ram joined FB}, {Ram added Stanford to his list of schools}, {Ram is preparing for his Advanced Operating Systems overnight}, {Ram had a nice date with the most beautiful girl in the whole galaxy})

From the previous two examples, you can see that the two Infomats named "General_Info" and "FB_Wall_Sets" are combined into a single Infonet as:

Generally - Ram[General_Info,FB_Wall_Sets,...]
In detail - Ram[General_Info({....},{...}),FB_Wall_Sets({...},{...}),...]
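
A minimal sketch of this three-level model as plain Java classes (names and fields are tentative, simply mirroring the definitions above):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Infolet: subject + context + the actual data.
class Infolet {
    String subject;        // e.g. "Ram"
    Set<String> context;   // e.g. {"year 1990", "birthday"}
    String data;           // e.g. "Ram was born on Jan. 1, 1990"
}

// Infomat: many infolets syndicated under one subject, e.g. "General_Info".
class Infomat {
    String name;
    List<Infolet> infolets = new ArrayList<Infolet>();
}

// Infonet: all infomats about one subject, e.g. Ram[General_Info, FB_Wall_Sets, ...].
class Infonet {
    String subject;        // derived from a Facebook RID
    Map<String, Infomat> infomats = new LinkedHashMap<String, Infomat>();
}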

Thus goes my resource representation model. All the subjects are (tentatively) linked to Facebook (I just love FB for the pure interest and care it takes in social networking; no comments about privacy issues, okay?). Using FB Connect, every single Member, Page, Group, everything has an RID (resource identifier) from which all subjects are derived.

PS: We're not affiliated with Facebook in any way.

Any feedback regarding the same will be highly appreciated.

- Ashwanth Kumar

Wednesday, December 1, 2010

Research Engine - Update #1

After a good evening dinner, Salai and I came back to his room to discuss the Research Engine and its implementation methods. The first thing that made us wonder, even while walking to dinner, was: "What exactly does a multi-column database engine do? What is it, exactly? How does it differ from the traditional fixed-column-schema way of writing DBMS apps? And, therefore, are implementations like HBase, Cassandra, Voldemort, etc. actually needed for our project?"
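
For what it's worth, the stock one-line answer we kept running into: a column family behaves conceptually like a nested sorted map, so every row can carry its own set of columns instead of a fixed schema. In Java terms (a mental model, not any real API):

import java.util.SortedMap;
import java.util.TreeMap;

public class ColumnFamilyModel {
    public static void main(String[] args) {
        // row key -> (column name -> value); each row may have different columns.
        SortedMap<String, SortedMap<String, String>> cf =
            new TreeMap<String, SortedMap<String, String>>();

        SortedMap<String, String> row1 = new TreeMap<String, String>();
        row1.put("name", "Ram");
        row1.put("school", "Stanford");   // row1 has a "school" column...
        cf.put("user:1", row1);

        SortedMap<String, String> row2 = new TreeMap<String, String>();
        row2.put("name", "Sita");         // ...row2 doesn't; no fixed schema.
        cf.put("user:2", row2);

        System.out.println(cf);
    }
}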

When we came back, Salai immersed himself in all this (I'm yet to get updates from him on that, as of writing this post), while I was busy prepping the central node (Salai's PC) with Ubuntu 10.04 and updating it. Meanwhile, I was also updating my farm in FarmVille ;)

I was so damn tired after clicking through my 350+ plots of land for seeding that I decided to call it a day and went to sleep.

- Ashwanth Kumar