Tuesday, May 3, 2011

Project UIM - Going Well

Update May 3rd, 2011 - Now UIM consumes Blogger feeds too. :)

For over a week now, or is it two? I don't quite remember. I have been working on something I would like to call Project UIM (Unstructured Information Management). It's basically an iBlue-mini, something like that: an attempt to maintain a large repository of structured information for computer systems.

I took the following sites as my sources of content: Gizmodo, Lifehacker, Mashable and TechCrunch. For almost a week, every article they posted has come into my system for processing within an hour. Using the Semantic Extractor Component (available here for preview), I was able to extract some useful, re-usable information (yeah! re-usable information) out of them.
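For the curious, a minimal sketch of what the hourly feed-polling side of this looks like. The feed URLs here are illustrative placeholders, and the hand-off to the Semantic Extractor Component at the end is a hypothetical call, not its actual API:

```python
import feedparser  # common RSS/Atom parsing library

# Illustrative feed URLs, not necessarily the exact ones I use
FEEDS = [
    "http://example.com/gizmodo/rss",
    "http://example.com/lifehacker/rss",
    "http://example.com/mashable/rss",
    "http://example.com/techcrunch/rss",
]

def poll_feeds(seen_ids):
    """Fetch each feed and yield articles that haven't been processed yet."""
    for url in FEEDS:
        feed = feedparser.parse(url)
        for entry in feed.entries:
            uid = entry.get("id") or entry.get("link")
            if uid and uid not in seen_ids:
                seen_ids.add(uid)
                yield {
                    "title": entry.get("title", ""),
                    "link": entry.get("link", ""),
                    "text": entry.get("summary", ""),
                }

# Hypothetical hand-off to the Semantic Extractor Component,
# run roughly once an hour:
#   for article in poll_feeds(seen_ids):
#       entities = semantic_extractor.extract(article["text"])
#       store(article, entities)
```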

I just ran some diagnostics on it today and would like to share a few stats. My system has consumed over 1061 articles (at the time of writing) from the sources mentioned above, and identified over 3368 entities (across 34 types), 18 categories and 46 relations.
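The per-type breakdown in the table below is really just a group-by over the stored entities. A rough sketch of that diagnostic, assuming each stored entity record carries a "type" field (the record shape shown here is an assumption, not my actual storage format):

```python
from collections import Counter

def entity_type_stats(entities):
    """Count extracted entities grouped by their type label."""
    counts = Counter(e["type"] for e in entities)
    # Sort alphabetically by type name, as in the table below
    return sorted(counts.items())

# Example with a few dummy records:
sample = [
    {"name": "Google", "type": "Company"},
    {"name": "Chennai", "type": "City"},
    {"name": "Android", "type": "OperatingSystem"},
    {"name": "Apple", "type": "Company"},
]
for type_name, count in entity_type_stats(sample):
    print(count, type_name)
```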

Below are the stats of entity count by type. :-)

Count | Type
124 | City
522 | Company
5 | Continent
64 | Country
6 | Currency
4 | EmailAddress
7 | EntertainmentAwardEvent
115 | Facility
11 | Holiday
852 | IndustryTerm
2 | MarketIndex
10 | MedicalCondition
24 | Movie
19 | MusicAlbum
29 | MusicGroup
18 | NaturalFeature
15 | OperatingSystem
154 | Organization
617 | Person
13 | PhoneNumber
1 | PoliticalEvent
380 | Position
80 | Product
11 | ProgrammingLanguage
39 | ProvinceOrState
49 | PublishedMedium
3 | RadioStation
8 | Region
10 | SportsEvent
5 | SportsLeague
114 | Technology
16 | TVShow
1 | TVStation
28 | URL

That's, in my opinion, very decent for a pre-alpha system.

I'm planning to expand the data sources to see how well Project UIM can tame the beast that is the Internet. Any suggestions are welcome; please leave them as a comment.
