Tuesday, May 3, 2011

Project UIM - Going Well

Update May 3rd, 2011 - Now UIM consumes Blogger feeds too. :)

For over a week now, or is it two? I don't quite remember. I have been working on something I would like to call Project UIM (Unstructured Information Management). It's basically an iBlue-mini, something like that: an attempt to maintain a large repository of structured information for computer systems.

I took the following sites as my sources of content: Gizmodo, Lifehacker, Mashable and TechCrunch. For almost a week, every article they posted has come into my system for processing within an hour. Using the Semantic Extractor Component (available here for preview), I was able to extract some useful, re-usable information (yeah! re-usable information) out of them.
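For the curious, a minimal sketch of what the hourly feed-polling side of this looks like. The feed URLs here are illustrative placeholders, and the hand-off to the Semantic Extractor Component at the end is a hypothetical call, not its actual API:

```python
import feedparser  # common RSS/Atom parsing library

# Illustrative feed URLs, not necessarily the exact ones I use
FEEDS = [
    "http://example.com/gizmodo/rss",
    "http://example.com/lifehacker/rss",
    "http://example.com/mashable/rss",
    "http://example.com/techcrunch/rss",
]

def poll_feeds(seen_ids):
    """Fetch each feed and yield articles that haven't been processed yet."""
    for url in FEEDS:
        feed = feedparser.parse(url)
        for entry in feed.entries:
            uid = entry.get("id") or entry.get("link")
            if uid and uid not in seen_ids:
                seen_ids.add(uid)
                yield {
                    "title": entry.get("title", ""),
                    "link": entry.get("link", ""),
                    "text": entry.get("summary", ""),
                }

# Hypothetical hand-off to the Semantic Extractor Component,
# run roughly once an hour:
#   for article in poll_feeds(seen_ids):
#       entities = semantic_extractor.extract(article["text"])
#       store(article, entities)
```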

I just ran some diagnostics on it today and would like to share a few stats. My system has consumed over 1061 articles (at the time of writing) from the sources mentioned above, and identified over 3368 entities (across 34 types), 18 categories and 46 relations.
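The per-type breakdown in the table below is really just a group-by over the stored entities. A rough sketch of that diagnostic, assuming each stored entity record carries a "type" field (the record shape shown here is an assumption, not my actual storage format):

```python
from collections import Counter

def entity_type_stats(entities):
    """Count extracted entities grouped by their type label."""
    counts = Counter(e["type"] for e in entities)
    # Sort alphabetically by type name, as in the table below
    return sorted(counts.items())

# Example with a few dummy records:
sample = [
    {"name": "Google", "type": "Company"},
    {"name": "Chennai", "type": "City"},
    {"name": "Android", "type": "OperatingSystem"},
    {"name": "Apple", "type": "Company"},
]
for type_name, count in entity_type_stats(sample):
    print(count, type_name)
```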

Below are the stats of entity count by type. :-)

Count | Type
124 | City
522 | Company
5 | Continent
64 | Country
6 | Currency
4 | EmailAddress
7 | EntertainmentAwardEvent
115 | Facility
11 | Holiday
852 | IndustryTerm
2 | MarketIndex
10 | MedicalCondition
24 | Movie
19 | MusicAlbum
29 | MusicGroup
18 | NaturalFeature
15 | OperatingSystem
154 | Organization
617 | Person
13 | PhoneNumber
1 | PoliticalEvent
380 | Position
80 | Product
11 | ProgrammingLanguage
39 | ProvinceOrState
49 | PublishedMedium
3 | RadioStation
8 | Region
10 | SportsEvent
5 | SportsLeague
114 | Technology
16 | TVShow
1 | TVStation
28 | URL

That's, in my opinion, very decent for a pre-alpha system.

I'm planning to expand the data sources to see how well Project UIM can tame the beast that is the Internet. Any suggestions are welcome; please leave them as a comment.
