Update May 3rd, 2011 - Now UIM consumes Blogger feeds too. :)
I took the following sites as my sources of content: Gizmodo, Lifehacker, Mashable, and TechCrunch. For almost a week, every article they posted flowed into my system for processing within an hour of publication. Using the Semantic Extractor Component (available here for preview), I was able to extract some useful, re-usable information (yeah, re-usable information!) out of them.
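To give a feel for the ingestion side, here is a minimal sketch of pulling items out of an RSS feed so they can be handed to an extractor. This is an illustrative assumption on my part, not the actual UIM pipeline, and the sample feed content is invented; it only shows the general shape of the step.

```python
# Illustrative sketch only -- not the actual UIM ingestion code.
# Parses an RSS 2.0 document and yields (title, link) pairs that a
# semantic extractor could then process. Uses only the stdlib.
import xml.etree.ElementTree as ET

def parse_rss_items(rss_xml):
    """Yield (title, link) for each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        yield title, link

# Invented sample feed, standing in for a real fetched feed document.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example Tech Feed</title>
  <item><title>Google ships new Android build</title>
        <link>http://example.com/1</link></item>
  <item><title>Apple updates iOS</title>
        <link>http://example.com/2</link></item>
</channel></rss>"""

items = list(parse_rss_items(SAMPLE_FEED))
print(items[0][0])  # → Google ships new Android build
```

In the real system each parsed item would then be sent to the Semantic Extractor Component for entity, category, and relation extraction.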
I just ran some diagnostics on the data today and would like to share some stats. At the time of writing, my system has consumed over 1,061 articles from the sources above, and has identified over 3,368 entities (across 34 types), 18 categories, and 46 relations.
Below is a breakdown of the entity counts by type. :-)
| count | type |
|-------|------|
| 124 | City |
| 522 | Company |
| 5 | Continent |
| 64 | Country |
| 6 | Currency |
| 4 | EmailAddress |
| 7 | EntertainmentAwardEvent |
| 115 | Facility |
| 11 | Holiday |
| 852 | IndustryTerm |
| 2 | MarketIndex |
| 10 | MedicalCondition |
| 24 | Movie |
| 19 | MusicAlbum |
| 29 | MusicGroup |
| 18 | NaturalFeature |
| 15 | OperatingSystem |
| 154 | Organization |
| 617 | Person |
| 13 | PhoneNumber |
| 1 | PoliticalEvent |
| 380 | Position |
| 80 | Product |
| 11 | ProgrammingLanguage |
| 39 | ProvinceOrState |
| 49 | PublishedMedium |
| 3 | RadioStation |
| 8 | Region |
| 10 | SportsEvent |
| 5 | SportsLeague |
| 114 | Technology |
| 16 | TVShow |
| 1 | TVStation |
| 28 | URL |
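A per-type tally like the one above can be produced with a simple count over extracted (entity, type) pairs. The sketch below uses invented example pairs, not real extraction output, and is only meant to show how the diagnostic numbers could be computed.

```python
# Minimal sketch: tallying extracted entities by type, as in the table above.
# The (entity, type) pairs are invented examples, not real extraction output.
from collections import Counter

extracted = [
    ("Google", "Company"),
    ("Apple", "Company"),
    ("San Francisco", "City"),
    ("Steve Jobs", "Person"),
    ("cloud computing", "IndustryTerm"),
]

counts = Counter(etype for _, etype in extracted)
for etype, n in sorted(counts.items()):
    print(n, "|", etype)
```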
That is, in my opinion, very decent for a pre-alpha system.
I'm planning to expand the data sources to see how well Project UIM can tame the beast that is the Internet. Any other suggestions are welcome; please leave them as a comment.