Thursday, February 9, 2012

Introducing Scraphp - Web Crawler in PHP


Scraphp (say Scraph, last p is silent) is a web crawling program. It is basically built to be a standalone executable which can crawl websites and store extract useful content out of it. I created this script for a challenge posted by Indix on Jan 2012, where in I was asked to crawl AGMarket (http://agmarknet.nic.in/) to get the prices of all the products, and store their prices. I also had to version the prices such that it should persist across dates.
Scraph was inspired from a similar project called Scrappy, written in Python. This is not an attempt to port it, but just wanted to see how much similar properties can I build from it in less than a day.
One of the major features I would like to call it is, When you crawl the page you can extract entites out of it based on XPath. So basically when we crawl a page I create a bean whose properties are set of values got by applying the given XPath on the page. Each XPath is completely independent of the other. Currently Scraph supports creating only 1 type of object per page.
Hack into the source code, its well commented and easy to modify as per requirement. All the details of the crawling page, XPath queries are all provided in the configuration.php or you can supply your own config file, see the Usage. 
Code is available on my Git Repo - https://github.com/ashwanthkumar/scraphp 
I have tired my best to document the entire code well, and if you feel like any improvements can be made or you have got any suggestions? Please do not hesitate to fork and send me a pull request. 

No comments:

Post a Comment