Tuesday, February 06, 2007

Ruby Scraping

You've always got to pick the right tool for the job, and many would say that Ruby is the right tool for most jobs. It is pretty good, that's for sure.

But with Hpricot, it is a no-brainer when it comes to web scraping. Hpricot is a library for extracting content from web pages so you can do with it what you will. Chief among the features you'll want from such a library are simple and fast ways to parse the document tree of the page you are scraping, and Hpricot has them in abundance. I haven't found anything simpler.
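To make the idea of walking a page's tree concrete, here is a rough sketch. It deliberately uses REXML, which ships with the standard Ruby distribution, as a stand-in rather than Hpricot itself, so it runs without installing any gems. The page fragment, the class names, and the `headlines` helper are all made up for illustration.

```ruby
require 'rexml/document'

# Made-up, well-formed page fragment. REXML needs valid XML;
# real-world HTML is usually messier than this.
PAGE = <<~HTML
  <html>
    <body>
      <div id="content">
        <div class="post">
          <h2><a href="/posts/1">First post</a></h2>
        </div>
        <div class="post">
          <h2><a href="/posts/2">Second post</a></h2>
        </div>
      </div>
    </body>
  </html>
HTML

# Hypothetical helper: returns [title, href] pairs for every post link.
def headlines(markup)
  doc = REXML::Document.new(markup)
  # The XPath expression is the "path through the tree" to the bits we want.
  REXML::XPath.match(doc, "//div[@class='post']/h2/a").map do |link|
    [link.text, link.attributes['href']]
  end
end

headlines(PAGE).each { |title, href| puts "#{title} -> #{href}" }
```

The hard part in practice is working out that path expression for a real page, not the few lines of Ruby around it.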

And then just now I found out about the Firebug extension for Firefox. One of the tricky things with scraping is manually figuring out the path through the tree you need to traverse to get to the bit of the page you are looking for. This blog post shows how much simpler it is with Firebug...

Ruby Screen-Scraper in 60 Seconds
