But with Hpricot, web scraping is a no-brainer. Hpricot is a Ruby library for extracting content from web pages, to do with what you will. Chief among the features you want from such a library are simple, fast ways to parse the tree of the page you are scraping, and Hpricot has them in abundance. I haven't found anything simpler.
And just now I found out about the Firebug extension for Firefox. One of the tricky parts of scraping is manually working out the path through the tree that you need to traverse to reach the bit of the page you are looking for. This blog post shows how much simpler that is with Firebug...
Ruby Screen-Scraper in 60 Seconds