Screen-Scraping and RSS Feeds
April 29, 2009
Much of what I read on a regular basis comes to me via RSS feeds. Sure, they're not for everybody, but I'm hooked. The downside is the rare site I'd like to follow that doesn't provide a feed. What's a geek to do but write some code to generate one?
Specifically, it was sweet adeline that drove me to figure out how to go about this. Charlie has been working on the site forever, it seems (thank you!), and I suspect he doesn't have the time or resources to move the site, especially the eight-plus years of news, to a content management system or blogging platform that would generate a feed for him.
Generating markup (HTML, XML) is the easy part, but parsing it, especially the prone-to-sloppiness HTML, is not my idea of a good time. HTML, however, was all I had to work with from sweet adeline, so this endeavor started with little more than a sense of doom. Thankfully I ran across Beautiful Soup. The only drawback was that I'd have to learn some Python, but I had been meaning to do that since I read Programming Collective Intelligence, in which all the example code is in Python.
For example, to grab the first title for the feed, I used Beautiful Soup to find the content in this HTML:
<TD width=580 vAlign=top bordercolor="#CCCCFF" bgcolor="#FFFFFF" class=usual><P><strong><font size="2" face="Verdana, Arial, Helvetica, sans-serif">33
1/3 series xo book coming in april, larry crane interview, josie cotton,
magnet's the over/under + more</font></strong></P>
using this:
soup.find( 'td', {'width' : 580} ).p.strong.font
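To make that lookup easier to try out, here is a minimal sketch that runs it against a trimmed copy of the markup above. It assumes the later bs4 package (not necessarily the version the original script used), and the second paragraph in the sample HTML is a made-up stand-in for the real description.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup quoted above. The second <p> is a
# placeholder invented for this sketch, not text from the real site.
html = """
<table><tr>
<TD width=580 vAlign=top bgcolor="#FFFFFF" class=usual><P><strong><font size="2">33
1/3 series xo book coming in april, larry crane interview</font></strong></P>
<p>Placeholder description paragraph.</p>
<hr>
</TD>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

# Same lookup as in the post: the 580-wide cell, then p > strong > font.
cell = soup.find("td", {"width": "580"})
title_tag = cell.p.strong.font

# The title wraps across lines in the source, so collapse the whitespace.
title = " ".join(title_tag.get_text().split())

# The next <p> after the title paragraph serves as the item description.
description = cell.p.find_next_sibling("p").get_text(strip=True)

print(title)
print(description)
```

Chaining attribute access like `cell.p.strong.font` is compact, but it raises an AttributeError as soon as one of the tags is missing, so a real script would probably want to check for None along the way.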
From there I found the next paragraph to use as the feed item's description. Then I looped through all of the HRs to find the rest of the news / feed items using soup.findAll('hr'), and then I had all the data I needed for a feed.
The resulting feed is at: feeds2.feedburner.com/SweetAdeline
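Once the titles and descriptions are in hand, writing the feed itself is the easy part. The sketch below is one way to do it with the standard library's ElementTree, not necessarily how my script did it. The build_rss helper, the feed metadata, and the sample item are all hypothetical. There is no pubDate element, which matches the date-free feed.

```python
import xml.etree.ElementTree as ET


def build_rss(feed_title, feed_link, feed_description, items):
    """Return an RSS 2.0 document as a string.

    `items` is a list of dicts with "title" and "description" keys
    (a hypothetical shape chosen for this sketch).
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = feed_description

    for item in items:
        node = ET.SubElement(channel, "item")
        # ElementTree escapes &, < and > in the text for us.
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "description").text = item["description"]

    return ET.tostring(rss, encoding="unicode")


# Placeholder data standing in for the scraped news items.
items = [
    {"title": "larry crane interview + more", "description": "News & notes"},
]
feed_xml = build_rss("Example feed", "http://example.com/", "Hypothetical feed", items)
print(feed_xml)
```

Building the tree instead of concatenating strings means special characters in the scraped text get escaped automatically.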
Along the way I discovered that Python can pre-compile regular expressions with re.compile (I used them to collapse extraneous whitespace and strip HTML tags). I also discovered that parsing the inconsistently formatted date at the end of each post was less than fun, so there are no dates in the feed.
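The regular expression cleanup can be sketched roughly like this. The two patterns and the clean helper are illustrative guesses rather than the ones from my script, and stripping tags with a regex is a crude trick that only suits small, simple fragments like these.

```python
import re

# Compile once, reuse for every item in the loop.
TAG_RE = re.compile(r"<[^>]+>")       # anything that looks like a tag
WHITESPACE_RE = re.compile(r"\s+")    # runs of spaces, tabs, newlines


def clean(fragment):
    """Strip tags, then collapse whitespace runs into single spaces."""
    without_tags = TAG_RE.sub(" ", fragment)
    return WHITESPACE_RE.sub(" ", without_tags).strip()


cleaned = clean('<font size="2">33\n1/3 series xo book   coming in april</font>')
print(cleaned)
```

Replacing each tag with a space instead of nothing keeps words on either side of a tag such as a line break from running together. The whitespace pass then tidies up any doubled spaces.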
What I love about solutions like this is how succinct the code is. I made no effort to minimize my code and, in the end, my script was 56 lines (not counting blank lines and comments), or only 23 lines if I exclude the newlines in the RSS output.
It is worth noting that the future of Beautiful Soup is sketchy.