Tue 2012-02-21 10:17

Reading RSS and Atom Feeds

I recently started writing my own RSS and Atom feed reader. I’ll write another blog post about why and how I am doing that. Today I want to focus just on one part of the job: Reading the RSS and Atom feeds.

You would think that’s an easy job. Lots of people need to parse RSS and Atom feeds, so there should be libraries around. All I need is something that reads the feeds and stuffs the result into an SQLite database. I don’t really care much which programming language this part uses, because it is separate from the user interface anyway.

So first I tried my favourite programming language: Ruby. Great, an RSS/Atom library is part of the standard Ruby library. But I couldn’t figure out how to use it. It has what feels like millions of classes for all the different RSS versions (at least three different versions are in active use out on the Interwebs) and for Atom feeds. But I don’t care what format the feed is in. All I want is the data in a normalized form, irrespective of the type of feed it came from.

So I looked around and found some other Ruby libraries, but they either wouldn’t install because of broken dependencies, or didn’t handle the Atom format, or didn’t give me all the data I need.

Ok, let’s check the old workhorse: Perl. Yes, there is a library called XML::Feed that does exactly what I need: it reads all kinds of feeds and gives me one interface to them. Great. A quick script to test it and… everything works fine, except I get problems with Unicode characters. That old Perl malady. Perl is from the old days when Unicode did not exist. And somehow, despite all the features that have been built into Perl to make it work, it still does not work out of the box. There is a long man page, ‘perlunicode’, that explains everything, with lots of options to fiddle with. I fiddled for a few minutes and could not get it to work. I remember doing that years ago. This is the new millennium. I don’t want to do this any more.

Okay, back to the drawing board. Looked at a few more things, even C++ libraries. But then I decided, I could just roll my own. Take some XML library to parse the feed into a DOM and use XPath on it. How hard can it be?

First try with REXML, the XML library that comes with Ruby. Didn’t work because it complained some feed had invalid XML. The XML looked ok to me and to xmllint. Okay, try the Nokogiri XML library. I had seen/heard several people praise it. The documentation is not that great, but with a bit of experimentation I could get it to run.

I read the feed, let Nokogiri generate a DOM and then use XPath to get the data out. Nokogiri has this very practical function remove_namespaces!. Call it on the XML document and all the namespaces are gone. Only the plain XML elements remain. So I don’t have to think about all the different versions and dialects of RSS and Atom and their optional add-ons. I just check whether there is a /feed/title or /rss/channel/title and so on, and take the first one I find. If there is no description then it might be called subtitle in this version, etc. Not the cleanest way to parse XML, but hey, at least I am not using Regular Expressions. Some experiments and a few helper functions later I now have about 150 lines of code that update all the feeds in the database.

It took me only a few hours to write that code, about the same amount of time I spent hunting for a library. That time includes rooting around RSS spec docs and trying it out on the hundred-odd feeds that I regularly read. See, it wasn’t so hard after all!

I am not sure what the lesson here is. Sometimes it is better to roll your own? Or: You can be more productive if you have a good library to build on (in this case Nokogiri)? Probably both.

Tags: dev · rss