Last year I posted a few times about the aggregation code I wrote to allow Greenbelt to collect festival-related content scattered around the web and republish it. What I may not have gone into was how frustrating that code tended to be to work with, written in a rush before the festival and heavily patched while on site.
This year, with longer to prepare, I decided to throw that one away and start again. I chose Python as the language again, partly because I wanted to use some Python libraries and partly because it seemed time to get some more Python practice in. I also decided that rather than have the parsers for each service (currently technorati, del.icio.us, flickr, pubsub, and magnolia) each update the database directly, it was time for some abstraction and layering.
This time around I’ve written independent extraction classes for each of the services I want to use, with each returning its data as Atom entries. That Atom is then fed into a ‘reasoner’ that checks whether we’ve already seen the entry, and creates or updates our store accordingly. Using Atom as the intermediary made sense, as much of the data is already sourced in Atom (or forms that map closely to it), and the format's requirements for a unique ID and an updated time make updates simple to manage.
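In outline, the reasoner only needs the two elements Atom requires of every entry: the unique `id` tells it whether an entry has been seen before, and the `updated` timestamp tells it whether a seen entry has changed. Here's a minimal sketch of that decision logic; the class and field names are my own illustrative choices, not the actual code, and a dict stands in for the real store:

```python
from dataclasses import dataclass


@dataclass
class AtomEntry:
    # The two fields Atom mandates for every entry; a real entry
    # carries title, content, links, etc. as well.
    id: str       # globally unique identifier
    updated: str  # ISO 8601 timestamp; comparable as strings
                  # when all sources use the same timezone (Z)


class Reasoner:
    """Decide whether an incoming entry is new, changed, or already seen."""

    def __init__(self):
        self.store = {}  # id -> AtomEntry; stand-in for the database

    def accept(self, entry):
        seen = self.store.get(entry.id)
        if seen is None:
            self.store[entry.id] = entry
            return "created"
        if entry.updated > seen.updated:
            self.store[entry.id] = entry
            return "updated"
        return "unchanged"
```

Because every extractor emits the same shape, the reasoner never needs to know which service an entry came from; feeding it the same entry twice is harmless, which is exactly what you want when polling feeds on a schedule.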
It’s also ready-serialized, and so nice and portable. To test that each component is working, I just have to inspect the Atom it produces, which is easy to do visually or programmatically. If I wanted to spread the code across servers it’d be trivial to do so using a toolkit such as WSGI and the Atom Publishing Protocol.
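The programmatic check can be as simple as parsing an extractor's output and confirming every entry carries the required `id` and `updated` elements. A sketch with the standard library (the function name is hypothetical, not from the actual code):

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace


def check_feed(atom_xml):
    """Parse an Atom feed and return (id, updated) for each entry,
    raising ValueError if any entry lacks either required element."""
    root = ET.fromstring(atom_xml)
    pairs = []
    for entry in root.iter(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        updated = entry.findtext(ATOM + "updated")
        if not entry_id or not updated:
            raise ValueError("entry missing required id or updated")
        pairs.append((entry_id, updated))
    return pairs
```

Running each extractor through a check like this catches a broken service parser before its output ever reaches the reasoner.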
The festival is not yet upon us, so the code has yet to be battle-tested. With over 1200 photos posted on flickr last year and a much bigger push this year, we’re expecting a lot of content. It’s good to know we have a cleanly separated, maintainable code base this time around. If it works as well as it should, I’ll try to publish the code somewhere.