Collage Phase 1 Review

I wanted to get a few thoughts on this written up, and thought this as good a place as any to post it… Once the code is a little more stable, I’ll probably post it somewhere for others to use.

The first phase of the "Collage" experiment has been very successful. Less than a week after the festival it had indexed around 1400 postings, largely made up of over 1100 flickr photos. It’s helped us get a good overview of online discussion of Greenbelt and provided a great supplement to the photographs and reviews already gathered on the festival website. On top of that, it required very little attention over the festival weekend, which is a benefit not to be underestimated. And ‘greenbelt2005’ rose as high as #3 in the top tags of its given week on flickr.

Over time I have tinkered with the underlying software quite a bit. The first version of the data mining software was written in quite a rush and was a very basic layer between the universal feed parser and a MySQL database. Word about tagging hadn’t spread far enough in time, so I quickly realised the need to add watchlists for links to greenbelt.org.uk from pubsub.com and technorati. I also soon found that, as people began to upload larger quantities of photos, relying on the flickr atom feed wasn’t going to suffice.

At that stage I broke the code into three parts: a database interface class, a general feed parsing class for most content, and a class based on the flickr API to gather photos. Without much space to test I also wrote a de-duping script, which came in very handy. The flickr interface checks all photos added within a given period, which is set slightly longer than the interval between runs of the scripts to ensure no photos are missed.
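
Because the look-back period is longer than the gap between runs, the same photo can turn up on consecutive runs, which is where de-duping earns its keep. Here is a minimal sketch of that idea rather than the actual scripts: the names `lookback_window` and `dedupe` are hypothetical, photos are assumed to be plain dicts with an `id` key, and the hourly schedule with ten minutes of overlap is made up for illustration.

```python
from datetime import datetime, timedelta


def lookback_window(now, run_interval, overlap):
    """Return (start, end) covering slightly more than one run interval.

    The extra `overlap` means a late or slow run still picks up every
    photo, at the cost of seeing some photos twice.
    """
    return now - (run_interval + overlap), now


def dedupe(records, seen_ids):
    """Yield only records whose 'id' is not in seen_ids, recording new ids."""
    for record in records:
        if record["id"] not in seen_ids:
            seen_ids.add(record["id"])
            yield record


if __name__ == "__main__":
    # Hypothetical figures: scripts run hourly with ten minutes of overlap.
    start, end = lookback_window(
        datetime(2005, 9, 1, 12, 0), timedelta(hours=1), timedelta(minutes=10)
    )
    print(start, end)

    seen = set()
    first_run = [{"id": "101"}, {"id": "102"}]
    second_run = [{"id": "102"}, {"id": "103"}]  # "102" falls in the overlap
    print([r["id"] for r in dedupe(first_run, seen)])
    print([r["id"] for r in dedupe(second_run, seen)])
```

In a database-backed version the set of seen ids would more likely be a unique key on the photo id, but the effect is the same: the overlap guards against gaps and the de-dupe step absorbs the repeats.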

I also added to the collection a script that takes the URL of a blog entry, visits that entry and picks up any feed that is auto-detectable from it. I used that to pull out more content for entries flagged up using del.icio.us and technorati, and to look for more tags/categories, since technorati’s feed only provides details of the tag queried on rather than listing all of an entry’s categories. (NB: Technorati have just added an API call to get a list of an entry’s tags. That’s a good step, though to reduce overheads it would be nice to have them in the feed too.) At present that script is run manually, but I intend to roll it into the main codebase so that every entry goes through that process.
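
For anyone unfamiliar with feed auto-detection, it usually means looking in the page’s HTML for `<link rel="alternate">` tags whose type is an Atom or RSS media type. The sketch below shows that lookup using only the Python standard library. It is an illustration, not the script described above: `FeedLinkFinder` and `find_feeds` are hypothetical names, and fetching the page is left out so the example runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Media types that conventionally mark a link as a syndication feed.
FEED_TYPES = {"application/atom+xml", "application/rss+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect hrefs from <link rel="alternate"> tags that point at feeds."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        if (
            "alternate" in rels
            and attr.get("type", "").lower() in FEED_TYPES
            and attr.get("href")
        ):
            # Resolve relative hrefs against the URL of the blog entry.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def find_feeds(html, base_url):
    """Return absolute URLs of any feeds advertised in the given HTML."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds


if __name__ == "__main__":
    # Made-up page and URL, purely for illustration.
    page = """<html><head>
    <link rel="stylesheet" href="/style.css">
    <link rel="alternate" type="application/atom+xml" href="/atom.xml">
    </head><body>...</body></html>"""
    print(find_feeds(page, "http://example.org/blog/2005/08/entry.html"))
```

Once a feed URL is found, it can be handed to the same feed parsing class as everything else, which is what makes rolling this step into the main codebase attractive.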

There are a number of enhancements that I would still like to make to the project at the data-gathering level, such as improved detection of entry dates, a way to capture world events (perhaps using bbc or google news feeds) to provide the event with some context, and a re-indexing system to pick up changed content and/or new tags.

Future iterations of the scripts will need to be more sensitive to licensing issues (which would probably need to be accompanied by improved education for community members on licensing options for blogs and photos). For an event such as greenbelt, which sees people come from across a wide geographical area to one focussed location, it would be good to have geo-data so that we could plot the coalescing of the points from which entries are made, but it is unlikely that we will have a critical mass of data for that any time soon.

The combination of Python for the backend and PHP for the frontend worked well. Though chosen simply because of some server-configuration issues beyond our control, the two languages were very well suited to their tasks. Working mainly from syndication feeds rather than dealing with the various APIs was a great way to prototype the system, but doesn’t really scale. Hopefully as atom becomes more established and we become more used to using feeds for data exchange the feeds will become more customisable, but for now the APIs are most appropriate for apps that need to deal with larger volumes of content.

Aside from purely technical issues, this first deployment has shown that there is still some way to go in educating users about the power and flexibility of tagging. While many users have happily added the requisite HTML (which was provided) for tagging blog entries ‘greenbelt2005’, few have added any other tags, and most of the flickr photos are solely tagged ‘greenbelt2005’. This means that the navigation options, and our ability to track which areas of the festival attracted most festival-goer photography, are considerably below what we might have hoped.