a work on process

Viewing posts tagged: Atom

A couple of releases

9 April 2007 (9:59 am)

By James Stewart
Filed under: Announcements

In the process of catching up with some neglected tasks, I’ve pushed out new releases of both of my PEAR packages.

Services_Technorati receives a version number bump, and little else. The alpha release was never meant to last quite this long given that it’s merely a port of a very stable package, and it’s finally marked beta. My hope is that the beta release will pick up a few more users to put it through its paces.

I had wondered about adding some extra classes to encapsulate responses, but at the end of the day SimpleXML does a decent job, is well documented, and doesn’t add any overhead, so I’m happy just returning its objects and letting people work with them.

There are also a couple of bug fixes for the stable release of XML_Feed_Parser, kindly contributed by users. There are still a couple of outstanding tickets, but they’re issues which require more thought, so I’m postponing them until 1.0.3 or 1.1.0.


Quick and Easy Feeds with Camping

26 February 2007 (7:34 pm)

By James Stewart
Filed under: Notes

Rails is great for many things, but for very small apps it can definitely be overkill. That’s where Camping, the micro-framework from why the lucky stiff, comes in. Where Rails gets you started with a clearly defined structure and generally presumes you’re going to want to use a database, Camping makes no such assumptions and just provides a few nice hooks for micro apps.

I got started using Camping a couple of months ago. With a lot of travel coming up, I’m eager to keep up to date with special deals on flights and frequent flyer miles, and I stumbled across milemaven.com, which seemed a great source of that information. But it doesn’t provide feeds and I have no desire to visit the site every day, so I decided to dust off Hpricot and combine it with Camping to scrape the site and deliver the contents to my news reader.

If I wanted to be strict about MVC, I’d probably do the actual scraping in the model, since for this app that’s the data store/source. But in the interests of simplicity I did the parsing in the controller, and even so my entire controller comes out at only 29 lines. A version with a few extra comments looks like:

# open-uri supplies open() for URLs; hpricot supplies the HTML parser
require 'open-uri'
require 'hpricot'

module Milemaven::Controllers
  class Index < R '/(\d*)'
    def get(code)
      # Default to United Airlines
      code = 109 if code.blank?
      @url = "http://www.milemaven.com/offers/program/fly/#{code}/"
      content = ''

      # I could actually make this more compact by just having hpricot
      # get the URL, but I want to capture the last_modified time and
      # the charset to use in my feed
      open(@url, 'User-Agent' => 'Camping Milemaven Atom Feed Scraper') do |f|
        f.each_line { |line| content << line }
        @charset = f.charset
        @updated = f.last_modified || Time.now
      end

      doc = Hpricot(content)

      rows = doc.search("table.listData tr")
      @title = doc.at('td.content h3').children[0].to_s

      # The first couple of rows are headers, so skip them
      @deals = rows[2..rows.size-1].collect do |row|
        title = row.search("td")[0]['title']
        url = row.search('td')[0].children[1]['href']

        # Rows missing a title or link yield nil and are dropped by compact
        { :title => title, :url => url } unless title.nil? or url.nil?
      end.compact

      render :index
    end
  end
end

And then the view is a quick wrapper around Builder to generate an Atom feed. The skeleton looks like:

module Milemaven::Views

  def index
    @headers['Content-Type'] = "application/atom+xml; charset=#{@charset}"
    xml = Builder::XmlMarkup.new(:target => self)
    xml.instruct!(:xml, :encoding => @charset)

    # Generate feed
  end
end
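
The elided “Generate feed” step could look something like the following. This is a minimal sketch, not the original view: it builds the XML with plain string interpolation instead of Builder so that it runs with nothing beyond the standard library. The helper names (`xml_text`, `xml_attr`, `atom_feed`), the author name, the use of each deal’s URL as its entry ID, and the sample deal are all assumptions made for illustration. Because the scrape yields no per-entry dates, every entry simply reuses the page’s timestamp.

```ruby
require 'time' # Time#xmlschema gives the RFC 3339 timestamps Atom expects

# Escape a value for use as XML element content.
def xml_text(value)
  value.to_s.encode(xml: :text)
end

# Escape a value for use as an XML attribute.
# The result already includes the surrounding double quotes.
def xml_attr(value)
  value.to_s.encode(xml: :attr)
end

# Hypothetical helper: build an Atom document from the scraped deals.
# deals is an array of { :title => ..., :url => ... } hashes, as
# collected by the controller.
def atom_feed(title:, url:, updated:, deals:, charset: 'utf-8')
  stamp = updated.getutc.xmlschema

  entries = deals.map do |deal|
    <<~ENTRY
      <entry>
        <title>#{xml_text(deal[:title])}</title>
        <link rel="alternate" href=#{xml_attr(deal[:url])}/>
        <id>#{xml_text(deal[:url])}</id>
        <updated>#{stamp}</updated>
      </entry>
    ENTRY
  end

  <<~FEED
    <?xml version="1.0" encoding="#{charset}"?>
    <feed xmlns="http://www.w3.org/2005/Atom">
      <title>#{xml_text(title)}</title>
      <link rel="alternate" href=#{xml_attr(url)}/>
      <id>#{xml_text(url)}</id>
      <updated>#{stamp}</updated>
      <author><name>Milemaven scraper</name></author>
    #{entries.join}</feed>
  FEED
end

# Sample data, made up for the example
feed = atom_feed(
  title: 'Sample airline offers',
  url: 'http://www.milemaven.com/offers/program/fly/109/',
  updated: Time.utc(2007, 2, 26, 19, 34, 0),
  deals: [{ :title => 'Double miles & more',
            :url => 'http://example.com/deal?id=1&ref=feed' }]
)
puts feed
```

With Builder the same structure would be expressed as nested method calls on the `xml` object, and Builder would take care of the escaping that `xml_text` and `xml_attr` do by hand here.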

The main limitation of feeds generated this way is that it’s very hard to get real published/updated dates for the entries, particularly as the server doesn’t always return the timestamp for the pages correctly.

I’ve actually been playing with making this all a bit more re-usable by setting up a DSL to lay out the scraping rules, meaning that both controller and view become usable for most pages. But it needs a bit more work, so I’ll save it for another (potential) post.

UPDATE (Mar 28th): Boaz Shmueli from Milemaven contacted me to let me know there are some feeds available from that site, such as this one for the route from IAD to TPE.


XML_Feed_Parser stable

26 December 2006 (4:25 pm)

By James Stewart
Filed under: Announcements

I’ve just released the first stable version of my XML_Feed_Parser library through PEAR. I’ve been working on the code for about 18 months now. It’s nearly a year since the first beta and some time since I last had to make any significant changes, so it seemed like it was time to open it up to a wider audience.

You can get it through the usual channels, either downloading it directly or using the PEAR installer, and PEAR provides a bug tracker should you find any problems or have ideas for enhancements. My time to implement enhancements will be very limited, so I’m also very interested in hearing from anyone who’d like to sign on as a developer to help keep the package moving forwards.


XML_Feed_Parser RC2

9 November 2006 (10:24 am)

By James Stewart
Filed under: Announcements

I just rolled and released a second release candidate of XML_Feed_Parser. Mohanaraj Gopala Krishnan had pointed out to me that the parsing of Atom text constructs wasn’t quite as flexible as the RFC allows for, and was kind enough to supply an initial patch to improve support.

Since HTML_Safe isn’t stable yet, my plan is to put clear security advice in the manual and then, if there aren’t any new issues with this release candidate, to release it as a stable version. Once HTML_Safe stabilises I’ll revise the manual, work in support for it, and release a new version.


Collage Mk. 2: Now With Separation

20 August 2006 (4:24 pm)

By James Stewart
Filed under: Announcements

Last year I posted a few times about the aggregation code I wrote to allow Greenbelt to collect festival-related content scattered around the web and republish it. What I may not have gone into was how frustrating that code tended to be to work with, written in a rush before the festival and heavily patched while on site.

This year, with longer to prepare, I decided to throw that one away and start again. I chose Python as the language again, partly because I wanted to use some Python libraries and partly because it seemed time to get some more Python practice in. I also decided that rather than have the parsers for each service (currently Technorati, del.icio.us, Flickr, PubSub, and Ma.gnolia) each update the database, it was time for some abstraction and layering.

This time around I’ve written independent extraction classes for each of the services I want to use, with each returning its data as Atom entries. That Atom is then fed into a ‘reasoner’ that checks whether we’ve already seen the entry, and creates or updates our store accordingly. Using Atom as the intermediary made sense, as much of the data is already sourced in Atom (or forms that map closely to it) and the requirements for a unique ID and updated time make updates simple to manage.

It’s also ready-serialized, and so nice and portable. To test that each component is working, I just have to inspect the Atom produced, which is easy to do visually or programmatically. If I wanted to spread the code across servers, it’d be trivial to do so using WSGI and the Atom Publishing Protocol.

With the Universal Feed Parser for parsing, and SQLObject for database abstraction, there’s a lot I don’t have to worry about.
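
The reasoner step described above can be sketched roughly as follows. The real code is Python on top of SQLObject and isn’t shown here, so this is only an illustration of the create-or-update decision, written in Ruby to match the other code on this page. The `Entry` struct, the `Reasoner` class, its `absorb` method, and the in-memory hash standing in for the database are all hypothetical names and choices.

```ruby
# One Atom entry, reduced to the fields the reasoner needs.
Entry = Struct.new(:id, :updated, :title)

# Hypothetical reasoner: decides whether an incoming entry is new,
# a newer revision of one we already hold, or nothing we need to act on.
class Reasoner
  attr_reader :store

  def initialize
    @store = {} # atom:id => Entry; stands in for the real database
  end

  # Returns :created, :updated, or :unchanged.
  def absorb(entry)
    known = @store[entry.id]
    if known.nil?
      @store[entry.id] = entry
      :created
    elsif entry.updated > known.updated
      @store[entry.id] = entry
      :updated
    else
      :unchanged
    end
  end
end

# Sample entries, made up for the example
first  = Entry.new('tag:example.org,2006:photo-1',
                   Time.utc(2006, 8, 20, 12, 0, 0), 'Main stage')
edited = Entry.new('tag:example.org,2006:photo-1',
                   Time.utc(2006, 8, 20, 13, 0, 0), 'Main stage at dusk')

reasoner = Reasoner.new
results = [first, first, edited].map { |entry| reasoner.absorb(entry) }
p results # => [:created, :unchanged, :updated]
```

Because every extractor hands over the same two fields, the reasoner never needs to know which service an entry came from.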

The festival is not yet upon us, so the code has yet to be battle-tested. With over 1200 photos posted on Flickr last year and a much bigger push this year, we’re expecting a lot of content. It’s good to know we have a cleanly separated, maintainable code base this time around. If it works as well as it should, I’ll try to publish the code somewhere.
