Posts tagged hpricot
Tracking Heathrow with twitter
Jan 29th
A few months back—while we were discussing the number of talking objects appearing on twitter—Jenny pointed out to me that all Heathrow airport arrivals and departures data is online. That set my mind racing, as if you know all the flights leaving that currently controversial airport, there are all manner of things you could begin to do. Working out miles travelled and carbon emitted, spotting delays, and so on. But at the time it all came down to a quick note in Things to some day set aside time to explore.
That day arrived this week. The data turned out to be pretty simple to scrape, with a quick wrapper around hpricot, and to throw into an SQLite database using datamapper to give me a little abstraction and a place to throw a variety of methods to make my code simpler. And then it was a small matter of employing John Nunemaker’s twitter gem to set up regular tweets letting followers in on how many flights in and out of Heathrow there have been lately.
The result is a rather pleasing hourly summary that adds a little rhythm and background awareness to my day. You can follow it at http://twitter.com/heathrowtower.
Perhaps the biggest frustration with the data is that all destinations/origins are given as city names. Given that city names are hardly unique, and that even if they were, a given city may have several airports connecting with Heathrow, that makes it a bit trickier to do some of the more sophisticated calculations. My hope is that the flight codes (which are given) can soon be transformed into a list of airport codes, which can then open up a route to more useful and interesting data. (If anyone knows of an existing database that does that mapping, please let me know!)
I’m looking forward to that, but I’m also anticipating the ambient awareness that having the bot running will create. Will the hourly ritual of seeing a sentence or two about Heathrow activity reveal any patterns? If it does, maybe I’ll update the code to make more of them. We’ll see.
For now, please do follow the tower on twitter, tell people about it, send it messages if you spot anything interesting, and feel free to take a look at the code over on github.
Is it time to upgrade drupal yet?
Mar 27th
Working with a number of non-profits, I frequently find myself tasked with extending or upgrading drupal. Each new version of drupal has been a significant step forward and I’m usually keen to get up to date, but there’s the small matter of the suite of modules most sites use, which need to catch up with changing APIs. With the release of Drupal 6 a few weeks back I found myself wanting a tool that would help me check if my chosen modules were ready for the upgrade yet.
The quick solution I came up with was a screen-scraper that takes a YAML file listing the relevant modules and their URLs and checks whether a Drupal 6 version has been released yet. I wrote it in ruby, because after all that PHP work it’s nice to slip back into a language that feels so comfortable (and hpricot is delightful for quick scraping solutions).
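To give a flavour of the approach, here is a minimal sketch rather than the script itself. It assumes a YAML layout of module name to project URL (my guess at a sensible shape, not necessarily what the sample file uses), and it swaps hpricot for a plain regular expression so it runs with nothing but the standard library. The `release_for?` helper and the canned page snippets are hypothetical, and the check relies on Drupal's contributed-module version strings looking like `6.x-1.0`.

```ruby
require 'yaml'

# Hypothetical YAML layout: module name => project page URL.
SAMPLE_YAML = <<~YAML
  views: http://drupal.org/project/views
  cck: http://drupal.org/project/cck
YAML

# True if the page text mentions a release for the given Drupal major
# version, e.g. "6.x-1.0". Betas such as "6.x-1.0-beta1" also count.
def release_for?(html, major = 6)
  !!(html =~ /\b#{major}\.x-\d+\.\d+/)
end

modules = YAML.safe_load(SAMPLE_YAML)

# Canned snippets stand in for the fetched pages so the sketch runs offline.
pages = {
  'views' => '<td>views 6.x-2.0</td><td>views 5.x-1.6</td>',
  'cck'   => '<td>cck 5.x-1.6</td>'
}

modules.each_key do |name|
  status = release_for?(pages[name]) ? 'ready' : 'not yet'
  puts "#{name}: #{status}"
end
```

In a real run the canned snippets would be replaced by the HTML fetched from each URL in the YAML file.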
In case it’s helpful for anyone I’ve popped the ruby code and a sample yaml file up as pasties. Currently it’s invoked from the command line (providing you have ruby and hpricot set up) with:
./drupal-modules.rb my-file.yaml
A couple of nice enhancements would be to either grab the list of modules from an existing drupal database, or to set this up as a little web service where people input the modules they use and can check a web page or receive an email when there’s a status update. It could even lead into some discussion of where certain older modules should be considered redundant and/or replaced with newer options. Time’s not likely to let me build that, but I’d love to hear if someone else does.
A little scripting to help with HTML email – bringing styles inline
Dec 4th
As anyone keeping an eye on my del.icio.us feed may have noticed, quite a few links have appeared to information about the preparation of HTML email. It’s a nasty business, as a quick glance at the website of the email standards project will tell you. But sadly, nasty as it may be, sometimes it has to be done.
Even if the email I send out is going to have CSS scattered inline, for building the templates I’d much rather be able to focus on writing the structure of the document and leave worrying about my CSS for another time, and another file. That wouldn’t get me around the nastiness of having to use tables for anything but the simplest of layouts, but it still feels right to keep the separation for as long as possible.
I had a quick look for a tool that would take a stylesheet and an HTML document, and embed the rules inline, but didn’t find one. So I turned to ruby. In theory it should be very easy to build something like this, because of hpricot’s support for CSS selectors. If we had the CSS stored in a hash all it would take would be something like:
require 'hpricot'

doc = Hpricot(open('my_page.html'))
css_as_hash.each do |selector, rule|
  (doc/selector).set('style', rule)
end
puts doc
Obviously that wouldn’t play nicely if there were already any styles inline, but for the purposes of this project I assumed there wouldn’t be.
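If existing inline styles did need to survive, one option would be to merge the two strings before setting the attribute. This is only a sketch of that idea: `merge_styles` is a hypothetical helper, not part of the script, and it works on plain strings so it makes no assumptions about hpricot. The stylesheet rule goes first and the existing inline declarations last, so that the inline values win when a property appears twice.

```ruby
# Combine an element's existing style attribute with a stylesheet rule.
# Existing declarations come last so they take precedence when a
# property is repeated. Either argument may be nil or empty.
def merge_styles(existing, rule)
  [rule, existing].map { |s| s.to_s.strip.chomp(';') }
                  .reject(&:empty?)
                  .join('; ')
end

puts merge_styles('color: red;', 'margin: 0; color: blue;')
puts merge_styles(nil, 'margin: 0;')
```

The loop would then read each element's current style attribute, pass it through the helper along with the rule, and set the result.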
I had a quick look at the cssparser rubygem but found that the sample code threw ‘method not found’ errors so I decided to quickly roll my own class that would take a path to a CSS file, and convert it to a hash. All it took was a few minutes’ work and the result was:
# This class takes a CSS file and provides a method to
# parse it into a hash. Usage is:
#
#   parser = SimpleCSSParser.new('/path/to/myfile.css')
#   hash_of_rules = parser.to_hash
#
# For more advanced CSS handling check out the cssparser gem
# http://code.dunae.ca/css_parser/
class SimpleCSSParser
  # Receive and open the CSS file, storing its contents
  def initialize(path_to_file)
    @css = File.read(path_to_file)
  end

  # Convert the CSS into a hash, where the keys are the selectors
  # and the values are the rules
  def to_hash
    @to_hash ||= separate_rules.inject({}) do |collection, rule|
      identifiers, rule = prepare_selectors_and_rule(rule)
      identifiers.each do |identifier|
        collection[identifier] ||= ''
        collection[identifier] += rule
      end
      collection
    end
  end

  private

  # Split on the closing brace, skipping the blank chunk that is
  # left over after the final rule
  def separate_rules
    @css.split('}').reject { |chunk| chunk.strip.empty? }
  end

  # Strip comments (including multi-line ones) and extraneous
  # white space from our CSS rules
  def clean_up_rule(css_rule)
    css_rule = css_rule.gsub(/\/\*.*?\*\//m, '')
    css_rule.gsub(/\n|\s{2,}/, '')
  end

  # Break apart our selector(s) and rule. We return an array
  # of selectors to allow for situations where multiple selectors
  # are specified (comma separated) for a single rule
  def prepare_selectors_and_rule(rule)
    parts = rule.split('{')
    selectors = parts[0].split(',').map(&:strip)
    return selectors, clean_up_rule(parts[1].to_s)
  end
end
With that in place, I can now call:
require 'hpricot'

doc = Hpricot(open('my_file.html'))
parser = SimpleCSSParser.new('my_file.css')
parser.to_hash.each do |selector, rule|
  (doc/selector).set('style', rule)
end
puts doc
and have the result I wanted all along. It’s rather brittle because of the way it splits the rules up, and it won’t pull in @import’ed files, handle multiple CSS files, or do anything to honour the proper inheritance rules, but for my purposes that’s okay. I bundled it all up in a file that can be called from the command line. You can find that in this pastie.
A nice (and really quite simple) addition would be to take Campaign Monitor’s Guide to CSS Support in Email, parse it and spit out warnings about which email clients will have issues with which CSS rules. If I get round to implementing that I’ll blog about it here. If you get there before me, do post a comment and let me know.
UPDATE (5th Dec ‘07): I’ve posted a follow-up looking at some other Ruby CSS parsers.
Quick and Easy Feeds with Camping
Feb 26th
Rails is great for many things, but for very small apps, it can definitely be overkill. That’s where why the lucky stiff’s Camping micro-framework comes in. Where Rails gets you started with a clearly defined structure and generally presumes you’re going to want to use a database, Camping makes no such assumptions and just provides a few nice hooks for micro apps.
I got started using Camping a couple of months ago. With a lot of travel coming up, I’m eager to keep up to date with special deals on flights and frequent flyer miles, and stumbled across milemaven.com which seemed a great source of that information. But it doesn’t provide feeds and I have no desire to visit the site every day, so I decided to dust off hpricot and combine it with Camping to scrape the site and deliver the contents to my news reader.
If I wanted to be strict about MVC, I’d probably do the actual scraping in the model, since for this app that’s the data store/source. But in the interests of simplicity I did the parsing in the controller, and even so my entire controller comes out at only 29 lines. A version with a few extra comments looks like:
module Milemaven::Controllers
  class Index < R '/(\d*)'
    def get code
      # Default to United Airlines
      code = 109 if code.blank?
      @url = "http://www.milemaven.com/offers/program/fly/#{code}/"
      content = ''
      # I could actually make this more compact by just having
      # hpricot get the URL, but I want to capture the last_modified time and
      # the charset to use in my feed
      open(@url, 'User-Agent' => 'Camping Milemaven Atom Feed Scraper') do |f|
        f.each_line { |line| content << line }
        @charset = f.charset
        @updated = f.last_modified || Time.now
      end
      doc = Hpricot(content)
      rows = doc.search("table.listData tr")
      @title = doc.at('td.content h3').children[0].to_s
      # The first couple of rows are headers, so skip them
      @deals = rows[2..-1].collect do |row|
        title = row.search("td")[0]['title']
        url = row.search('td')[0].children[1]['href']
        { :title => title, :url => url } unless title.nil? or url.nil?
      end.compact
      render :index
    end
  end
end
And then the view is a quick wrapper around Builder to generate an atom feed. The skeleton looks like:
module Milemaven::Views
  def index
    @headers['Content-Type'] = "application/atom+xml; charset=#{@charset}"
    xml = Builder::XmlMarkup.new(:target => self)
    xml.instruct!(:xml, :encoding => @charset)
    # Generate feed
  end
end
The main limitation of the feeds generated this way is that it’s very hard to get real published/updated dates for the entries, particularly as the server doesn’t always return the timestamp for the pages correctly.
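For what it’s worth, the fallback I use for the feed-level date can be sketched in a couple of lines. Atom expects dates in RFC 3339 form, which ruby’s standard `Time#xmlschema` produces. The `feed_updated` helper name is made up for this illustration, and this is a simplified version of the idea rather than the code from the app.

```ruby
require 'time'

# Pick the timestamp to publish: the server's Last-Modified value when
# it is present, otherwise the fallback (by default, the time of the
# scrape). Returns an RFC 3339 string in UTC.
def feed_updated(last_modified, fallback = Time.now)
  (last_modified || fallback).utc.xmlschema
end

puts feed_updated(nil)
puts feed_updated(Time.utc(2008, 2, 26, 12, 0, 0))
```

It does nothing for per-entry dates, of course, which is where the real problem lies.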
I’ve actually been playing with making this all a bit more re-usable by setting up a DSL to lay out the scraping rules, meaning that both controller and view become usable for most pages. But it needs a bit more work, so I’ll save it for another (potential) post.
UPDATE (Mar 28th): Boaz Shmueli from Milemaven contacted me to let me know there are some feeds available from that site, such as this one for the route from IAD to TPE.
Corrected bus routes on Rails
Sep 26th
In the process of building my bus route app, I realised that half the data for bus stops is missing. While the site’s developers have done a good job of providing clear data on half the stops, if you want to see stops going in the other direction, you have to use a drop-down box that triggers an AJAX request and repopulates the table.
A little digging shows that the call is to:
http://www.ridetherapid.org/includes/ajax_return.php?mode=routestops&direction={direction}&routeID={routeid}
which returns an HTML table with the relevant stop data. So in a sense, there are permalinks for each set of stops, but it’d be nice if they were more clearly advertised, particularly since the site as is won’t work for those without javascript switched on.
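Building those URLs in code is straightforward. The following is a small sketch using only the standard library; the `stops_url` helper and `STOPS_ENDPOINT` constant are names I’ve made up for illustration, and the route ID is whatever value the site itself passes in the AJAX call.

```ruby
require 'uri'

STOPS_ENDPOINT = 'http://www.ridetherapid.org/includes/ajax_return.php'

# Build the URL that returns the stops table for one direction of a
# route. direction is a letter such as 'W'; route_id is the ID the
# site uses in its own request.
def stops_url(direction, route_id)
  query = URI.encode_www_form(mode: 'routestops',
                              direction: direction,
                              routeID: route_id)
  "#{STOPS_ENDPOINT}?#{query}"
end

puts stops_url('W', 13)
```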
The other gotcha is that it seems the internal IDs for some routes don’t match their route numbers. If you try and retrieve the westbound stops for Route #14 the call is actually to:
http://www.ridetherapid.org/includes/ajax_return.php?mode=routestops&direction=W&routeID=13
and when you make requests for route 13, the routeID passed is 14. The same disparity continues, suggesting that they’ve (sensibly) added primary keys to their database other than the route number. It turns out that ID is embedded in the markup within a comment showing the direction and the ID. For Route #50 that is:
<div id="stopListWrapper">
  <!-- E -> 19 -->
  <div id="stopList"> ... </div>
</div>
Since the document is already being parsed using hpricot, we can get that with:
internal_route_id = doc.at("div#stopListWrapper").children[1].to_s.match(/\-\> (\d+) \-\-\>/)[1]
(get the div, note that the comment is the second child, and get the data with a regular expression)
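If the position of the comment ever shifts, an alternative is to match the comment directly in the raw HTML rather than relying on it being the second child. This is a sketch of that variation, not what the scraper does: `internal_route` is a hypothetical helper, and it assumes the comment always has the direction-arrow-number shape shown above.

```ruby
SAMPLE_HTML = <<~HTML
  <div id="stopListWrapper">
    <!-- E -> 19 -->
    <div id="stopList"> ... </div>
  </div>
HTML

# Pull the direction letter and internal route ID out of the first
# comment shaped like "<!-- E -> 19 -->". Returns nil if none is found.
def internal_route(html)
  match = html.match(/<!--\s*([A-Z]+)\s*->\s*(\d+)\s*-->/)
  return nil unless match

  { direction: match[1], id: match[2].to_i }
end

p internal_route(SAMPLE_HTML)
```

The trade-off is that it would pick up any similarly shaped comment elsewhere in the page, so the hpricot version is the safer choice while the markup stays as it is.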
I’ve updated my scraper and the service to grab data based on the correct IDs. The HTML views will follow suit shortly.