Posts tagged screen scraping

Tracking Heathrow with twitter

It's 11.23 on Thursday

A few months back, while we were discussing the number of talking objects appearing on twitter, Jenny pointed out to me that all Heathrow airport arrivals and departures data is online. That set my mind racing: if you know all the flights leaving that currently controversial airport, there are all manner of things you could begin to do, such as working out miles travelled and carbon emitted, or spotting delays. But at the time it all came down to a quick note in Things to set aside time some day to explore.

That day arrived this week. The data turned out to be pretty simple to scrape with a quick wrapper around hpricot. From there it goes into an SQLite database via datamapper, which gives me a little abstraction and a place to put a variety of methods that keep my code simpler. After that it was a small matter of employing John Nunemaker’s twitter gem to set up regular tweets letting followers know how many flights in and out of Heathrow there have been lately.

The result is a rather pleasing hourly summary that adds a little rhythm and background awareness to my day. You can follow it at http://twitter.com/heathrowtower.
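To give a feel for the counting step, here is a minimal sketch in plain Ruby. It is not the bot's actual code: the `Flight` struct, its field names, the flight codes, the date and the message wording are all assumptions made up for illustration, and the real bot reads its flights from the database rather than an array.

```ruby
# Hypothetical record shape; the bot's real datamapper schema is not shown in this post.
Flight = Struct.new(:code, :direction, :time)

# Count arrivals and departures in the hour leading up to `now`
# and build a one-line summary short enough for a tweet.
def hourly_summary(flights, now)
  window_start = now - 3600
  recent = flights.select { |f| f.time > window_start && f.time <= now }
  arrivals = recent.count { |f| f.direction == :arrival }
  departures = recent.count { |f| f.direction == :departure }
  "Heathrow in the last hour. Arrivals: #{arrivals}. Departures: #{departures}."
end

# Placeholder data: the codes and the date are arbitrary, not real flights.
now = Time.utc(2009, 1, 15, 11, 0, 0)
flights = [
  Flight.new("XX100", :arrival, now - 600),
  Flight.new("XX101", :arrival, now - 2400),
  Flight.new("XX200", :departure, now - 1800),
  Flight.new("XX300", :arrival, now - 7200) # more than an hour old, so ignored
]

puts hourly_summary(flights, now)
```

Passing `now` in explicitly, rather than calling `Time.now` inside the method, keeps the summary easy to check against a fixed set of flights.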

Perhaps the biggest frustration with the data is that all destinations/origins are given as city names. City names are hardly unique, and even if they were, a given city may have several airports connecting with Heathrow, so it's a bit trickier to do some of the more sophisticated calculations. My hope is that the flight codes (which are given) can soon be transformed into a list of airport codes, which can then open up a route to more useful and interesting data. (If anyone knows of an existing database that does that mapping, please let me know!)

I’m looking forward to that, but I’m also anticipating the ambient awareness that having the bot running will create. Will the hourly ritual of seeing a sentence or two about Heathrow activity reveal any patterns? If it does, maybe I’ll update the code to make more of those. We’ll see.

For now, please do follow the tower on twitter, tell people about it, send it messages if you spot anything interesting, and feel free to take a look at the code over on github.

Corrected bus routes on Rails

In the process of building my bus route app, I realised that half the data for bus stops is missing. While the site’s developers have done a good job of providing clear data on half the stops, if you want to see stops going in the other direction, you have to use a drop-down box that triggers an AJAX request and repopulates the table.

A little digging shows that the call is to:

http://www.ridetherapid.org/includes/ajax_return.php?mode=routestops&direction={direction}&routeID={routeid}

which returns an HTML table with the relevant stop data. So in a sense, there are permalinks for each set of stops, but it’d be nice if they were more clearly advertised, particularly since the site as is won’t work for those without javascript switched on.
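As a rough sketch, the URL for any set of stops can be assembled from that template with the standard library alone. The helper name `stops_url` is hypothetical; the endpoint and parameter names are copied from the call above.

```ruby
require 'uri'

# Endpoint as observed in the site's AJAX call.
STOPS_ENDPOINT = "http://www.ridetherapid.org/includes/ajax_return.php"

# Hypothetical helper: builds the stop-list URL for a direction letter
# and whatever value the site passes as routeID.
def stops_url(direction, route_id)
  query = URI.encode_www_form([
    ["mode", "routestops"],
    ["direction", direction],
    ["routeID", route_id.to_s]
  ])
  "#{STOPS_ENDPOINT}?#{query}"
end

puts stops_url("W", 13)
```

Using an array of pairs keeps the parameters in the same order as the original request, which makes the generated URL easy to compare by eye.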

The other gotcha is that it seems the internal IDs for some routes don’t match their route numbers. If you try and retrieve the westbound stops for Route #14 the call is actually to:

http://www.ridetherapid.org/includes/ajax_return.php?mode=routestops&direction=W&routeID=13

and when you make requests for route 13, the routeID passed is 14. The same disparity continues, suggesting that they’ve (sensibly) added primary keys to their database other than the route number. It turns out that ID is embedded in the markup within a comment showing the direction and the ID. For Route #50 that is:

<div id="stopListWrapper">
<!-- E -> 19 -->
<div id="stopList">
...
</div>
</div>

Since the document is already being parsed using hpricot, we can get that with:

internal_route_id = doc.at("div#stopListWrapper").children[1].to_s.match(/\-\> (\d+) \-\-\>/)[1]

(get the div, note that the comment is the second child, and get the data with a regular expression)
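For anyone without hpricot to hand, the same idea can be sketched against the raw markup using only a regular expression. This is an illustration rather than the scraper's code: the helper name `route_comment` is made up, and it assumes the comment always has the `direction -> id` shape shown above.

```ruby
# Markup as quoted above, with the stop list itself left out.
markup = <<-HTML
<div id="stopListWrapper">
<!-- E -> 19 -->
<div id="stopList">
</div>
</div>
HTML

# Hypothetical helper: pull the direction letter and the internal ID
# out of the comment, or return nil if no such comment is present.
def route_comment(html)
  match = html.match(/<!--\s*([A-Z])\s*->\s*(\d+)\s*-->/)
  match && { :direction => match[1], :internal_id => match[2].to_i }
end

p route_comment(markup)
```

Matching on the raw string is more brittle than walking the parsed document, but it does not depend on the comment being the second child of the wrapper.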

I’ve updated my scraper and the service to grab data based on the correct IDs. The HTML views will follow suit shortly.

Bus routes on Rails

Following on from my previous entry about scraping bus route data from The Rapid’s website, and to begin to demonstrate the possibilities it opens up, I’ve set up a simple web service to provide route and stop data. It’s based on the new REST style from Edge Rails, and routes are scoped by city to allow for future expansion. To get data on Route 1, GET:

http://projects.jystewart.net/buses/cities/1/routes/1

To get a list of the stops within 1.5 miles of a given longitude and latitude, GET:

http://projects.jystewart.net/buses/cities/1/stops/?longitude=X&latitude=Y&distance=1.5
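One common way to answer a "within N miles" query is the haversine great-circle formula. The sketch below shows that approach in plain Ruby; it is an assumption about how such a filter could work, not a description of how the service computes distance. The method names, the hash keys and the sample coordinates are all invented for illustration.

```ruby
EARTH_RADIUS_MILES = 3958.8 # mean radius of the Earth

def to_radians(degrees)
  degrees * Math::PI / 180
end

# Great-circle distance in miles between two points, using the haversine formula.
def distance_miles(lat1, lon1, lat2, lon2)
  dlat = to_radians(lat2 - lat1)
  dlon = to_radians(lon2 - lon1)
  a = Math.sin(dlat / 2)**2 +
      Math.cos(to_radians(lat1)) * Math.cos(to_radians(lat2)) * Math.sin(dlon / 2)**2
  2 * EARTH_RADIUS_MILES * Math.asin(Math.sqrt(a))
end

# Keep only the stops that lie within `distance` miles of the given point.
def stops_within(stops, latitude, longitude, distance)
  stops.select do |stop|
    distance_miles(latitude, longitude, stop[:latitude], stop[:longitude]) <= distance
  end
end

# Placeholder stops with made-up coordinates.
stops = [
  { :name => "Near stop", :latitude => 43.01, :longitude => -85.0 },
  { :name => "Far stop",  :latitude => 43.10, :longitude => -85.0 }
]

nearby = stops_within(stops, 43.0, -85.0, 1.5)
puts nearby.map { |stop| stop[:name] }.join(", ")
```

With many stops, a database would normally narrow the candidates with a bounding box first and apply the exact formula only to what remains.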

Using Edge Rails, setting up the application was remarkably simple. Three models, three controllers, appropriate use of respond_to blocks, and the right entries in config/routes.rb:

map.resources :cities do |cities| 
  cities.resources :stops
  cities.resources :routes
end

This was the first time I’ve used nested routes so it took a few minutes to work out the correct syntax for the link_to calls. When using nested routes like those above, you must declare first the ID of the city and then the ID of the stop or route, eg:

<%= link_to 'My Route', route_url(city, route) %>

I’m not making any guarantees about the long term availability of the service, but if anyone wants to make use of it, let me know and we can probably work something out. I’ll probably be making use of it myself.

Scraping Grand Rapids bus routes

The Rapid, the bus service for Grand Rapids and surrounding areas, recently redesigned their website. The redesign was long overdue and the result certainly looks a lot cleaner, if still far from inspiring. They’ve added a flash-based map showing their routes (though it could do with being a little larger on the page) and added PDF maps of each route (eg. this one for Route 6). Unfortunately as yet there’s no tool for working out routes, but that’s not a big surprise.

My favourite features, however, are not any of those mentioned above but the fact that each route now has a clean URL (eg. http://www.ridetherapid.org/ride/routes/6/) and a link to google maps for each stop, thereby exposing the coordinates of all the stops. With those two components in place, it becomes very easy to pull out the route data and begin to apply it to other uses. A ruby script (using _why’s excellent hpricot) to do just that would be:

#!/usr/bin/env ruby
require 'rubygems'
require 'open-uri'
require 'hpricot'
 
routes = (1..15).to_a.concat [24,28,37,44,49,50,51]
 
routes.each do |route|
  begin
    route_uri = "http://www.ridetherapid.org/ride/routes/#{route}/stops/"
    doc = Hpricot(open(route_uri))
    title = doc.at("h1").children[0].to_s.strip
    puts title
    # Each stop sits in its own table row.
    stop_list = doc.search("div#stopList table tr")
    # Start at index 1: the first row is skipped (presumably the table header).
    (1...stop_list.size).each do |index|
      link = stop_list[index].at("a")
      uri = link.attributes['href']
      name = link.children[0].to_s
      # The google maps link carries the coordinates as q=<latitude>,+<longitude>&
      coords = uri.match(/q=(\d+\.\d+),\+(.*?)&/)
      if coords
        latitude = coords[1]
        longitude = coords[2]
        puts "#{name} on route #{route} is at #{latitude}, #{longitude}"
      end
    end
  rescue => err
    # Report the failure and carry on with the next route.
    puts "Problem retrieving #{route_uri}: #{err.message}"
  end
end
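The coordinate extraction is the fiddliest part of the script, so here it is in isolation as a small sketch. The sample href is hypothetical, written in the shape the script's regular expression expects, and the real links on the site may carry other parameters. The pattern is a slightly stricter variant of the one in the script: it allows a minus sign and requires both values to be decimal numbers.

```ruby
# Hypothetical link in the q=<latitude>,+<longitude>& shape; not copied from the site.
href = "http://maps.google.com/maps?q=42.9634,+-85.6681&z=16"

# Hypothetical helper: returns [latitude, longitude] as floats,
# or nil when the link does not contain a coordinate pair.
def coordinates_from(href)
  match = href.match(/q=(-?\d+\.\d+),\+(-?\d+\.\d+)&/)
  match && [match[1].to_f, match[2].to_f]
end

p coordinates_from(href)
```

Returning floats rather than strings means the values can go straight into distance calculations or a numeric database column.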

With this data available, it immediately becomes possible for local people and organizations to make use of it in a variety of ways: businesses could easily show the nearest bus stops to their locations, listing services could help visitors plan their routes, and those of us who aren’t fans of the flash-map could use other services to build alternatives.