a work on process

Viewing posts tagged: hpricot

Is it time to upgrade drupal yet?

27 March 2008 (11:52 am)

By James Stewart
Filed under: Notes, Snippets

Working with a number of non-profits I frequently find myself tasked with extending or upgrading drupal. Each new version of drupal has been a significant step forward and I’m usually keen to get up to date but there’s the small matter of the suite of modules most sites use that need to catch up with changing APIs. With the release of Drupal 6 a few weeks back I found myself wanting a tool that would help me check if my chosen modules were ready for the upgrade yet.

The quick solution I came up with was a screen-scraper that takes a YAML file listing the relevant modules and their URLs and checks whether a Drupal 6 version has been released yet. I wrote it in ruby, because after all that PHP work it’s nice to slip back into a language that feels so comfortable (and hpricot is delightful for quick scraping solutions).

In case it’s helpful for anyone I’ve popped the ruby code and a sample yaml file up as pasties. Currently it’s invoked from the command line (providing you have ruby and hpricot set up) with:

./drupal-modules.rb my-file.yaml
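The real script is in the pastie, so what follows is only a rough sketch of the idea, not that code. It assumes the YAML file maps module names to project page URLs, and that a Drupal 6 release shows up in the page as a version string starting with `6.x-`. The helper names are made up, and the fetch is stubbed out so the example runs offline.

```ruby
require 'yaml'

# Hypothetical YAML layout: module name => project page URL.
SAMPLE_YAML = <<~YAML
  views: http://drupal.org/project/views
  cck: http://drupal.org/project/cck
YAML

# Assumption: a Drupal 6 release appears in the page as a version
# string such as "6.x-1.0" or "6.x-2.x-dev".
def drupal6_release?(page_html)
  !!(page_html =~ /\b6\.x-\d+\.(\d+|x)/)
end

# Takes the parsed module list plus a fetcher (anything that responds
# to #call with a URL and returns HTML) and reports readiness per module.
def check_modules(modules, fetcher)
  modules.each_with_object({}) do |(name, url), report|
    report[name] = drupal6_release?(fetcher.call(url))
  end
end

modules = YAML.load(SAMPLE_YAML)

# Stand-in for a real HTTP fetch; the HTML here is invented sample data.
fake_pages = {
  'http://drupal.org/project/views' => '<td>6.x-2.0-rc1</td>',
  'http://drupal.org/project/cck'   => '<td>5.x-1.6</td>'
}
report = check_modules(modules, lambda { |url| fake_pages.fetch(url) })
report.each { |name, ready| puts "#{name}: #{ready ? 'ready' : 'not yet'}" }
```

Passing the fetcher in as a lambda keeps the version check separate from the network code, so swapping in open-uri and hpricot later only touches one place.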

A couple of nice enhancements would be to either grab the list of modules from an existing drupal database, or to set this up as a little web service where people input the modules they use and can check a web page or receive an email when there’s a status update. It could even lead into some discussion of where certain older modules should be considered redundant and/or replaced with newer options. Time’s not likely to let me build that, but I’d love to hear if someone else does.


A little scripting to help with HTML email - bringing styles inline

4 December 2007 (8:22 pm)

By James Stewart
Filed under: Snippets

As anyone keeping an eye on my del.icio.us feed may have noticed, quite a few links have appeared to information about the preparation of HTML email. It’s a nasty business, as a quick glance at the website of the email standards project will tell you. But sadly, nasty as it may be, sometimes it has to be done.

Even if the email I send out is going to have CSS scattered inline, for building the templates I’d much rather be able to focus on writing the structure of the document and leave worrying about my CSS for another time, and another file. That wouldn’t get me around the nastiness of having to use tables for anything but the simplest of layouts, but it still feels right to keep the separation for as long as possible.

I had a quick look for a tool that would take a stylesheet and an HTML document, and embed the rules inline, but didn’t find one. So I turned to ruby. In theory it should be very easy to build something like this, because of hpricot’s support for CSS selectors. If we had the CSS stored in a hash all it would take would be something like:

require 'hpricot'
doc = Hpricot(open('my_page.html'))
 
css_as_hash.each do |selector, rule|
  (doc/selector).set('style', rule)
end
 
puts doc

Obviously that wouldn’t play nicely if there were already any styles inline, but for the purposes of this project I assumed there wouldn’t be.
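If existing inline styles did need to survive, one option would be to merge declarations rather than overwrite the attribute. The helper below is a hypothetical sketch (`merge_styles` is a made-up name, not part of hpricot). It works on plain strings and lets the existing inline declarations win, which mirrors the way inline styles normally take precedence over stylesheet rules.

```ruby
# Parse "color: red; margin: 0" into a hash of property => value.
def parse_declarations(style)
  style.to_s.split(';').each_with_object({}) do |declaration, result|
    property, value = declaration.split(':', 2)
    next if value.nil?
    result[property.strip.downcase] = value.strip
  end
end

# Hypothetical helper: combine stylesheet declarations with whatever is
# already inline. Existing inline declarations override the new ones.
def merge_styles(existing_inline, from_stylesheet)
  merged = parse_declarations(from_stylesheet).merge(parse_declarations(existing_inline))
  merged.map { |property, value| "#{property}: #{value};" }.join(' ')
end

puts merge_styles('color: blue', 'color: red; margin: 0;')
```

It would be called once per matched element rather than on the whole result set. Splitting on semicolons is naive, so values that themselves contain a semicolon would trip it up.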

I had a quick look at the cssparser rubygem but found that the sample code threw ‘method not found’ errors so I decided to quickly roll my own class that would take a path to a CSS file, and convert it to a hash. All it took was a few minutes’ work and the result was:

# This class takes a CSS file and provides a method to
# parse it into a hash. Usage is:
# 
# parser = SimpleCSSParser.new('/path/to/myfile.css')
# hash_of_rules = parser.to_hash
#
# For more advanced CSS handling check out the cssparser gem
# http://code.dunae.ca/css_parser/
class SimpleCSSParser
 
  # Receive and open the CSS file, storing its contents
  def initialize(path_to_file)
    @css = open(path_to_file).read
  end
 
  # Convert the CSS into a hash, where the keys are the selectors
  # and the values are the rules
  def to_hash
    @to_hash ||= separate_rules.inject({}) do |collection, rule|
      identifiers, rule = prepare_selectors_and_rule(rule)
      identifiers.each do |identifier|
        collection[identifier] ||= ''
        collection[identifier] += rule
      end
      collection
    end
  end
 
  private
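    # Split on closing braces, dropping any whitespace-only chunk
    # left over after the final rule
    def separate_rules
      @css.split('}').reject { |chunk| chunk.strip.empty? }
    end
 
    # Strip comments (including multi-line ones) and extraneous
    # white space from our CSS rules
    def clean_up_rule(css_rule)
      css_rule = css_rule.gsub(/\/\*.+?\*\//m, '')
      css_rule.gsub(/\n|\s{2,}/, '')
    end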
 
    # Break apart our selector(s) and rule. We return an array
    # of selectors to allow for situations where multiple selectors
    # are specified (comma separated) for a single rule
    def prepare_selectors_and_rule(rule)
      parts = rule.split('{')
      selectors = parts[0].split(',').map(&:strip)
      return selectors, clean_up_rule(parts[1])
    end
end

With that in place, I can now call:

require 'hpricot'
 
doc = Hpricot(open('my_file.html'))
parser = SimpleCSSParser.new('my_file.css')
 
parser.to_hash.each do |selector, rule|
  (doc/selector).set('style', rule)
end
 
puts doc

and have the result I wanted all along. It’s rather brittle because of the way it splits the rules up, and it won’t pull in @import’ed files, handle multiple CSS files, or do anything to honour the proper inheritance rules, but for my purposes that’s okay. I bundled it all up in a file that can be called from the command line. You can find that in this pastie.
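To make the shape of that hash concrete, here is a condensed, string-based restatement of the same split-on-braces approach. It is a sketch for illustration rather than a replacement for the class; `css_to_hash` is a made-up name and the stylesheet is sample input.

```ruby
# Condensed version of the split-on-braces approach, working on a string.
def css_to_hash(css)
  css = css.gsub(%r{/\*.*?\*/}m, '') # drop comments, even multi-line ones
  css.split('}').each_with_object(Hash.new('')) do |chunk, rules|
    selectors, declarations = chunk.split('{', 2)
    next if declarations.nil? # whitespace left after the last "}"
    declarations = declarations.gsub(/\s+/, ' ').strip
    selectors.split(',').map(&:strip).each do |selector|
      rules[selector] += declarations
    end
  end
end

sample = <<~CSS
  /* headings */
  h1, h2 { color: #333; }
  p { margin: 0; }
  h1 { font-size: 20px; }
CSS

css_to_hash(sample).each { |selector, rule| puts "#{selector} => #{rule}" }
```

Rules for a selector that appears more than once are simply concatenated in source order. Anything with nested braces, such as an `@media` block, would still confuse the split, which is the brittleness mentioned above.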

A nice (and really quite simple) addition would be to take Campaign Monitor’s Guide to CSS Support in Email, parse it and spit out warnings about which email clients will have issues with which CSS rules. If I get round to implementing that I’ll blog about it here. If you get there before me, do post a comment and let me know.

UPDATE (5th Dec ‘07): I’ve posted a follow-up looking at some other Ruby CSS parsers.


Quick and Easy Feeds with Camping

26 February 2007 (7:34 pm)

By James Stewart
Filed under: Notes

Rails is great for many things, but for very small apps, it can definitely be overkill. That’s where why the lucky stiff’s Camping micro-framework comes in. Where rails gets you started with a clearly defined structure and generally presumes you’re going to want to use a database, Camping makes no such assumptions and just provides a few nice hooks for micro apps.

I got started using Camping a couple of months ago. With a lot of travel coming up, I’m eager to keep up to date with special deals on flights and frequent flyer miles, and stumbled across milemaven.com which seemed a great source of that information. But it doesn’t provide feeds and I have no desire to visit the site every day, so I decided to dust off hpricot and combine it with Camping to scrape the site and deliver the contents to my news reader.

If I wanted to be strict about MVC, I’d probably do the actual scraping in the model, since for this app that’s the data store/source. But in the interests of simplicity I did the parsing in the controller, and even so my entire controller comes out at only 29 lines. A version with a few extra comments looks like:

module Milemaven::Controllers
  class Index < R '/(\d*)'
    def get code
      # Default to United Airlines
      code = 109 if code.blank?
      @url = "http://www.milemaven.com/offers/program/fly/#{code}/"
      content = ''
 
      # I could actually make this more compact by just having
      # hpricot get the URL, but I want to capture the last_modified time and
      # the charset to use in my feed
      open(@url, 'User-Agent' => 'Camping Milemaven Atom Feed Scraper') do |f|
        f.each_line { |line| content << line }
        @charset = f.charset
        @updated = f.last_modified || Time.now
      end
 
      doc = Hpricot(content)
 
      rows = doc.search("table.listData tr")
      @title = doc.at('td.content h3').children[0].to_s
 
      # The first couple of rows are headers, so skip them
      @deals = rows[2..rows.size-1].collect do |row|
        title = row.search("td")[0]['title']
        url = row.search('td')[0].children[1]['href']
 
        { :title => title, :url => url  } unless title.nil? or url.nil?
      end.compact
 
      render :index
    end
  end
end

And then the view is a quick wrapper around Builder to generate an atom feed. The skeleton looks like:

module Milemaven::Views
 
  def index
    @headers['Content-Type'] = "application/atom+xml; charset=#{@charset}"
    xml = Builder::XmlMarkup.new(:target => self)
    xml.instruct!(:xml, :encoding => @charset)
 
    # Generate feed
  end
end
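For a sense of what the elided feed generation has to produce, here is a gem-free sketch that builds the Atom document by plain string interpolation instead of Builder. The method name `atom_feed`, the hard-coded UTF-8 encoding and the sample deal are all assumptions for illustration. It also reuses the single feed timestamp for every entry, since no per-entry dates are available.

```ruby
require 'time'

XML_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }

# Minimal XML escaping so the sketch needs no gems.
def xml_escape(text)
  text.to_s.gsub(/[&<>"]/, XML_ESCAPES)
end

# Build an Atom document from the values the controller sets up:
# a feed title, the scraped URL, an updated time and the list of deals.
def atom_feed(title, url, updated, deals)
  stamp = updated.getutc.iso8601
  entries = deals.map do |deal|
    <<~ENTRY
      <entry>
        <title>#{xml_escape(deal[:title])}</title>
        <link href="#{xml_escape(deal[:url])}"/>
        <id>#{xml_escape(deal[:url])}</id>
        <updated>#{stamp}</updated>
      </entry>
    ENTRY
  end

  <<~FEED
    <?xml version="1.0" encoding="UTF-8"?>
    <feed xmlns="http://www.w3.org/2005/Atom">
      <title>#{xml_escape(title)}</title>
      <id>#{xml_escape(url)}</id>
      <updated>#{stamp}</updated>
    #{entries.join}</feed>
  FEED
end

# Invented sample data standing in for @title, @url, @updated and @deals.
feed = atom_feed('Sample offers',
                 'http://www.milemaven.com/offers/program/fly/109/',
                 Time.utc(2007, 2, 26, 19, 34),
                 [{ :title => 'Bonus miles & more', :url => 'http://example.com/deal?id=1&x=2' }])
puts feed
```

A feed that validates would also need an author element, which is left out here because the scraped page gives nothing to fill it with.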

The main limitation of the feeds generated this way is that it’s very hard to get real published/updated dates for the entries, particularly as the server doesn’t always return the timestamp for the pages correctly.

I’ve actually been playing with making this all a bit more re-usable by setting up a DSL to lay out the scraping rules, meaning that both controller and view become usable for most pages. But it needs a bit more work, so I’ll save it for another (potential) post.

UPDATE (Mar 28th): Boaz Shmueli from Milemaven contacted me to let me know there are some feeds available from that site, such as this one for the route from IAD to TPE.


Corrected bus routes on Rails

26 September 2006 (8:08 am)

By James Stewart
Filed under: Commentary

In the process of building my bus route app, I realised that half the data for bus stops is missing. While the site’s developers have done a good job of providing clear data on half the stops, if you want to see stops going in the other direction, you have to use a drop-down box that triggers an AJAX request and repopulates the table.

A little digging shows that the call is to:

http://www.ridetherapid.org/includes/ajax_return.php?mode=routestops&direction={direction}&routeID={routeid}

which returns an HTML table with the relevant stop data. So in a sense, there are permalinks for each set of stops, but it’d be nice if they were more clearly advertised, particularly since the site as is won’t work for those without javascript switched on.
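Building those URLs is then just a matter of filling in two parameters. A small sketch, with a made-up helper name; the endpoint and parameter names are the ones in the URL above:

```ruby
require 'uri'

AJAX_ENDPOINT = 'http://www.ridetherapid.org/includes/ajax_return.php'

# direction is a compass letter such as "W"; route_id is whatever ID
# the site expects for the route.
def stops_url(direction, route_id)
  query = URI.encode_www_form([['mode', 'routestops'],
                               ['direction', direction],
                               ['routeID', route_id]])
  "#{AJAX_ENDPOINT}?#{query}"
end

puts stops_url('W', 13)
```

Passing the pairs as an array keeps the parameters in the same order the site’s own requests use.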

The other gotcha is that it seems the internal IDs for some routes don’t match their route numbers. If you try and retrieve the westbound stops for Route #14 the call is actually to:

http://www.ridetherapid.org/includes/ajax_return.php?mode=routestops&direction=W&routeID=13

and when you make requests for route 13, the routeID passed is 14. The same disparity continues, suggesting that they’ve (sensibly) added primary keys to their database other than the route number. It turns out that ID is embedded in the markup within a comment showing the direction and the ID. For Route #50 that is:

<div id="stopListWrapper">
<!-- E -> 19 -->
<div id="stopList">
...
</div>
</div>

Since the document is already being parsed using hpricot, we can get that with:

internal_route_id = doc.at("div#stopListWrapper").children[1].to_s.match(/\-\> (\d+) \-\-\>/)[1]

(get the div, note that the comment is the second child, and get the data with a regular expression)
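The one-liner raises a NoMethodError if the comment ever goes missing, because `match` returns nil. A slightly more defensive sketch, working on the raw markup string so it runs without hpricot (the method name is made up, and it also pulls out the direction):

```ruby
SAMPLE_MARKUP = <<~HTML
  <div id="stopListWrapper">
  <!-- E -> 19 -->
  <div id="stopList">
  </div>
  </div>
HTML

# Returns [direction, internal_id], or nil when no such comment is present.
def direction_and_route_id(markup)
  match = markup.match(/<!--\s*(\w+)\s*->\s*(\d+)\s*-->/)
  match && [match[1], match[2].to_i]
end

p direction_and_route_id(SAMPLE_MARKUP)
```

Returning nil instead of raising lets the caller decide whether a missing comment should skip the route or stop the run.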

I’ve updated my scraper and the service to grab data based on the correct IDs. The HTML views will follow suit shortly.


The Rapid, the bus service for Grand Rapids and surrounding areas, recently redesigned their website. The redesign was long overdue and the result certainly looks a lot cleaner, if still far from inspiring. They’ve added a flash-based map showing their routes (though it could do with being a little larger on the page) and added PDF maps of each route (eg. this one for Route 6). Unfortunately as yet there’s no tool for working out routes, but that’s not a big surprise.

My favourite features, however, are not any of those mentioned above but the fact that each route now has a clean URL (eg. http://www.ridetherapid.org/ride/routes/6/) and a link to google maps for each stop, thereby exposing the coordinates of all the stops. With those two components in place, it becomes very easy to pull out the route data and begin to apply it to other uses. A ruby script (using _why’s excellent hpricot) to do just that would be:

#!/usr/bin/env ruby
require 'rubygems'
require 'open-uri'
require 'hpricot'
 
routes = (1..15).to_a.concat [24,28,37,44,49,50,51]
 
routes.each do |route|
  begin
    route_uri = "http://www.ridetherapid.org/ride/routes/#{route}/stops/"
    doc = Hpricot(open(route_uri))
    title = doc.at("h1").children[0].to_s.strip
    puts title
    stop_rows = doc.search("div#stopList table tr")
    # Skip the first row (index 0), which is not a stop
    stop_rows.to_a.drop(1).each do |row|
      link = row.at("a")
      next if link.nil?
      uri = link.attributes['href']
      name = link.children[0].to_s
      coords = uri.match(/q=(\d+\.\d+),\+(.*?)\&/)
      if coords
        latitude = coords[1]
        longitude = coords[2]
        puts "#{name} on route #{route} is at #{latitude}, #{longitude}"
      end
    end
  rescue => err
    puts "Problem retrieving #{route_uri}: #{err.message}"
  end
end
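The coordinate extraction can be tried on its own, without fetching anything. The link below is hypothetical, shaped only to fit the pattern the script uses rather than copied from the site, and `extract_coords` is a made-up name.

```ruby
# Same pattern as in the script above.
COORDS_PATTERN = /q=(\d+\.\d+),\+(.*?)\&/

# Returns [latitude, longitude] as floats, or nil if the link has no coordinates.
def extract_coords(uri)
  match = uri.match(COORDS_PATTERN)
  match && [match[1].to_f, match[2].to_f]
end

# Hypothetical link, for illustration only.
sample = 'http://maps.google.com/maps?q=42.9634,+-85.6681&z=16'
p extract_coords(sample)
```

Note that the pattern needs a trailing `&` after the longitude, so a link where the coordinates are the last parameter would not match.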

With this data available, it immediately becomes possible for local people and organizations to make use of it in a variety of ways–businesses could easily show the nearest bus stops to their locations, listing services can help visitors plan their routes, and those of us who aren’t fans of the flash-map could use other services to build alternatives.
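As a sketch of the nearest-stop idea, the haversine formula gives the great-circle distance between two latitude/longitude pairs, and the closest stop is the one with the smallest distance. The stop names and coordinates below are invented for illustration, not scraped data.

```ruby
EARTH_RADIUS_KM = 6371.0

# Great-circle distance between two [latitude, longitude] pairs, in km.
def haversine_km(from, to)
  lat1, lon1, lat2, lon2 = (from + to).map { |deg| deg * Math::PI / 180 }
  a = Math.sin((lat2 - lat1) / 2)**2 +
      Math.cos(lat1) * Math.cos(lat2) * Math.sin((lon2 - lon1) / 2)**2
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a))
end

# Pick the stop closest to a location given as [latitude, longitude].
def nearest_stop(location, stops)
  stops.min_by { |stop| haversine_km(location, stop[:coords]) }
end

# Invented stops and coordinates, for illustration only.
stops = [
  { :name => 'Stop A', :coords => [42.9634, -85.6681] },
  { :name => 'Stop B', :coords => [42.9700, -85.6500] }
]
puts nearest_stop([42.9690, -85.6510], stops)[:name]
```

For a few hundred stops a linear scan like this is plenty; a spatial index would only matter with far more data.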
