Converting HTML to Textile with Ruby

One of the many tricky decisions to be made when building content management tools is how to allow users to control the basic formatting of their input without breaking your carefully crafted layouts or injecting nasty hacks into your pages. One long-standing approach is to provide your own markup language. Instead of allowing users to write HTML, let them use BBCode, Markdown, or Textile, which have more controlled vocabularies and rules, making problems much less likely.

Textile in particular has a nice simple syntax and is increasingly popular thanks to its adoption in products like those of 37signals. In Ruby, there’s the RedCloth library, which makes it fast and easy to convert Textile to HTML. The one problem comes if you already have a body of user-generated HTML in your legacy system that needs converting. That’s the situation I found myself in this week, and I needed a tool quickly to translate the content so that I could get on with the more interesting parts of the system.

Searching for options, I found the ClothRed library, which offers some translation but doesn’t handle important elements like links. I considered patching it to handle the elements I needed, but in the end I decided to take a different approach and used the SGML parsing library found here to port a Python html2textile parser.
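For anyone unfamiliar with SGML-style parsers, the general shape is that the parser walks the markup and calls a start_TAG or end_TAG method for each tag it meets, and those handlers emit the Textile equivalent. The following is only a rough sketch of that idea, not the code linked from this post: a toy regex tokenizer stands in for the real SGML library, and the class name TinyHtml2Textile and its handlers are hypothetical.

```ruby
# Toy illustration of handler dispatch; a regex stands in for a real SGML parser.
class TinyHtml2Textile
  # Matches either a tag (closing slash, name, raw attributes) or a run of text.
  TOKEN = /<(\/?)(\w+)([^>]*)>|([^<]+)/

  def initialize
    @out = +""
    @href = nil
  end

  def convert(html)
    html.scan(TOKEN) do |closing, tag, attrs, text|
      if text
        @out << text
      else
        handler = "#{closing.empty? ? 'start' : 'end'}_#{tag.downcase}"
        next unless respond_to?(handler, true) # silently skip unknown tags
        closing.empty? ? send(handler, attrs) : send(handler)
      end
    end
    @out.strip
  end

  # Generate start_h1..start_h6 and end_h1..end_h6 handlers.
  (1..6).each do |num|
    define_method("start_h#{num}") { |_attrs| @out << "h#{num}. " }
    define_method("end_h#{num}") { @out << "\n\n" }
  end

  # Paragraphs are plain text separated by a blank line.
  def start_p(_attrs); end
  def end_p; @out << "\n\n"; end

  def start_strong(_attrs); @out << "*"; end
  def end_strong; @out << "*"; end

  # Textile links look like "text":url, so remember the href until the tag closes.
  def start_a(attrs)
    @href = attrs[/href="([^"]*)"/, 1]
    @out << '"'
  end

  def end_a
    @out << "\":#{@href}"
    @href = nil
  end
end

html = '<h2>Title</h2><p>See <a href="http://example.com">the docs</a> for <strong>more</strong></p>'
puts TinyHtml2Textile.new.convert(html)
```

A real converter also has to deal with nesting, whitespace, lists, entities and malformed markup, which is where a proper parser earns its keep.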

Porting code from Python to Ruby is a pretty straightforward process as the languages are so similar on a number of levels, but there were several issues to work through, particularly relating to scoping, and quite a few methods to change to make them feel a little more Ruby-ish. I’ve not converted all of the entity handling as I didn’t really need it, and there might still be a bit of work to do in making sure character set issues are properly taken care of.
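One Ruby scoping detail that is easy to trip over when generating methods or blocks in a loop is that for does not create a new scope, while each gives its block a fresh variable on every iteration. The snippet below is a standalone illustration of that behaviour and is not taken from html2textile.rb; the variable names are made up for the example.

```ruby
# `for` reuses a single variable across iterations, so every lambda
# created inside the loop sees whatever value it holds at call time.
with_for = []
for n in 1..3
  with_for << -> { "h#{n}" }
end

# `each` passes a new block-local variable per iteration, so every
# lambda keeps the value it was created with.
with_each = []
(1..3).each do |m|
  with_each << -> { "h#{m}" }
end

p with_for.map(&:call)   # => ["h3", "h3", "h3"]
p with_each.map(&:call)  # => ["h1", "h2", "h3"]
```

The same applies to blocks handed to define_method, so iterating with each is usually the safer choice when the loop variable ends up inside a generated method.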

The end result is a piece of code that’s now served its purpose and that I’m unlikely to need again for quite a while. It’s not something that I’m particularly proud of, and it could almost certainly be implemented more neatly, but I thought I’d throw it out there in case it could be useful to someone else. Should you be inspired to take it and turn it into a more robust and properly distributable solution, feel free, but please let me know so that at the very least I can update this entry.

Grab the code here or view it here.

UPDATE (March ’09): I’ve moved the code to gist.github.com as past.ie seems rather unreliable these days.


8 comments

  1. Your linky to the code is a bit sick.

  2. Thanks. I’ve fixed it now.

  3. I’m getting the following error:

    wrong number of arguments (1 for 2)
    html2textile.rb in “make_block_start_pair” on line 96

    It seems to be calling start_h1 with no attributes, which is causing the error. If I default attributes to nil it doesn’t error, but it makes all headings h7s.

  4. Ok. Dug around a bit. html2textile.rb lines 95-97 should be something like this:

    define_method "start_h#{num}" do |attrs|
      make_block_start_pair("h#{num}", attrs)
    end

    This passes in the attrs into the make_block_start_pair method which solves the wrong number of arguments issue. I’m still trying to figure out why it’s using h7 instead of h1 or h2. Seems to be calling the correct method name but not working. I’ll comment again if I figure it out.

  5. Not sure what the difference is between for num in 1..6 and an each but the for..in wasn’t working for me. I changed it to:

    %w[1 2 3 4 5 6].each do |num|
      define_method "start_h#{num}" do |attrs|
        make_block_start_pair("h#{num}", attrs)
      end

      define_method "end_h#{num}" do
        make_block_end_pair
      end
    end

    And all was well. Really nice work. Thanks for getting this going. I had something started but never got around to finishing it. Pastie version here (http://pastie.caboo.se/121356)

  6. Thanks John. I guess it’s clear none of the code I was converting contained heading tags…

    I’ve actually done a little bit more refactoring. The download has been updated and the new pastie is at

    http://pastie.caboo.se/121382

  7. Excellent… you should make a gem from it!

    I have a little problem with special chars (é for example) that are “stripped”… any idea on how to fix this?

    Thanks a lot!

  8. Thanks Julien. Special char handling is one of the features I didn’t have time to port from the python code but hope to at some point.

    Releasing a gem would be a good idea. I’m totally snowed under at the moment, but if I manage to find time to write some specs or tests I’ll see about releasing it. Naturally I’ll post to the blog if I do that.