Posts tagged RDF

Book Review: PHP Web 2.0 Mashup Projects

The market for books about mashups has become fairly crowded over the past few years, but none have really enticed me: from a casual look, most seem more interested in following the trend than in offering solid information. Thankfully, PHP Web 2.0 Mashup Projects manages to slide in a good number of practical programming tips as it works its way through a variety of services.

The book dedicates the majority of each chapter to more general concerns than just interfacing with the system in the chapter’s title. So Chapter 2, “Buy It On Amazon”, spends most of its time exploring XML-RPC and REST approaches and building tools to work with those different styles of interface. Similarly, the next chapter spends most of its time introducing WSDL, XML Schema and SOAP before showing how they can be used with Microsoft Live Search.

In fact, that chapter may be one of the best introductions I’ve seen for developers who need to quickly grasp the basics of WSDL and SOAP, a topic that can far too easily get bogged down in complexity that isn’t needed for basic usage. With the WS-* stack quickly, and for good reason, going out of fashion, hopefully most developers won’t have to spend much time with it, but a simple overview is still very handy.

I was intrigued to see the final chapter diving into use of RDF with the RAP toolkit. Like the SOAP section, this managed to boil the basics of RDF down very well and should help most moderately experienced PHP developers to get up to speed quickly.

Aside from a closing section on race conditions, not much time is given to handling interruptions in third-party services. In a book focussed on mashups that’s disappointing, particularly as the number of services, and so the range of fallback options, is increasing. Some of the examples are likely to fail if services time out, and it would have been good to spend some time on helping developers avoid that.
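To illustrate the sort of defensive handling I mean, here is a minimal sketch in Python rather than the book’s PHP. It is not taken from the book. The function name, the `opener` parameter and the idea of passing an ordered list of equivalent services are all my own assumptions for the example.

```python
import urllib.request


def fetch_with_fallback(urls, timeout=5, opener=urllib.request.urlopen):
    """Return the body from the first URL that answers within the timeout.

    Tries each URL in order and moves on when one fails or times out.
    Returns None if every service is unavailable, so the caller can
    degrade gracefully instead of crashing.
    """
    for url in urls:
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except OSError:
            # URLError and socket timeouts are both OSError subclasses:
            # this service is down or slow, so try the next one.
            continue
    return None
```

Passing the opener in as a parameter is only there so the function can be exercised without a network connection; in a real mashup the interesting decision is what to show the user when it returns None.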

Reading the book as someone who has mostly left the PHP fold for pastures new was a reminder of how easy tools like hpricot make life for screen scrapers, but also that good structure can emerge in PHP code and that the SOAP tools are actually quite good for simple uses. The book is unlikely to appeal to those who don’t do much work with PHP, but if you’re a PHP developer and want to dive into mashups and web services for the first time, it’s worth a look.

Disclaimer: I was sent a copy of this book for review by the publisher, and offered another in return for a timely review. You can find it at Packt, Amazon US, Amazon UK and all sorts of other places.

Content management with subversion

A recent comment reminded me of an old entry proposing yet another project I never had time to follow through with: Using Trac and Subversion with Social Documents. The idea there was to make use of subversion’s utility for version control and trac’s existing frontend for browsing that to present versioned documents.

In hindsight, I don’t think trac would actually be a good frontend for this unless the intended audience was entirely techies. Trac works for those of us who use it every day to follow a variety of projects, and its ability to combine a wiki with version control of the ‘official’ versions of documents provides some interesting ideas, but the interface just wouldn’t work.

But even if I’m unlikely to get time to play with it, I’m still interested in the idea of using subversion as the core for content management. It seems a sensible application of “small pieces loosely joined” to use a proven version control system as one layer in a system. So I’ll be interested to follow Bob DuCharme’s work to use subversion for Digital Asset Management in a CMS.

Bob’s looking into using svn’s ability to store arbitrary metadata to hold RDF relating to each revision, and exploring how its hook mechanism can be employed to make it all work. As ever, the proof will be in the interface, but the underlying principles definitely deserve exploration.

Feed Parser: Universal Feed Parser Tests

Inspired by Sam Ruby’s work on applying the Universal Feed Parser tests to the Ruby FeedTools, I’ve spent a little time this afternoon working on testing XML_Feed_Parser with that same test suite. There’s a lot of work to do!

UFP’s tests consist of a series of feed files, some well-formed and some ill-formed, with a description and test condition defined at the top of each file. For example:

<!--
Description: channel description
Expect:      not bozo and feed['description'] == u'Example description'
-->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://purl.org/rss/1.0/">
<channel rdf:about="http://example.com/index.rdf">
<description>Example description</description>
</channel>
</rdf:RDF>
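Pulling the description and test condition out of that leading comment is straightforward. The following is a minimal sketch in Python, assuming every test file opens with a comment in exactly the shape shown above; the function name and regular expression are my own and not part of either parser.

```python
import re

# Matches the leading comment and captures its two fields.
HEADER_RE = re.compile(
    r"<!--\s*Description:\s*(?P<description>.*?)\s*"
    r"Expect:\s*(?P<expect>.*?)\s*-->",
    re.DOTALL,
)


def read_test_header(feed_text):
    """Return (description, expect) from the leading comment of a test feed."""
    match = HEADER_RE.search(feed_text)
    if match is None:
        raise ValueError("no Description/Expect header found")
    return match.group("description"), match.group("expect")


sample = """<!--
Description: channel description
Expect:      not bozo and feed['description'] == u'Example description'
-->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://purl.org/rss/1.0/">
</rdf:RDF>"""

description, expect = read_test_header(sample)
print(description)
print(expect)
```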

So far all I’ve done is run a script through all the tests for well-formed feeds, testing whether XML_Feed_Parser throws an exception when I try to interpret them. When run against the current CVS, 1181 of the 1273 feeds parsed successfully and 92 failed. 68 of those failures were due to encoding problems (which I’ll try to work around, but won’t be able to cleanly fix until PHP has full unicode support), and another 17 were a result of not supporting CDF, leaving another seven I need to get fixed as soon as possible.

The next stage will be to translate the ‘Expect:’ values into something I can use in a PHP test case. I’ve done a little searching for a Python lexer for PHP, but aside from this embedded interpreter that hasn’t had a release in nearly three years, I haven’t found one. Lacking the time to write such a beast myself, I suspect I’ll simply put together a series of regexps to do the translation necessary.
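To give a feel for what that regexp translation might look like, here is a rough sketch, written in Python for brevity. It only covers the simplest feed-level comparison, like the one in the example above, and the `$feed->name` form it produces is a hypothetical target that I have made up for illustration, not XML_Feed_Parser’s confirmed API.

```python
import re


def expect_to_php(expect):
    """Translate a simple UFP 'Expect:' expression into a PHP-style one.

    Only handles feed-level comparisons such as
    feed['description'] == u'...'. The $feed->name output form is a
    hypothetical target chosen for this sketch.
    """
    # 'not bozo' only asserts the feed is well-formed; drop it here.
    php = re.sub(r"^\s*not bozo and\s+", "", expect)
    # Python 2 unicode literals: u'text' -> 'text'
    php = re.sub(r"\bu(?=')", "", php)
    # feed['key'] -> $feed->key
    php = re.sub(r"\bfeed\['(\w+)'\]", r"$feed->\1", php)
    return php


print(expect_to_php("not bozo and feed['description'] == u'Example description'"))
```

Anything involving entries, nested dictionaries or Python-specific values would need further rules, which is where a proper lexer would start to pay for itself.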

Of course, XML_Feed_Parser’s API differs in quite a number of ways from that of the Universal Feed Parser, and so quite a few of those tests, unadjusted, would fail. As Sam points out, there would be numerous advantages to (roughly) sharing an API with the Universal Feed Parser, particularly in allowing programmers to easily switch between languages and in the fact that the documentation already written would also apply to XML_Feed_Parser, which is (as yet) undocumented. I’m going to spend some time thinking through the implications of making some API adjustments to fit more closely, but I’d love input on how far I should go (is it worth breaking backwards compatibility?).

Solvent: Semantic data from almost any page

Spending a weekend in Chicago last month and looking for a non-Starbucks coffee shop in the Loop, I was frustrated to find that the otherwise very handy delocator.net didn’t have an option to limit a search to a radius of less than 5 miles or to plot a group of results on a map. We eventually gave up and went to one of the many Starbucks highly visible in our immediate vicinity.

Of course, I could have written a scraper to pull the data off delocator’s results page and produce a map from it, but it would likely have taken more than the 5 minutes I had available. What I needed was Solvent. According to its creators at the Simile project, Solvent is “a Firefox extension that helps you write Javascript screen scrapers for Piggy Bank” and their screencast displays someone solving exactly the problem I found myself faced with.

As the screencast shows, extracting data from any page that has some structure to it is as simple as firing up the plugin, highlighting a few lines and selecting an appropriate description for them. The interface will feel familiar to anyone who’s worked with JavaScript debuggers, and it only takes a couple of minutes to get the data off the page, into Piggy Bank and, thanks to Piggy Bank’s Google Maps integration, onto a map.

For those who are comfortable with the DOM and JavaScript, this is a fantastic tool. Along with the growing suite of microformats and the Greasemonkey scripts Mark Pilgrim is writing to parse them, this project shows that we’re rapidly moving towards a world where a decentralized store of semantically-rich information is possible.

Simile even have a companion project, Semantic Bank, that provides long-term storage of the captured data. It would be nice if users were prompted to set up an account with that (or other semantic banks) when they first install Piggy Bank. Couple that with some UI developments to make both Solvent and Piggy Bank more accessible to the non-technical user, and we could quickly see publishing data to the Semantic Web become as simple as blogging.

Microformats and extensibility

I’ve been following the chatter over microformats (XFN, xFolk, hCalendar, and their kin) for some time, but have been having a hard time formulating a response to all the discussion. In particular, the discussion over at Ryan’s blog and some postings such as this one by Danny Ayers have triggered further thinking.

The idea of ‘emergent semantics’ is an appealing one and, as many have argued, lower-case semantics are far more likely to be adopted by a broad sweep of the web development community in the short term than are carefully constructed XML vocabularies or RDF representations of resources. But at the same time I fear that this sort of format will delay adoption of ‘true’ Semantic Web technologies, and I balk at the apparent lack of extensibility the microformats offer.

As I work on plans for some future web app development, RDF has become more and more appealing because it is decentralised and allows for the representation of complex relationships between items. If I need to attribute a property that my current vocabulary doesn’t support, there is a standard system of drawing in another namespace which my tools can automatically understand. By contrast, (X)HTML only allows for rudimentary relationships, and there is no standardised way of indicating within the document which vocabularies tools should expect.
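A small sketch of what that namespace mechanism buys you, in Python. This is not a real RDF parser (RDF/XML allows many more forms than this handles, and a proper toolkit should be used in practice); it just shows that a property drawn in from a second vocabulary resolves to a full URI in the same way as the first. The sample document, and the choice of Dublin Core as the second vocabulary, are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# RSS 1.0 as the base vocabulary, with one Dublin Core property mixed in.
RDF_XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns="http://purl.org/rss/1.0/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel rdf:about="http://example.com/index.rdf">
    <description>Example description</description>
    <dc:creator>Example author</dc:creator>
  </channel>
</rdf:RDF>"""


def properties(rdf_xml):
    """Yield (subject, property URI, value) for each simple literal property."""
    root = ET.fromstring(rdf_xml)
    for resource in root:
        subject = resource.get("{%s}about" % RDF_NS)
        for prop in resource:
            # ElementTree names are '{namespace}local'; joining the two
            # gives the full property URI, whichever vocabulary it is from.
            namespace, local = prop.tag[1:].split("}")
            yield subject, namespace + local, prop.text


for triple in properties(RDF_XML):
    print(triple)
```

The code needs no knowledge of either vocabulary in advance: adding a third namespace to the document would simply produce more property URIs.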

That’s not to say that microformats aren’t useful. Now that we have a large community of developers building standards-compliant sites it makes sense to work towards standardised class names for certain page elements and types of content. Having developed a number of screen-scrapers (and suffered the pain when a non-standards redesign then obliterates all that work), I’d love to be rid of the need to re-code whenever a manager decides a layout needs a slight change. But it will be to the benefit of all of us if we ensure that that standardisation doesn’t distract us from improving tools for generating and interpreting RDF, simplifying content-negotiation options, and otherwise making the web more interoperable.
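The scraping benefit is easy to sketch. The Python below is a minimal illustration, not a complete microformat parser: it collects the text of any element carrying a given class name, so the same code keeps working when the surrounding markup is redesigned. The two sample pages are invented for the example; `vevent` and `summary` are hCalendar class names.

```python
from html.parser import HTMLParser

# Elements with no closing tag, which must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0  # greater than zero while inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif self.class_name in classes:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract(html, class_name):
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


# The same event before and after a hypothetical redesign.
page_v1 = '<div class="vevent"><span class="summary">Example event</span></div>'
page_v2 = '<li class="vevent"><h3 class="event-title summary">Example event</h3></li>'

print(extract(page_v1, "summary"))
print(extract(page_v2, "summary"))
```

Because the extractor keys on the class name rather than on tag names or document position, the move from a div and span to a list item and heading makes no difference to it.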