I’m currently building a site where users enter details of the books they’re reading and the music they’re listening to. We want to list those on their profiles and also find ways of matching users based on their choices. We’re looking at a number of ways of doing that, including matching on genre, and to achieve that I needed a way to identify the genres of the books and music they list.
I knew that Amazon provided that data through their Associates Web Service (formerly E-Commerce Service), so I fired up the trusty amazon-ecs gem and pulled down the data for a couple of books and CDs, e.g.:
res = Amazon::Ecs.item_search(asin, :search_index => 'Music', :country => 'uk', :response_group => 'Large')
Looking at the results, I could see that Amazon assign a varying number of ‘browsenodes’ to each item. Browse nodes are a flexible system that lets Amazon associate arbitrary hierarchical data with items in their catalogue. The documentation describes them as:
Browse nodes are categories into which items for sale are organized. A single node might have many items associated with it. In the above example, the child node, “Boxed Sets,” might have the items “Abbott and Costello Collection,” and “Laurel and Hardy Collection” associated with it.
The number of items associated with a browse node can change radically over time as items are added for sale, or as items go out of stock and are no longer sold. For example, for the browse node, TopSellers, items are attached and unattached according to their sales.
Even browse nodes themselves are created and deleted as items demand. When, for example, a new toy starts selling briskly, there may not be a node that appropriately categorizes the toy. In that case, a node would be created and the toy would be associated with the node. Then, if the sale of the toy died out, the node might be deleted. Other nodes are much longer lived. Top level nodes, for example, “Books” and “Apparel,” have remained unchanged for years.
That means that an album may have browsenodes giving us values such as “Pop”, “Jazz”, and “Hard Bop”, but the same item may also have browsenodes with names like “Bestsellers” and “40% off sale”, plus all sorts of other data that is useless for our purposes. Thankfully, it’s fairly easy to see that the genre data all sits under “Styles” (for music) and “Subjects” (for books), which have the IDs 520920 and 1025612 respectively in the UK.
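To make that filtering idea concrete, here is a minimal sketch that works on plain Ruby hashes rather than real API responses. The hash layout, the `genre_names` helper, and every ID except 520920 are invented for illustration; only the "keep a node if the Styles ID appears among its ancestors" rule comes from the description above.

```ruby
# Hypothetical stand-in for parsed browse nodes: each entry carries its own
# name plus the IDs of its ancestors. Only 520920 ("Styles", UK music) comes
# from the post; every other ID here is made up.
STYLES_ROOT_ID = 520920

SAMPLE_NODES = [
  { name: 'Hard Bop',     ancestor_ids: [1001, 520920] },
  { name: 'Pop',          ancestor_ids: [520920] },
  { name: 'Bestsellers',  ancestor_ids: [2002] },
  { name: '40% off sale', ancestor_ids: [3003, 2002] }
].freeze

# Keep only the names of nodes that sit somewhere beneath the given root.
def genre_names(nodes, root_id)
  nodes.select { |node| node[:ancestor_ids].include?(root_id) }
       .map { |node| node[:name] }
       .uniq
end

puts genre_names(SAMPLE_NODES, STYLES_ROOT_ID).inspect
# prints ["Hard Bop", "Pop"]
```

The promotional nodes drop out simply because the Styles ID never appears in their ancestor chain.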
Once I had those parents identified it was easy enough to extend the Amazon::Element class to add a method to read out the appropriate browsenodes and return them:
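class Amazon::Element
  # Returns the unique names found in this item's browse nodes that sit
  # beneath the browse node with the given ID (e.g. "Styles" or "Subjects").
  def extract_genre_children_of(browse_node_id)
    browse_nodes = elem.search('browsenodes>browsenode')
    genres = browse_nodes.inject([]) do |collection, bnode|
      # Only keep nodes that have the requested ID among their ancestors.
      if bnode.search("//ancestors//browsenode/browsenodeid[text()='#{browse_node_id}']").length > 0
        collection += bnode.search('name').collect(&:inner_text)
      end
      collection
    end
    genres.uniq
  end
end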
That code’s far from perfect - it returns a few higher level items I subsequently filter out - but it’s doing the job for me.
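As a rough sketch of what that kind of follow-up filter could look like, assuming the unwanted entries are simply the names of the parent categories themselves: the `HIGHER_LEVEL_NAMES` list and the `drop_higher_level` helper below are guesses for illustration, not the filter actually used on the site.

```ruby
# Hypothetical exclusion list: names of the parent categories themselves.
# The post does not say which entries were filtered out, so this is a guess.
HIGHER_LEVEL_NAMES = ['Music', 'Styles', 'Books', 'Subjects'].freeze

# Remove any names that appear in the exclusion list, keeping the original order.
def drop_higher_level(names, excluded = HIGHER_LEVEL_NAMES)
  names.reject { |name| excluded.include?(name) }
end

puts drop_higher_level(['Styles', 'Jazz', 'Hard Bop', 'Music']).inspect
# prints ["Jazz", "Hard Bop"]
```

A fixed list like this is brittle if Amazon renames a category, so filtering on browse node IDs instead of names would probably be more robust.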
Of course, the resulting data is variable in quality. I certainly wouldn’t call Battles’ Mirrored “Adult Contemporary”, but it’s a start towards finding some genres automatically, and hopefully it will become more useful as we build up a large enough store of data. In the meantime, we’re also experimenting with MusicBrainz to pull in some extra data from their tags.