More Transformation Goodness from the Googleplex
July 8, 2008
In press is one of my for-fee write-ups that talks about the black art of data transformation. I will let you know when it is available and where you can buy it. The subject of this for-fee “note” is one of the least exciting aspects of search and content processing. (I’m not being coy. I am prohibited from revealing the publisher of this note, the blue-chip company issuing the note, and any specific details.) What I can do is give you a hint. You will want to read this Web log post at Google Code: Open Source Google. News about Google’s Open Source Projects and Programs here. You can read other views of this on two other Google Web logs: The Official Google Web log here and Matt Cutts’s Web log here. You will also want to read the information on the Google project page.
The announcement by the Googley Kenton Varda, a member of the software engineering team, is “Protocol Buffers: Google’s Data Interchange Format”. Okay, I know you are yawning, but the DIF (data interchange format, an acronym for something that can chew up one-third of an information technology department’s budget) is reasonably important.
The purpose of a DIF is to take content (Object A in Format X) and, via the magic of a method, change that content into Format Y. Along the way, some interesting things can be included in the method. For example, nasty XML can be converted into little angel XML. The problem is that XML is a fat pig format and fixing it up is computationally intensive. Google, therefore:
developed Protocol Buffers. Protocol Buffers allow you to define simple data structures in a special definition language, then compile them to produce classes to represent those structures in the language of your choice. These classes come complete with heavily-optimized code to parse and serialize your message in an extremely compact format. Best of all, the classes are easy to use: each field has simple “get” and “set” methods, and once you’re ready, serializing the whole thing to – or parsing it from – a byte array or an I/O stream just takes a single method call.
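To make "extremely compact" concrete, here is a minimal sketch in Python. It is not Google's generated code and not the library's API. It assumes a hypothetical Person message with two fields and hand-encodes it following the publicly documented wire format (a one-byte field key followed by a varint or a length-prefixed string). The names encode_person and decode_person are invented for illustration; real code would use the classes the protocol buffer compiler generates.

```python
# Hypothetical message, written in the Protocol Buffers definition language:
#
#   message Person {
#     required int32 id = 1;
#     required string name = 2;
#   }
#
# Everything below is a hand-rolled illustration of the wire format,
# not the generated classes a real project would use.

WIRE_VARINT = 0  # wire type for integers
WIRE_BYTES = 2   # wire type for length-delimited data such as strings


def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_varint(data, pos):
    """Decode a varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_person(person_id, name):
    """Serialize the hypothetical Person message to bytes."""
    raw_name = name.encode("utf-8")
    return (
        encode_varint((1 << 3) | WIRE_VARINT) + encode_varint(person_id)
        + encode_varint((2 << 3) | WIRE_BYTES)
        + encode_varint(len(raw_name)) + raw_name
    )


def decode_person(data):
    """Parse bytes produced by encode_person back into a dict."""
    fields = {}
    pos = 0
    while pos < len(data):
        key, pos = decode_varint(data, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == WIRE_VARINT:
            value, pos = decode_varint(data, pos)
        elif wire_type == WIRE_BYTES:
            length, pos = decode_varint(data, pos)
            value = data[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("wire type not handled in this sketch")
        fields[field_number] = value
    return {"id": fields.get(1), "name": fields.get(2)}


xml_form = "<person><id>150</id><name>testing</name></person>"
binary_form = encode_person(150, "testing")

print(len(xml_form), "bytes as XML")
print(len(binary_form), "bytes in the compact form:", binary_form.hex())
print(decode_person(binary_form))
```

The same two values that need dozens of characters of angle brackets fit in about a dozen bytes, because field names are replaced by small numeric tags and integers take only as many bytes as they need. With the generated classes, none of this bit-twiddling is visible to the developer.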
The approach is sophisticated and subtle. It shaves with Occam’s Razor, and it is now available to the Open Source community. Why? In my opinion, this is Google’s way of cementing its role as the giant information blender. If protocol buffers catch on, a developer can slice, dice, julienne, and chop without some of the ugly, expensive, hand-coded stuff the “other guys’ approach” forces on developers.
There will be more of this type of functionality “comin’ round the mountain, when she comes,” as the song says. When the transformation express roars into your town, you will want to ride it to the Googleplex. It will work; it will be economical; and it will leapfrog a number of pitfalls developers unwittingly overlook.
Stephen Arnold, July 8, 2008