Google Forms: A Data Snout for a Bigger Creature
April 12, 2008
Navigate to Google’s Webmaster Central Blog. Scan the posting written by two wizards whom you probably don’t know, Alon Halevy (senior wizard) and Jayant Madhavan (slightly less senior wizard). Here’s what you will be told in well-chosen, Googley prose:
In the past few months we have been exploring some HTML forms to try to discover new web pages and URLs that we otherwise couldn’t find and index for users who search on Google. Specifically, when we encounter a <FORM> element on a high-quality site, we might choose to do a small number of queries using the form. For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes, and radio buttons on the form, we choose from among the values of the HTML. Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the Web page resulting from our query is valid, interesting, and includes content not in our index, we may include it in our index much as we would include any other web page.
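The steps in the quoted passage (pick words from the site for text boxes, pick listed values for menus, then build the URLs) can be sketched roughly as follows. This is a minimal illustration, not Google's method: the function name `candidate_urls`, its parameters, and the example.com form are all hypothetical, it assumes a form submitted by GET, and it leaves out the crawl itself and the check that a result page is "valid, interesting, and includes content not in our index."

```python
from itertools import islice, product
from urllib.parse import urlencode


def candidate_urls(action, text_inputs, choice_inputs, site_words, limit=5):
    """Build a small number of GET URLs that mimic plausible form submissions.

    All names here are illustrative assumptions, not taken from the Google post.
    action        -- the form's action URL
    text_inputs   -- names of the free-text fields
    choice_inputs -- mapping of field name -> values listed in the HTML
    site_words    -- words drawn from the site, used to fill text boxes
    limit         -- cap on generated URLs ("a small number of queries")
    """
    names = list(text_inputs) + list(choice_inputs)
    # Text boxes draw from the site's own words; menus draw from their options.
    value_lists = [list(site_words) for _ in text_inputs] + [
        list(values) for values in choice_inputs.values()
    ]
    combos = product(*value_lists)
    # Take only the first `limit` combinations and encode each as a query string.
    return [
        f"{action}?{urlencode(dict(zip(names, combo)))}"
        for combo in islice(combos, limit)
    ]


if __name__ == "__main__":
    for url in candidate_urls(
        "http://example.com/search",          # hypothetical form action
        ["q"],                                # one text box
        {"region": ["north", "south"]},       # one select menu
        ["flights", "fares"],                 # words found on the site
        limit=3,
    ):
        print(url)
```

The `limit` cap matters more than it looks: a form with a few menus multiplies into thousands of combinations, which is presumably why the post talks about "a small number of queries."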
The idea is that dynamic content does not usually appear in an index. On the public Internet, this type of content is useful to me. For example, when I want to take a Southwest flight, I have to fill in some annoying Southwest forms, fiddle with drop down boxes, and figure out exactly which fare is likely to let me sit in one of the “choice” seats by boarding first. Wouldn’t it be great to be able to run a query on Google, see the flights aggregated, and from that master list jump to the order form? Dynamic content is now becoming more common.
I heard from one wizard at a conference in London that dynamic content is now more than half of the content appearing on the Web. The shift from static to dynamic is, therefore, a fundamental change in the way Web plumbing works, from Web log content management systems to the sprawling craziness of Amazon.com.
A diagram from Dr. Guha’s patent applications with the Context Server shown in relation to the other parts of the PSE. This is a figure from Google Version 2.0: The Calculating Predator, published by Infonortics, Ltd., Tetbury, Glou. in July 2007. Infonortics holds the copyright to this study and its contents.
You may want to poke around in Google’s five patent applications for Ramanathan Guha’s Programmable Search Engine. Before the meltdown, I worked with BearStearns to document the technologies of the PSE and provide some examples of how it allows Google to perform some nifty “context” functions; that is, the ability of Google to accept explicit instructions from Web site operators with structured data and dynamic sites regarding what can and cannot be done with the data. If this seems like the next step for forms, you are thinking along the same lines I am.
I don’t think “context” in the PSE means just what you want your smartphone displaying as you rush to catch a flight. This is part of what Google’s engineers, in an open source document, called “I’m feeling doubly lucky”; that is, search performed automatically based on what you are doing and likely to need at a specific point in time.
Context applies to data (ergo the link to “forms”) and the machine processes operating when certain actions occur. Context at Google is a many-splendored thing, assuming I read these Google patent applications correctly:
- US2007 0038616, filed on April 10, 2005, and published on February 15, 2007 as “Programmable Search Engine”
- US2007 0038601, filed on August 10, 2005, and published on February 15, 2007, as “Aggregating Content Data for Programmable Search Engines”
- US2007 0038603, filed on August 10, 2005, and published on February 15, 2007, as “Sharing Context Data across Programmable Search Engines”
- US2007 0038600, filed on August 10, 2005, and published on February 15, 2007, as “Detecting Spam-Related and Biased Contents for Programmable Search Engines”
- US2007 0038614, filed on August 10, 2005, and published on February 15, 2007, as “Generating and Presenting Advertisements Based on Context Data from Programmable Search Engines”.
You can download these exceptionally interesting and seminal documents from the USPTO’s wacko online search and retrieval system. Tip: read the syntax examples; otherwise, you won’t get your documents.
Here’s a snippet of my discussion of the PSE from Google Version 2.0:
The PSE is not a revolutionary invention for Google. The Guha inventions use bits and pieces of other Google inventions, the BigTable database, the Google File System, the Google operating environment, and the Googleplex’s high-speed, distributed computing system. What’s significant about the PSE is that it indicates that Google wants to remain on top in search and advertising. Furthermore, in order to make its other big plays have a better chance of succeeding, Google has to solve the problems that plague its “old” PageRank-based system. Equally important, Google must respond to the dramatic increase in dynamic Web sites and the users’ hunger for mashups and useful applications based on finding, slicing, and dicing Web content into useful, interesting, and entertaining new forms. (From Stephen E. Arnold’s Google Version 2.0, “Google and the Programmable Search Engine,” July 2007).
Outlines Now Discernible
The pieces are starting to become more visible to developers: BigTable in training wheels mode, the App Engine with a venturi limiting what can flow through the App Engine, and the REST search API, et al. FORMS is not a trivial Google function; FORMS is one part of the umbrella technology that Nimble explored years ago. There are hundreds of posts about Google’s recent announcements, so FORMS is likely to get lost.
The key point is that Google is putting something far larger in place, and it is interesting to me how the many gurus, SEO mavens, and Google observers nibble around the edges of the cookie as Google builds a giant digital confection factory. Scale is the problem. Google thinks big. Technology is a problem. Google is a little more sophisticated than most of the MBAs and Google watchers perceive. It’s hiding in plain sight in FORMS, App Engine, the REST API, Gadgets, and other “betas” which are inconsequential when considered individually. My work is focused on trying to figure out how these pieces fit together. FORMS is a useful clue.
Now PSE is not referenced in the Drs. Halevy and Madhavan post, but you may be able to find some of their technical work on the Web. When I poke around for data about these wizards’ research, I find some interesting pointers to work at Stanford University, companies such as Nimble Technology, and interesting papers such as this one: “Structured Data Meets the Web: A Few Observations”.
More about PSE and FORMS
If you are interested in the business implications of this “FORMS” announcement, you will find a deeper discussion of the subject in my Google Version 2.0 (Infonortics, 2007) and Beyond Search (Gilbane Group, 2008). In Google Version 2.0 I dig into the work of Dr. Halevy and his team at Google. As far as I know, there’s not too much analysis of these technology tributaries. It was amusing to see the 24 hour reaction to BearStearns’ report “Google and the Semantic Web”.
A forthcoming KMWorld column takes a deeper look at the Google App Engine with a few references to the PSE and the FORMS announcement. I think the publication date for that piece will be June 2008, and it will be available at the KMWorld Web site upon publication.
What’s a snout? A snout is a nose or a proboscis like an elephant’s. The snout can be poked into an area to sniff and inhale intelligence. The question is, “What is the creature going to do after the first sniff or two?”
Stephen Arnold, April 12, 2008