How Smart Is Google’s Software?

September 17, 2008

When you read this, I will have completed my “Meet the Guru” session in Utrecht for Eric Hartmann. More information is here. My “guru” talk is not worthy of its name. What I want to discuss is the relationship between two components of Google’s online infrastructure. This venue will mark the first public reference to a topic I have been tracking and researching for several years–computational intelligence. Some background information appears in the Ignorance Is Futile Web log here.

I am going to reference my analysis of Google’s innovation method. I described this in my 2007 study The Google Legacy, and I want to mention one Google patent document; specifically, US20070198481, which is about fact extraction. I chose this particular document because it references research that began a couple of years before the filing and the 2007 granting of the patent. It’s important in my opinion because it reveals some information about Google’s intelligent agents, which Google references as “janitors” in the patent application. Another reason I want to highlight it is that it includes a representation of a Google results list as a report or dossier.

Each time I show a screen shot of the dossier, any Googlers in the audience tell me that I have Photoshopped the Google image, revealing their ignorance of Google’s public patent documents and the lousy graphical representations that Google routinely places in its patent filings. The quality of the images and the cute language like “janitors” are intended to make it difficult to figure out what Google engineers are doing in the Google cubicles. Any Googlers curious about this image (reproduced below) should look at Google’s own public documents before accusing me of spoofing Googzilla. This now happens frequently enough to annoy me, so, Googlers, prove you are the world’s smartest people by reading your own patent documents. That’s what I do to find revealing glimpses such as this display for a search of the bound phrase “Michael Jackson”:

The highlight boxes and call outs are mine. What this diagram shows is a fielded (structured) report or dossier about Michael Jackson. The red vertical box identifies the field names of the data, and the blue rectangle points your attention to the various names by which Michael Jackson is known; for example, Wacko Jacko.

Now this is a result that most people have never seen. Googlers react to this in shock and disbelief because only a handful of Google’s more than 19,000 employees have substantive data about what the firm’s top scientists are doing at their jobs. I’ve learned that 18,500 Googlers “run the game plan”, a Google phrase that means “Do what MOMA tells you”. Google patent documents are important because Google has hundreds of US patent applications and patents, not thousands like IBM and Microsoft. Consequently, there is intent behind funding research, paying attorneys, and dealing with the chaotic baloney that is the specialty of the USPTO.

This patent document explains that Google has figured out how to create software machines that act intelligently. Now the machines–called “janitors”–are not perfect. So, instead of relying on a single approach, the patent document discloses that the system relies on several different types of machines. The “janitor” cleans up any ambiguities or other “messes” created by the sister processes. The “janitor” can try different techniques for performing clean up. If baffled, the janitor can seek “help” or other inputs from other parts of Google’s infrastructure. Keep in mind that this clean up operates without human intervention.

The clean up is important because in order to produce a dossier, information processed by Google must be stored in a consistent manner. Many companies perform such transformation and clean up. What’s different in Google’s approach is scale. My research indicates that Google’s ability to operate at scale is one of its key competitive advantages.
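The clean-up idea described above can be sketched in a few lines of Python. This is only my illustration of the general pattern (try several techniques in turn, ask for help if all of them fail, store the result in one consistent form). It is not the method in the patent document, and every name in it (`date_janitor`, `iso_date`, `long_date`, `ask_for_help`, `clean_facts`) is hypothetical.

```python
from datetime import datetime
from typing import Callable, List, Optional, Tuple

# Hypothetical sketch: none of these names come from the patent document.
Technique = Callable[[str], Optional[str]]


def iso_date(raw: str) -> Optional[str]:
    """Technique 1: accept values already written as YYYY-MM-DD."""
    try:
        return datetime.strptime(raw.strip(), "%Y-%m-%d").date().isoformat()
    except ValueError:
        return None


def long_date(raw: str) -> Optional[str]:
    """Technique 2: accept values written like 'August 29, 1958'."""
    try:
        return datetime.strptime(raw.strip(), "%B %d, %Y").date().isoformat()
    except ValueError:
        return None


def date_janitor(
    raw: str,
    techniques: List[Technique],
    ask_for_help: Optional[Technique] = None,
) -> Optional[str]:
    """Try each technique in order; if all fail, ask another component."""
    for technique in techniques:
        cleaned = technique(raw)
        if cleaned is not None:
            return cleaned
    if ask_for_help is not None:
        # Assumed stand-in for "seeking help" from another part of the system.
        return ask_for_help(raw)
    return None  # Still baffled: leave the value unresolved.


def clean_facts(facts: List[Tuple[str, str]]) -> List[Tuple[str, Optional[str]]]:
    """Store every date fact in one consistent form; pass other fields through."""
    techniques = [iso_date, long_date]
    cleaned = []
    for field, value in facts:
        if field == "date_of_birth":
            value = date_janitor(value, techniques)
        cleaned.append((field, value))
    return cleaned


raw_facts = [
    ("name", "Michael Jackson"),
    ("date_of_birth", "August 29, 1958"),
    ("date_of_birth", "1958-08-29"),
]
print(clean_facts(raw_facts))
```

Both date values come out in the same form, which is the point: a fielded dossier can only be assembled when facts gathered from different pages agree on how a value is written. A real system would need many more techniques and would have to run at a scale this toy ignores.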

The two interdependent components are:

A computing platform that can provide the computational capability or horsepower needed to store data, launch agents, manage iterative processes, and make the data available to other processes, systems, and users.
Software that is computationally intelligent so that the software can figure out how to handle ambiguous situations without having to have a human nearby to coddle the software agents.

I have to keep reminding myself that Google has a different view of its technology than some other companies do. I find it important to put Google’s technologies into this broader framework. What’s clear is that Google’s dossier (shown above) is just one of many applications of the janitor and his software friends.

The buzz about Chrome is greater than the interest in many other Google innovations. The question I want to raise in my guru session is, “Which is of greater significance to a person trying to understand Google’s competitive advantages: a method for displaying the Google technology or the Google systems and methods that are beneath the shiny, clickable surface?”

I don’t have an answer. I know the Googlers who accuse me of fabricating screenshots such as the dossier image above don’t know. So far, I have not had a chance to pop this question to Google’s own gurus. That fact alone proves that I am not a guru at all. I don’t know how smart Google’s software is, but I think it is capable of getting much smarter as the janitor and his helpers go about their business.

Stephen Arnold, September 16, 2008

Written by Stephen E. Arnold · Filed Under Database, Feature, Google, Online (general), Search, Technology, Text analytics, Text processing

Comments

3 Responses to “How Smart Is Google’s Software?”

TinEye Image Search Engine on September 17th, 2008 4:56 am

[…] Smart is Google’s Software http://arnoldit.com/wordpress/2008/09/17/how-smart-is-googles-software/ Share and […]
Bob Carpenter on September 17th, 2008 1:47 pm

Why would you expect a company of ad salespeople to know about entity linking? Jests aside, Google’s big enough, diverse enough, and distributed enough that it’s a microcosm of the entire online research world. They’re innovating in everything from demographic modeling to Ajax. How’s one techie supposed to keep up with all that? I certainly couldn’t keep up with everything Bell Labs did while I was there.

Just because you patent something doesn’t mean you use it. Or even that you have the rights to use it (it may depend on another patent which you don’t own). We patented stuff all the time at Bell Labs that never saw the light of production. The patent attorneys told us researchers that they wanted a “mine field” of patents to trap anyone entering into the speech or language processing space.

As with most patents, the claims in US20070198481 are like an onion, layering well-known techniques with patent-speak jargon. I work in this area and don’t see anything new in the claims, which is typical of patents in my area.

Our Phase II SBIR from the U.S. National Institutes of Health is in this same area — extracting facts and linking them to databases. You’ll see similar patents out of IBM, BBN, SRI, PowerSet and Microsoft (I really like Silviu Cucerzan and Eric Brill’s work in this direction; it’ll make you wonder why MS bought PowerSet). You also see whole companies, like Spock.com, built around this technology or something very much like it. The U.S. National Institute of Standards and Technology just ran its 2008 Automatic Content Extraction evals (ACE), which is essentially a bakeoff format for this kind of fact extraction and object linking technology; here are results from ACE 2007.
Stephen E. Arnold on September 17th, 2008 2:41 pm

Bob Carpenter,

Good post. Keep ’em coming.

Stephen Arnold, September 17, 2008


Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.