Good Enough Means Trouble for Commercial Database Publishers

May 28, 2008

I began work on my new Google monograph. (I’m loath to reveal any details because I just yesterday started work on this project.) I will be looking at Google’s data management inventions in an attempt to understand how Google is increasing its lead over search rivals like Microsoft and Yahoo while edging ever closer to providing data services to organizations choking on their digital information.

As part of that research, I came across several open source patent documents that explain how Google uses the outputs of several different models to determine a particular value. Last week a Googler saw my presentation, which featured an illustrative output from a Google patent application, and, in a Googley way, accused me of creating the graphic in Photoshop.

Sorry, chipper Googler, open source means that you can find this document yourself in Google if you know how to search. Google’s system is pretty useful for finding out information about Google even if Googlers don’t know how to use their own search system.

How does Google make it possible for my 86-year-old father to find information about the town in Brazil where we used to live and allow me to surface some of Google’s most closely-guarded secrets? These are questions worth considering. Most people focus on ad revenues and call it a day. Google’s a pretty slick operation, and ads are just part of the secret sauce’s ingredients.

Running Scenarios

In my experience, it's far more common for a team to use a single model and then run a range of scenarios. The high and low scenario outputs are discarded, and the remaining results are averaged. While not perfect, the approach yields a value that can be used as is or refined as more data become available to the system. Google's twist is that different models each generate an answer.
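The scenario approach above can be sketched in a few lines. This is a minimal illustration of the general technique, not anything drawn from a Google patent; the model and scenario values are invented for the example.

```python
def scenario_estimate(model, scenarios):
    """Run one model across a range of input scenarios, discard the
    highest and lowest outputs, and average what remains."""
    outputs = sorted(model(s) for s in scenarios)
    trimmed = outputs[1:-1]  # drop the extreme high and low results
    return sum(trimmed) / len(trimmed)

# Toy model: estimate a value from a growth-rate assumption.
model = lambda growth: 100 * (1 + growth)
print(scenario_estimate(model, [0.01, 0.03, 0.05, 0.10]))  # -> 104.0
```

The extremes are thrown away precisely because a single model fed an aggressive or pessimistic scenario can swing wildly; the middle of the range is "good enough."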

Incremental improvements pay off over time

This diagram shows how Google’s policy of incremental “learnings” allows one or more algorithms to become more intelligent over time.

The outputs of each model are mathematically combined with the other models' outputs. As I read the Google engineers' explanations, it appears that using multiple models generates "good enough" results, and it is possible, according to the patent document I am now analyzing, to replace models whose data are out of bounds.

The idea is that instead of swapping values, Google sets up a mechanism that allows models to be added, tested, and discarded. The patent documents reference other Google papers, patent filings, and the work of other researchers. The technique, like so many Google innovations, is well-known if you know where to look. The Big Idea echoes a comment made by Dr. Peter Norvig, the artificial intelligence wizard.
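A bare-bones sketch of that mechanism, assuming nothing beyond what the paragraphs above describe (the models, bounds, and values here are illustrative): models live in a plain list, so they can be added or discarded without touching the combiner, and any model whose output falls out of bounds is simply ignored.

```python
def combine(models, x, bounds):
    """Average the outputs of several models on input x, skipping any
    model whose output falls outside the acceptable bounds."""
    lo, hi = bounds
    valid = [y for y in (m(x) for m in models) if lo <= y <= hi]
    return sum(valid) / len(valid)

# Three toy models; the third produces out-of-bounds values.
models = [lambda x: x * 2, lambda x: x * 2.2, lambda x: x * 50]
print(combine(models, 10, bounds=(0, 100)))  # 20 and 22 averaged -> 21.0
```

No single model has to be right; the ensemble only has to be close enough, and a misbehaving model can be dropped from the list without redesigning the system.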

In 2006, he said in a lecture at the University of California-Berkeley, “We test. We test a lot.” In that same lecture, he referenced Google’s performance on a US government search test. As I recall the comment, he said, “We were able to run multiple tests each day. We tweaked and found out that testing yielded better results because we could learn from the iterations.”

So testing and tweaking thresholds are important. But there's another Google Big Idea. That insight is that using multiple algorithms and averaging their outputs yields better results than picking a single algorithm, stuffing different values through it, and averaging those results.

Obvious. No single algorithm has to hit a home run. The use of different mathematical procedures and numerical recipes provides outputs that, when combined, are "good enough". Combine the two approaches and the "good enough" solutions improve over time, assuming you have what Google calls "big data". As far as I know, Google has plenty of usage data, metadata, and process data to crunch. Big data gives Google a cushion on which to prop its systems.
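How might "good enough" improve over time? One crude stand-in for the incremental learning described above, entirely my own sketch and not from any patent document, is to reweight the models after each round of feedback: models whose last answer was closer to the observed value earn more weight in the next combination.

```python
def update_weights(weights, outputs, truth, lr=0.5):
    """Shift weight toward models whose output was closer to the
    observed value, then renormalize so the weights sum to one."""
    errors = [abs(o - truth) for o in outputs]
    scores = [w / (1 + lr * e) for w, e in zip(weights, errors)]
    total = sum(scores)
    return [s / total for s in scores]

# Three toy models with fixed outputs; the true value is 10.
w = [1 / 3, 1 / 3, 1 / 3]
for _ in range(3):  # three rounds of feedback
    w = update_weights(w, outputs=[9.0, 10.0, 14.0], truth=10)
print([round(x, 2) for x in w])  # -> [0.22, 0.75, 0.03]
```

With enough data flowing through the system, the accurate models dominate and the poor ones fade, without anyone hand-tuning the ensemble.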

What about High School Students?

On Tuesday, May 27, 2008, I read a thoughtful post by Christopher Dawson, "Google vs. Educational Databases". You can read the post here. Mr. Dawson, a teacher, points out that professional librarians usually guide students to commercial databases for research.

The important paragraph for me is this one:

I’m not saying that Google is better than solid collections of well-researched primary sources. For an awful lot of students, educational databases can be an extraordinary resource. However, I worry when teachers are too quick to discount the thrill of the chase and the real value of having the better part of human knowledge a few good search terms away. Now, more than ever, it is essential that we teach kids to wisely use the incredible variety of resources at their disposal, whether that means Googling until they hit that perfect article, browsing Wikipedia for references and ideas, or seeking out actual people who can help them wade through all of the chaff in search of the wheat.

What this means is that a generation, maybe more, of online users are becoming habituated to the use of Google. Believe me, habituation in online usage is a potent force. In the 1980s, ABI / INFORM had become a business information source that professional librarians conducting business research turned to each time information was needed. The result was a staggering profit from the 42,000 records we produced each year. Even today, many professional researchers respond positively to the “old” ABI / INFORM, now managed by the CSA info-conglomerate. After almost 30 years, online habits can be hard to change.

What’s this have to do with incremental improvements by Google? Plenty. Here’s why:

  1. Commercial databases don’t put significant amounts of information in one file. Commercial database producers like professional publishers scatter information across multiple files and products. The customer has to buy more. The burden of federating and deduplicating is part of the pain. Today’s users won’t tolerate the hassle of the traditional vendors.
  2. Commercial databases reside in proprietary systems. A user must learn how to find information in these systems. An average executive looking for information in Chemical Abstracts, Investext, or Compendex won’t have an easy time formulating a basic query and making sense out of what the system returns. The same applies to legal information systems. The learning curve, once appreciated as part of the mystique of online, is simply out of step with today’s user.
  3. Cost, cost, cost. Run a query on Lexis or Westlaw. Pay upwards of $100 whether or not the information is germane. In today's shrinking financial landscape, customers can't and won't pay usurious prices. Google has a different business model, and it means that a user doesn't pay to search. The advertiser does. With a commercial database, the user pays and pays.

Among the leaders in this market segment are the troubled ProQuest, the ghost-like giant Ebsco Electronic Publishing, and the almost stealthy Thomson and Reed Elsevier. Most researchers have an incomplete view of the stranglehold these companies have on certain information sectors in the United States.

In point of fact, Lexis and then Westlaw were among the pioneers in getting lawyers habituated to using these expensive, proprietary systems. The cheerful red Lexis terminal became the way to perform exhaustive legal research at the nation's law schools. When the lawyers passed the bar, it was easy for these individuals to slide into the for-fee world of LexisNexis. The habituation remains strong even today. Don Wilson, one of the founders of Lexis, used to joke with me about the addictive quality of ABI / INFORM and Lexis. We thought comparing online files to a cigarette habit was pretty funny.

Well, the tables may be turned. Mr. Dawson’s essay makes it clear to me that Google is hooking high school students on Google. When these folks find their way into professional life, the system of choice will be Google. Google’s sparse legal offerings will be richer and much better. Remember those incremental improvements engineered into the fabric of smart software.

Bottom Line

Here’s what’s ahead, and I know that anyone in the commercial database world will strongly disagree with me. My suggestion is to use the comments section and offer your arguments to me. Please, don’t have your Buffy and Scotty PR People call me for a briefing. I’m not going to listen to blather.

  1. Google’s push into schools like Mr. Dawson’s and universities like Arizona State will have consequences in three or four years. Today’s researcher and Gmail user becomes tomorrow’s buyer of Google enterprise services. In terms of search, you know that Google will be the preferred system. Commercial database systems will no longer be the first place to start research. These expensive, specialized files may be consulted under certain circumstances. Prices will have to rise, so usage will drop off. Growth ceases.
  2. Commercial database publishers are like red kangaroos in the headlights. The car is rushing toward the herd, but no one knows exactly what to do. The mortality rate will be high. The species may survive, but the habitat no longer makes life easy.
  3. No matter how lousy Google’s professional information services are at this point in time, these services will improve. That’s the point of the graphic in this article. Without significant competition, Google is running unchallenged. When a commercial database company buys a smaller competitor or when a traditional publisher jumps into business news, the actions are short-term tactics. These are purely defensive plays. Where’s the innovation? Exactly. There is none.

One day, “good enough” becomes the “best”. Google search has run the board. Can the same approach lead to dominance in educational research?

Related article: Slashdot: Large Web Host Urges Customers to Use Gmail

Stephen Arnold, May 28, 2008

