Data Management: A New Search Driver

November 4, 2008

Earlier today I reread “The Claremont Report on Database Research.” I had a few minutes and, recalling that I had read the document earlier this year, wanted to see if I had missed some of its key points. This report is a committee-written document prepared as part of an invitation-only conference focusing on databases. I follow the work of several of the people listed as authors of the report, for example, Michael Stonebraker and Hector Garcia-Molina.

One passage struck me as important on this reading of the document. On page 6, the report said:

The second challenge is to develop methods for effectively querying and deriving insight from the resulting sea of heterogeneous data…. keyword queries are just one entry point into data exploration, and there is a need for techniques that lead users into the most appropriate querying mechanism. Unlike previous work on information integration, the challenges here are that we do not assume we have semantic mappings for the data sources and we cannot assume that the domain of the query or the data sources is known. We need to develop algorithms for providing best-effort services on loosely integrated data. The system should provide some meaningful answers to queries with no need for any manual integration, and improve over time in a “pay-as-you-go” fashion as semantic relationships are discovered and refined. Developing index structures to support querying hybrid data is also a significant challenge. More generally, we need to develop new notions of correctness and consistency in order to provide metrics and to enable users or system designers to make cost/quality tradeoffs. We also need to develop the appropriate systems concepts around which to tie these functionalities.

Several thoughts crossed my mind as I thought about this passage; namely:

  1. The efforts by some vendors to make search a front end or interface for database queries are bringing this function to enterprise customers. The demonstrations by different vendors of business intelligence systems such as Microsoft Fast’s Active Warehouse or Attivio’s Active Intelligence Engine make it clear that search has morphed from key words to answers.
  2. The notion of “pay as you go” translates to smart software; that is, no humans needed. If a human is needed, that involvement is as a system developer. Once the software begins to run, it educates itself. So, pay as you go becomes a colloquial way to describe what some might have labeled “artificial intelligence” in the past. With data volume increasing, the notion of humans getting paid to touch the content recedes.
  3. Database quality in the commercial database sector could be measured by consistency and completeness. The idea that zip codes were consistent was more important than a zip code being accurate. With statistical procedures, the value in a cell may be filled in automatically, and it will include a score that shows the probability that the zip code is correct. Similarly, if one looks for the salary or mobile number of an individual, these probability scores become important guides to the user.
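
To make the third point concrete, here is a minimal Python sketch of a cell being filled by a statistical procedure and tagged with a probability score. It is an illustration only, not a method described in the Claremont Report or used by any vendor named here. The function name `impute_with_confidence`, the `score` field, and the sample records are all hypothetical, and the "statistics" are nothing more than a frequency count of the values already seen for the same city.

```python
from collections import Counter


def impute_with_confidence(records, key_field, target_field):
    """Fill missing target_field values with the most frequent value seen
    for the same key_field, and attach a rough probability score.

    Assumption for this sketch: the score is the share of observed values
    that agree with the chosen one. Values already present get 1.0, which
    means "taken as given", not "verified".
    """
    # Count the values observed for each key (for example, zip codes per city).
    observed = {}
    for rec in records:
        value = rec.get(target_field)
        if value is not None:
            observed.setdefault(rec[key_field], Counter())[value] += 1

    filled = []
    for rec in records:
        out = dict(rec)  # copy so the input records are not modified
        if out.get(target_field) is None:
            counts = observed.get(out[key_field])
            if counts:
                best, n = counts.most_common(1)[0]
                out[target_field] = best
                out["score"] = n / sum(counts.values())
            else:
                out["score"] = 0.0  # nothing to learn from, so leave the gap
        else:
            out["score"] = 1.0
        filled.append(out)
    return filled


if __name__ == "__main__":
    sample = [
        {"city": "Louisville", "zip": "40202"},
        {"city": "Louisville", "zip": "40202"},
        {"city": "Louisville", "zip": "40205"},
        {"city": "Louisville", "zip": None},
    ]
    for row in impute_with_confidence(sample, "city", "zip"):
        print(row)
```

The user sees a filled cell plus a number that says how far to trust it. No human touched the record, which is the "pay as you go" idea in miniature.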

[Image: ediscovery cost perception]

“Pay as you go” computing means that the most expensive functions in a data management method have costs reduced because humans are no longer needed to do the “knowledge work” required to winnow and select documents, facts, and information. The company able to implement “pay as you go” computing on a large scale will destabilize the existing database business sector. My research has identified Google as an organization employing research scientists who use the phrase “pay as you go” computing. Is this a coincidence or an indication that Google wants to leapfrog traditional database vendors in the enterprise?

In the last month, a number of companies have been kind enough to show me demonstrations of next-generation systems that take a query and generate a report. One system allows me to look at a sample screen, click a few options, and then begin my investigation by scanning a “trial report”. I located a sample Google report in a patent application that generates a dossier when the query is for an individual. That output goes an extra step and includes aliases used by the individual who is the subject of the query and a hot link to a map showing geolocations associated with that individual.

The number of companies offering products or advanced demonstrations of these functions means that the word search is going to be stretched even further than assisted navigation or alerts. The vendors who describe search as an interface for business intelligence are moving well beyond key word queries and the seemingly sophisticated interfaces widely available today.

Despite the economic pressures on organizations today, vendors pushing into data management for the purpose of delivering business intelligence will find customers. The problem will be finding a language in which to discuss these new functions and features. The word search may not be up to the task. The phrase business intelligence is similarly devalued for many applications. An interesting problem now confronts buyers, analysts, and vendors: “How can we describe our systems so people will understand that a revolution is taking place?”

The turgid writing in the Claremont Report is designed to keep the secret for the in-crowd. My hunch is that certain large organizations, possibly Google, are quite far along in this data management deployment. One risk is that some companies will be better at marketing than at deploying industrial-strength, next-generation data management systems. The nest might be fouled by great marketing not supported by equally robust technology. If this happens, the company that says little about its next-generation data management system might deploy the system, allow users to discover it, and thus carry the field without any significant sales and marketing effort.

Does anyone have an opinion on whether the “winner” in data management will be a start-up like Aster Data, a market leader like Oracle, or a Web search outfit like Google? Let me know.

Stephen Arnold, November 4, 2008


5 Responses to “Data Management: A New Search Driver”

  1. Kingsley Idehen on November 4th, 2008 7:47 am


    The whole concept of Data Access and Data Management is changing. We are moving to an era where the “Concept oriented Entity Model” becomes a concrete focal point for data access and data management.

    I write about the above in relation to the applicability of the RDF based Linked Data meme emerging out of what used to be known solely as “Semantic Web Technology”.

    Here are some links you may find interesting (in addition to my blog link above):

    1. – Lot of coverage of new DBMS technology frontier
    2. – a presentation I gave at Linked Data Planet (*remixed edition*)


  2. Stephen E. Arnold on November 4th, 2008 11:14 am

    Kingsley Idehen

    Thanks for your thoughtful post. Please, continue to share your viewpoint.

    Stephen Arnold, November 4, 2008

  3. Steve Wooledge on November 4th, 2008 12:57 pm


    It’s also interesting to see that innovations developed by search companies like Google are making their way into more traditional relational databases. The Claremont Report also highlights the importance of declarative programming constructs like MapReduce.

    Initially popularized by Google for unstructured data (and later used at companies like Yahoo! or Facebook via the open-sourced Hadoop), there is also tremendous power and flexibility for a whole new class of developers to get rich insight from large volumes of *structured* data.

    Here’s more detail:

    [1] The Aster Data Systems take on the Claremont report:

    [2] More about In-Database MapReduce and how it’s different from Hadoop, user-defined functions (UDFs), stored procedures, etc.:


  4. Stephen E. Arnold on November 4th, 2008 2:59 pm

    Steve Wooledge,

    Thanks for your thoughtful comment. Feel free to air your ideas here.

    Stephen Arnold, November 4, 2008

  5. Unpublished comment regarding beyond search blog on November 12th, 2008 6:34 am

    [...] So this is a Re-post of what I was thinking when reading his post from Nov 4, 2008: [...]