Data Management: A New Search Driver

November 4, 2008

Earlier today I reread “The Claremont Report on Database Research.” I had a few minutes, I recalled reading the document earlier this year, and I wanted to see if I had missed some of its key points. The report is a committee-written document prepared as part of an invitation-only conference focusing on databases. I follow the work of several of the people listed as authors of the report; for example, Michael Stonebraker and Hector Garcia-Molina.

One passage struck me as important on this reading of the document. On page 6, the report said:

The second challenge is to develop methods for effectively querying and deriving insight from the resulting sea of heterogeneous data…. keyword queries are just one entry point into data exploration, and there is a need for techniques that lead users into the most appropriate querying mechanism. Unlike previous work on information integration, the challenges here are that we do not assume we have semantic mappings for the data sources and we cannot assume that the domain of the query or the data sources is known. We need to develop algorithms for providing best-effort services on loosely integrated data. The system should provide some meaningful answers to queries with no need for any manual integration, and improve over time in a “pay-as-you-go” fashion as semantic relationships are discovered and refined. Developing index structures to support querying hybrid data is also a significant challenge. More generally, we need to develop new notions of correctness and consistency in order to provide metrics and to enable users or system designers to make cost/quality tradeoffs. We also need to develop the appropriate systems concepts around which to tie these functionalities.

Several thoughts crossed my mind as I thought about this passage; namely:

  1. The efforts by some vendors to make search a front end or interface for database queries are bringing this function to enterprise customers. The demonstrations by different vendors of business intelligence systems such as Microsoft Fast’s Active Warehouse or Attivio’s Active Intelligence Engine make it clear that search has morphed from key words to answers.
  2. The notion of “pay as you go” translates to smart software; that is, no humans needed. If a human is needed, that involvement is as a system developer. Once the software begins to run, it educates itself. So, pay as you go becomes a colloquial way to describe what some might have labeled “artificial intelligence” in the past. With data volume increasing, the notion of humans getting paid to touch the content recedes.
  3. Database quality in the commercial database sector could be measured by consistency and completeness. The idea that zip codes were consistent was more important than a zip code being accurate. With statistical procedures, the value in a cell may be filled in, and it will include a score that shows the probability that the zip code is correct. Similarly, if one looks for the salary or mobile number of an individual, these probability scores become important guides to the user.
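The third point above can be made concrete with a small sketch. A minimal, hypothetical way to attach a confidence score to an imputed zip code is to measure how often that zip co-occurs with the record’s city in a reference table; all names and data below are invented for illustration, not drawn from any vendor’s actual method.

```python
# Hypothetical sketch: score an imputed zip code by how often it
# co-occurs with the record's city in a toy reference table.
from collections import Counter

# Toy reference table of (city, zip) observations.
observations = [
    ("Louisville", "40202"), ("Louisville", "40202"),
    ("Louisville", "40059"), ("Harrods Creek", "40027"),
    ("Harrods Creek", "40027"), ("Harrods Creek", "40059"),
]

counts = Counter(observations)
city_totals = Counter(city for city, _ in observations)

def zip_confidence(city: str, zip_code: str) -> float:
    """Estimate the probability that zip_code is correct for city."""
    total = city_totals[city]
    if total == 0:
        return 0.0
    return counts[(city, zip_code)] / total

print(zip_confidence("Harrods Creek", "40027"))  # 2 of 3 observations
```

A real system would use far richer evidence than a co-occurrence table, but the output is the same shape: a filled cell plus a score the user can weigh.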


“Pay as you go” computing means that the most expensive functions in a data management method have costs reduced because humans are no longer needed to do the “knowledge work” required to winnow and select documents, facts, and information. The company able to implement “pay as you go” computing on a large scale will destabilize the existing database business sector. My research has identified Google as an organization employing research scientists who use the phrase “pay as you go” computing. Is this a coincidence or an indication that Google wants to leapfrog traditional database vendors in the enterprise?

In the last month, a number of companies have been kind enough to show me demonstrations of next generation systems that take a query and generate a report. One system allows me to look at a sample screen, click a few options, and then begin my investigation by scanning a “trial report”. I located a sample Google report in a patent application that generates a dossier when the query is for an individual. That output goes an extra step and includes aliases used by the individual who is the subject of the query and a hot link to a map showing geolocations associated with that individual.

The number of companies offering products or advanced demonstrations of these functions means that the word “search” is going to be stretched even further than assisted navigation or alerts. The vendors who describe search as an interface for business intelligence are moving well beyond key word queries and the seemingly sophisticated interfaces widely available today.

Despite the economic pressures on organizations today, vendors pushing into data management for the purpose of delivering business intelligence will find customers. The problem will be finding a language in which to discuss these new functions and features. The word “search” may not be up to the task. The phrase “business intelligence” is similarly devalued for many applications. An interesting problem now confronts buyers, analysts, and vendors: “How can we describe our systems so people will understand that a revolution is taking place?”

The turgid writing in the Claremont Report is designed to keep the secret for the in-crowd. My hunch is that certain large organizations–possibly Google–are quite far along in this data management deployment. One risk is that some companies will be better at marketing than at deploying industrial strength next generation data management systems. The nest might be fouled by great marketing not supported by equally robust technology. If this happens, the company that says little about its next generation data management system might deploy the system, allow users to discover it, and thus carry the field without any significant sales and marketing effort.

Does anyone have an opinion on whether the “winner” in data management will be a start up like Aster Data, a market leader like Oracle, or a Web search outfit like Google? Let me know.

Stephen Arnold, November 4, 2008

InfoBright: Open Source Shifts into a Higher Gear

November 4, 2008

I like the InfoBright technology. After cutting a deal earlier this summer, the company is on the move. You can read about the InfoBright-Talend tie up here. Without getting into the details, which you can read about on each company’s Web site, the payoff is easier data transfer to and from the InfoBright system. The second announcement is the firm’s new connector for Pentaho. If you are not familiar with Pentaho, click here. Pentaho offers ETL, or extraction, transformation, and loading services. The net of these two announcements is to make it easier for those wanting to use the InfoBright rough set technology to make that move. Rough sets are interesting methods, and their use can deliver both improved performance on some data operations and useful new insights.
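For readers unfamiliar with rough sets, the core idea is small enough to sketch: a target set of rows is bracketed by a lower approximation (rows certainly in the set) and an upper approximation (rows possibly in the set), given which rows are indistinguishable on the chosen attributes. This is a minimal illustration of the general technique, not of InfoBright’s actual implementation; the data is invented.

```python
# Illustrative sketch of the rough-set idea: bracket a target set with
# lower (certain) and upper (possible) approximations, based on which
# objects are indiscernible under the chosen attributes.

def approximations(universe, attrs, target):
    """Return (lower, upper) approximations of target under attrs."""
    # Group objects that look identical on the chosen attributes.
    classes = {}
    for obj, desc in universe.items():
        key = tuple(desc[a] for a in attrs)
        classes.setdefault(key, set()).add(obj)
    lower, upper = set(), set()
    for cls in classes.values():
        if cls <= target:      # class wholly inside target: certain members
            lower |= cls
        if cls & target:       # class overlaps target: possible members
            upper |= cls
    return lower, upper

universe = {
    "r1": {"region": "east", "size": "big"},
    "r2": {"region": "east", "size": "big"},
    "r3": {"region": "west", "size": "big"},
    "r4": {"region": "west", "size": "small"},
}
target = {"r1", "r3"}  # rows of interest
lower, upper = approximations(universe, ["region", "size"], target)
# r1 and r2 are indiscernible, so neither is certain; r3 stands alone.
print(sorted(lower), sorted(upper))
```

The gap between the two approximations is exactly the uncertainty the method surfaces, which is where the “useful new insights” come from.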

Stephen Arnold, November 4, 2008

OrcaTec’s Truevert: Green and Semantic Search

November 4, 2008

Truevert, created by OrcaTec, just released a beta version of its vertical semantic search engine that returns “green”-topic results. The program pulls its results from the body of Yahoo!-spidered pages, and it interprets results based on context within the documents rather than keyword counts and popularity of hits. The goal is to focus on environmentally conscious answers. The engine will also “learn” words; no tagging or taxonomy required.

Truevert is also touted to work in any language. We ran some test queries. “Batteries” returned attbat.com, a site with eco-friendly batteries, as the top result. In German (“Batterien”), the first result was GRS, a company that deals with recycling batteries. But in Spanish (“bateria”), the top listing was a Wikipedia entry for a type of Brazilian musical group.

The company wants feedback; send it to feedback@truevert.com.

Jessica Bratcher, November 4, 2008

Google Pressures Yellow Page Sector

November 4, 2008

When I was in France, the hotel Internet connection failed. There was no paper book of business listings. The print directories–hereinafter called yellow pages–are no longer available at my hotel. In Harrods Creek, I get a number of weird yellow page publications. I receive a listing of businesses in the east end of Louisville, but I don’t look at it. I think I have a listing of minority owned businesses as well. I also recall seeing a silver yellow pages. The idea for that directory was that I could find businesses that wanted to work with people my age.

My newsreader delivered to me a story which, if true, spells trouble for anyone in the yellow pages business not working with Google. I may be misreading this news story in the Sydney Morning Herald (Australia) here, so you can double check me. “Sensis Concedes Defeat to Google” by Asher Moses is a story that told me Google nosed into the yellow page business, disrupted it, and ended up with a deal that Sensis (an Australian directory publisher owned by the big telco Telstra) had to take or run out of oxygen.

Mr. Moses wrote:

Telstra’s Sensis has given up on competing with Google in online search and mapping, announcing today it would provide its Yellow business listings to Google Maps and abandon its own search engine for one powered by Google. From the first quarter of next year, all of Yellow’s business listings – the most comprehensive directory in Australia – will be stored in Google Maps.

Will Google have the same success in North America? In my new Google and Publishing monograph for Infonortics, I explain how Google is building its own directory of businesses and providing free coupons to help the merchants get traffic. Years ago I worked on the USWest Yellow Pages’ project. One fact I recall was that a surprising number of yellow page advertisers don’t like the yellow pages.

In my experience, when there is a potent free service that is better than the existing service, the better service will disrupt existing business processes and then supplant them. So, if Google squeezed Sensis’ owner to get its way, Google will probably find the same pattern repeating itself in other markets. In short, the GOOG will become the yellow page champion. It’s just a matter of time. Regulators have a tough time understanding what Google does or why any single action is a problem. Bananas, I think, are a monoculture. That seems like no problem, until something goes wrong with the one type of banana. I have a list of other business sectors at risk for a Sensis-type play by the GOOG. No one seems to care. The Google is just so darn fun.

Stephen Arnold, November 4, 2008

Google and the Washington Post’s Use of the Word Monopoly

November 4, 2008

Does anyone care about the library market? There are some specialist firms, mostly stuck in the sub-$1.0 billion a year basement. These firms have been aggregated by Thomson and Reed Elsevier to create $7.0 to $9.0 billion revenue streams, but these outfits have been challenged to grow rapidly and find new markets. There is the leveraged Cambridge Scientific Abstracts, a company hoping that Google acquires it, giving the owner a big payday. And there is the dwindling number of highly specialized firms trying to survive in a world where library budgets are no longer automatically increased.

The Washington Post story by James Gibson, “Google’s New Monopoly,” leapfrogs over the shallow analyses of Google “doing good”. Mr. Gibson goes to the heart of Google’s deal with publishers for book scanning. He wrote here:

By settling the case, Google has made it much more difficult for others to compete with its Book Search service. Of course, Google was already in a dominant position because few companies have the resources to scan all those millions of books. But even fewer have the additional funds needed to pay fees to all those copyright owners. The licenses are essentially a barrier to entry, and it’s possible that only Google will be able to surmount that barrier. Sure, Google now has to share its profits with publishers. But when a company has no competitors, there are plenty of profits to share.

I find this interesting and potentially troublesome for Google for three reasons:

  1. The Washington Post editors knowingly characterized the book scanning, optical character recognition, and the other bits and pieces of this Google operation as a monopoly. The way I read the word “new”, the Post editors accept as common knowledge Google’s possession of at least one other monopoly, maybe more.
  2. The companies in the library world are likely to face the grim prospect of Google picking off information domains one by one. Google already has a patent service, which is bad news for some vendors. Maybe Derwent and Questel will be okay, but the pressure will mount for smaller fish. Google for its part is probably unaware of the library ecosystem that it will disrupt, but the disruption has begun for Ebsco and HW Wilson unless these firms can innovate and pump revenues quickly.
  3. Traditional library vendors have largely failed to keep pace with technology and consistently priced their products so that most people wanting information cannot access these data directly. In Harrods Creek, I have to drive to the public library in downtown Louisville to access some information resources. Google is going to make more and more of this high value information available to me and others directly. Whether ad supported or on a subscription basis, I will buy from Google.

The bottom line, therefore, is that if one wants to make a case that Google is on the path to more than two monopolies, the Washington Post makes that story clear in my opinion. Frankly, none of the traditional information vendors can slow or impede Google. Google won’t have to buy all of the companies, but it may buy one or two to get some expertise and certain content domains. For the rest of this small, but important industry, the writing is on the wall. Change or be pushed into the lumber room.

Stephen Arnold, November 4, 2008

Azure: Wit and Optimism

November 3, 2008

I enjoy the Register. The addled goose wishes he were British so his edge has that Pythonesque slant. The article “What Ray Ozzie Didn’t Tell You about Microsoft Azure” is enjoyable and informative. It contains a wonderfully terse description of Amazon’s and Google’s cloud services. The Google description is quite tasty with the spice of shifting the job of figuring out how to use it from Google to the user. Nice point. You must read the full article here. Despite the wit, the write up makes several significant points:

  1. The cloud is a ragout, not a cohesive system or strategy.
  2. A developer has to deal with a large number of components.
  3. Microsoft might pull this off because Ray Ozzie has worked in this area for a long time.

My view is more conservative. I want to see Microsoft deliver before I get too excited or put odds on Microsoft’s chances for success. Google’s been working on its cloud services for a decade, and those are not without their faults. I couldn’t access my Gmail for a time, then my “ig” page was dead too. Whether you agree with me or the Register is secondary to the useful information in its write up.

Stephen Arnold, November 3, 2008

Azure as Manhattan Project

November 3, 2008

I usually find myself in agreement with Dan Farber’s analyses. I generally agree with his “Microsoft’s Manhattan Project” write up here. Please, read his article, because I can be more skeptical about Microsoft’s ability to follow through with some of its technical assertions. It is easy for a Microsoft executive to say that software will perform a function. It is quite a different thing to ship software that actually performs it. Mr. Farber is inclined to see Microsoft’s statements and demos about Microsoft Azure as commitment. He wrote:

Microsoft’s cloud computing efforts have gotten off to a slow start compared with competitors, and it’s on the scale of a Manhattan Project for Windows. Azure is in pre-beta and who knows how it will turn out or whether consumers and companies will adopt it with enough volume to keep Microsoft’s business model and market share intact. But there is no turning back and Microsoft has finally legitimized Office in the cloud.

My take is similar, but there is an important difference between what Microsoft is setting out to do and what Google and Salesforce.com, among others, have done. Specifically, Google and Salesforce.com have developed new applications to run in a cloud environment. Google has many innovations, including MapReduce, and Salesforce.com has its multi-tenant architecture.

Microsoft’s effort will, in part, involve moving existing applications to the cloud. I think this is going to be an interesting exercise. Some of these cloud-targeted applications, like SharePoint, have their share of problems. Other applications do not integrate well in on-premises installations, so those hiccups have to be calmed.

The big difference is that Azure’s approach may prove more difficult than starting from ground zero, as Google and other Microsoft competitors did. Unfortunately, time is not on Microsoft’s side. Microsoft also has the friction imposed by the bureaucracy of a $60.0 billion company. Agility and complexity may combine to pose some big challenges for the Azure Manhattan Project. The Manhattan Project was complex but focused on one thing. Microsoft’s Azure by definition has to focus on protecting legacy applications, annuity revenue, and existing functions in a new environment. That’s a big and possibly impossible job to get right on a timeline of a year and a half.

Stephen Arnold, November 3, 2008

Another View of the Search Market

November 3, 2008

I missed this September 27, 2008, analysis in Intelligent Enterprise. I am not surprised that my trusty correspondents did not forward the link to me. You must read “Enterprise Search: Microsoft, Google, Specialized Players Vie for Supremacy” by Andrew Conry-Murray here. The article was interesting because it comes at a subject near and dear to my heart in a way that I would not have anticipated. This is a five part opus, so plan to spend some time analyzing the write up’s structure and assertions.

The thesis makes one key assumption; namely, that enterprise search is alive and kicking and that it is a viable business sector for the hundreds of companies touting their search systems. First, Mr. Conry-Murray uses a segmentation developed by Information Week. That’s okay, but I am not certain it is 100 percent in line with my analysis of this complicated, confused, conflicted sector. Second, the article pops from finding stuff to finding stuff under the umbrella of eDiscovery. The leap doesn’t resonate with me, and it does not make much sense. eDiscovery can exist along with multiple search systems, and it involves some different issues than searching for stuff without threat of a fine or jail time. Think spoliation. Third, I was exposed to “the 17 databases problem”. Now, next generation data management systems can cope with heterogeneous types of structured and unstructured data. I quite like Google’s dataspace approach, and the Exalead system works like a champ as well. Mark Logic and others are in this horse race too. I could list more vendors, but I don’t want to rehash my profiles in my Beyond Search study published by The Gilbane Group in April 2008. Finally, I learned about expert search.

I am not going to be able to recycle much of this article. Nor will I reference it in my lectures next week. What I learned is that a person who “reads up” about search and talks to some people can identify some of the issues. What’s missing is context. I do quite like the frequency with which the “beyond” preposition is turning up. There’s a “beyond Google” seminar. Even Attivio uses the “beyond” word in its newest white paper.

Here’s an interesting exercise. Navigate to Google. Run a query for “beyond search”. Start there.

Stephen Arnold, November 3, 2008

Exalead: Voice to Text

November 3, 2008

A happy quack to the stylish Parisian who alerted me to Exalead’s voice to text demonstration. To use the service, navigate to http://labs.exalead.com or click here. I entered several test queries and looked at the quality of the ASCII. I was impressed. I was able to get useful hits on my trusty query “bush and iraq”. My Google queries worked well too. Keep in mind that the system has processed a chunk of audio and video. The voices in the files are converted, indexed, and made searchable. One nifty feature is that if a video contains several references to the query term, an icon on the play bar allowed me to jump from relevant comment to relevant comment. No more serial listening to talking heads. Two happy quacks for the Exalead engineers who worked on this demo. Several other nice touches warrant highlighting:

  1. The system can parse a query such as ‘show me videos about iraq’
  2. Entities are automatically extracted and displayed in a side bar for assisted navigation
  3. A tab allows you to limit your query to audio, video, video on demand, or the entire suite of content.
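The jump-from-comment-to-comment feature described above implies that the speech-to-text output is stored as timestamped segments, so a query can return every offset where the term occurs and a player can skip between hits. Here is a minimal sketch of that idea; the transcript data and function names are invented for illustration, not taken from Exalead’s system.

```python
# Hedged sketch: store a transcript as (start_time, text) segments and
# return the start times of every segment matching a query, so a player
# could jump from hit to hit instead of listening serially.

segments = [
    (0.0, "good evening and welcome to the news"),
    (42.5, "the situation in iraq remains tense"),
    (97.0, "sports scores after the break"),
    (130.2, "more on iraq from our correspondent"),
]

def hit_offsets(query: str, segments):
    """Return start times (seconds) of segments containing the query."""
    q = query.lower()
    return [start for start, text in segments if q in text.lower()]

print(hit_offsets("iraq", segments))  # [42.5, 130.2]
```

Each returned offset would map to one of the icons on the play bar; seeking the player to that offset is all the “jump” requires.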

For me, the most useful feature was the ability to click the ‘text’ link and see the transcribed text of the news show. Here’s a snippet of the machine converted and transcribed text:

the big apple behind the turntable strolling down the house makes tonight in chicago is craig alexander find your way to the bone bloomer whom you’ve gone and only together since the first of the year the brian james van by achieving their goal of crafting and plain old b. s. rock and roll the show tonight is that the hurricane in kansas city that’s a for tonight’s live music on the east coast air midwest for a look at what’s gone down monday night in the south boston that soars southern music reporter john spellman

My recommendation to Exalead is to start processing more content. I would love to have a transcript of the Google lecture series. A collection of security podcasts would be really useful. I don’t like to listen to 50 minutes of lousy audio to find one or two useful chunks of information.

I usually try to remind the French that folks from Kentucky know how to cook chicken correctly. None of that coq au vin stuff. We use lard and whatever is growing behind the compost heap. But in this case, I won’t make any reference to cuisine. I will just say, “Voice to text… well done.”

Stephen Arnold, November 3, 2008 from somewhere in Europe

Attivio Scientist on the Changes in Enterprise Search

November 3, 2008

If you are looking for a summary of some of the changes forced upon vendors of enterprise search systems, you will enjoy “Beyond Findability: The Search for Active Intelligence.” The article is the work of Jonathan Young, an engineer and inventor at Attivio. I interviewed Attivio’s founder Ali Riaz here in May 2008. I have noticed that the word “beyond” is becoming shorthand for explaining that there is more needed than key word retrieval, Google, or business intelligence that requires a PhD in statistics to figure out.

“Beyond” also signals dissatisfaction with most search, content processing, and text analysis systems. The point, according to Mr. Young is that some innovations like semantic search impose significant burdens on an organization in machine resources and time.

The fix, he asserts here, is to provide “unified information access.” The idea is to make it easy to access data in a database table with unstructured information in an email. Mr. Young asserts:

The good news is that the predicted convergence of the database and search worlds is leading to some significant improvements in the search experience. As we move beyond the search box (the “user interface of last resort”), enterprise search solutions are beginning to support many different search modalities, including exploratory search, information discovery, and information synthesis. Navigation solutions are multiplying. Faceted search is already commonplace at major e-commerce sites.

If you want to move “beyond findability”, you can get a glimpse of the future by looking at the Attivio Web site. Has the future of search finally arrived? I think the newer systems are moving in the right direction, and progress is being made. I am encouraged that vendors are adapting to user needs. Even more interesting to me is that Mr. Young hits upon many of the themes that I have been addressing in this Web log, making “beyond” a candidate for a code word to mean that traditional key word retrieval is not enough for today’s knowledge worker.

Stephen Arnold, November 2, 2008
