Lotsa Search at Yahoo!

February 3, 2008

Microsoft's hostile takeover of Yahoo! did not surprise me. Rumors about Micro-hoo or Ya-soft have floated around for a couple of years. I want to steer clear of the newsy part of this takeover, ignore the share-pumping behind the idea that Mr. Murdoch will step in to buy Yahoo, and sidestep Yahoo's 11th-hour "we're not sure we want to sell" Web log posting.

I prefer to do what might be called a "catalog of search engines," a meaningless exercise roughly equivalent to Homer's listing of ships in The Iliad. Scholars are still arguing about why he included the information, and centuries later they continue to puzzle over who these guys were and why such an odd collection of vessels was necessary. You may have a similar question about Yahoo's search fleet after you peruse this short list of Yahoo "findability" systems:

  • InQuira. This is the Yahoo natural language customer support system. InQuira was formed from three smaller search outfits that ran aground. InQuira seems stable, and it provides NLP systems for customer support functions. Try it. Navigate to Yahoo. Click Help and ask a question, for example, "How do I cancel my premium mail account?" Good luck, but you have an opportunity to work with an "intelligent" agent who won't tell you how to cancel a for-fee Yahoo service. When I learned of this deal, I asked, "Why don't you just use Inktomi's engine for this?" I didn't get an answer. I don't feel too bad. Google treats me the same way.
  • Inktomi. Yahoo bought this Internet indexing company in 2002. We used the Inktomi system for the original US government search service, FirstGov.gov (now USA.gov). The system worked reasonably well, but once in the Yahooligans' hands, not much was done with it, and Inktomi began showing its age. In 2002, Google was motoring, just drawing even with Yahoo. Yahoo seemed indifferent to, or unaware of, the fact that search had more potential than its portal approach.
  • Stata Labs. When Gmail entered semi-permanent beta, it offered two key features: one gigabyte of storage and the ability to search your mail. Yahoo couldn't search email at all. The fix was to buy Stata Labs in 2004. When you use the Yahoo mail search function, the Stata system does the work. Again I asked, "Why not use one of your Yahoo search systems to search mail?" Again, no response.
  • Fast Search & Transfer. Yahoo, through the acquisition of Overture, ended up with the AllTheWeb.com Web site. The spidering and search technology are operated by Fast Search & Transfer (the same outfit that Microsoft bought for $1.2 billion in January 2008). Yahoo trumpeted the “see results as you type feature” in 2007, maybe 2006. The idea was that as you key your query, the system shows you results matching what you have typed. I find this function distracting, but you may love it. Try it yourself here. I heard that Yahoo has outsourced some data center functions to Fast Search & Transfer, which, if true, contradicts some of the pundits who assert that Yahoo has its data center infrastructure well in hand. If so, why lean on Fast Search & Transfer?
  • Overture. When Yahoo acquired Overture (the original pay-for-traffic service) in 2003, it got the ad service and the Overture search engine. Overture purchased AllTheWeb.com and ad technology from Fast Search & Transfer. When Yahoo bought Overture, Yahoo inherited Overture’s Sun Microsystems’ servers with some Linux boxes running a home brew fraud detection service, the original Overture search system, and the AllTheWeb.com site. Yahoo still uses the Overture search system when you look for key words to buy. You can try it here. (Note: Google was “inspired” by the Overture system, and paid about $1.2 billion to Yahoo to avoid a messy lawsuit about its “inspiration” prior to the Google IPO in 2004. Yahoo seemed happy with the money and did little to impede Google.)
  • Delicious. Yahoo bought Delicious in 2005. Delicious came with its weird URL and its own search engine. If you have tried it, you know that it can return results with some latency. Even when it does respond quickly, I find it difficult to locate Web sites that I have seen. As far as I know, the Delicious system still uses the original Delicious search engine. You can try it here.
  • Flickr. Yahoo bought Flickr in 2005, another cog in its social, Web 2.0 thing. The Flickr search engine runs on MySQL. At one trade show, I heard that the Flickr infrastructure and its search system were a “problem”. Scaling was tough. Based on the sketchy information I have about Yahoo’s search strategy, Flickr search is essentially the same as it was when it was purchased and is in need of refurbishing.
  • Mindset. Yahoo, like Google and Microsoft, has a research and development group. You can read about their work on the recently redesigned Web site here. If you want to try Mindset, navigate to Yahoo Research and slide the controls. I’ve run some tests, and I think that Mindset is better than the “regular” Yahoo search, but it seems unchanged over the last six or seven months.

I’m going to stop my listing of Yahoo’s search systems, although I could continue with the Personals search, Groups search, News search, and more. I may comment on AltaVista.com, another oar in Yahoo’s search vessel, but that’s a topic that requires more space than I have in this essay. And I won’t beat up on Yahoo Shopping search. If I were a Yahoo merchant, I would be hopping mad. I can’t figure out how to limit my query to just Yahoo merchants. The results pages are duplicative and no longer useful to me. Yahoo has 500 million “users” but Web statistics are mushy. Yahoo must be doing something right as it continues to drift with the breeze as a variant of America Online.

In my research for my studies and journal articles, I don’t recall coming across a discussion of Yahoo’s many different search systems. No one, it seems, has noticed that Yahoo lacks an integrated, coherent approach to search. I know I’m not the only person who has observed that Yahoo cannot mount a significant challenge to Google.

Yahoo was Google's most capable competitor, yet it stayed out of the race. It baffles me that a sophisticated, hip, with-it Silicon Valley outfit like Yahoo collected different search systems the way my grandmother coveted weird dwarf figurines. My grandmother never did much with her collection, and I may have to conclude that Yahoo hasn't done much with its collection of search systems either. The cost of licensing, maintaining, and upgrading a fleet of search systems is not trivial. What baffles me is why on earth Yahoo couldn't index its own email. Why couldn't Yahoo use one of its own search systems to index Delicious bookmarks and Flickr photos? Why does Yahoo have a track record of operating search systems in silos, making it difficult to rationalize costs and simplify technical problems?

Compared to Yahoo, Google has its destroyer shipshape — if you call squishy purple pillows, dinosaur bones, and a keen desire to hire every math geek with an IQ of 165 on the planet "shipshape". But Yahoo is still looking for the wharf. As Google churned past, Yahoo watched it sail without headwinds to the horizon. Over the years, I've been in chit-chats with some Yahoo wizards. Let me share my impressions without using the wizards' names:

  1. Yahoo believes that its generalized approach is correct, even as Google made search the killer app of cloud computing. Yahoo's very smart people seem to live in a different dimension.
  2. Yahoo believes that its technology is superior to Google's and Microsoft's. When I asked about a Google innovation, Yahoo's senior technologist told me that Yahoo had "surprises for Google." I think the surprise was the hostile takeover bid last week.
  3. Yahoo sees its future in social, Web 2.0 services. To prove this, Yahoo hired economists and other social scientists. While Yahoo was recruiting, the company muffed the Facebook deal and let Yahoo 360 run aground. Yo, Yahoo, Google is inherently social. PageRank is based on human clicks and human-created Web pages. Google’s been social since Day One.

To bring this listing of Yahoo search triremes (ancient wooden war ships) to a close, I am not sure Microsoft, if it is able to acquire Yahoo, can integrate the fleet of search systems. I don't think Mr. Murdoch can, given the MySpace glitches. Fixing the flotilla of systems at Yahoo will be expensive and time consuming. The catch is that time is running out. Yahoo appears to me to be operating on pre-Internet time. Without major changes, Yahoo will be remembered for its many search systems, leaving pundits and academics to wonder where they came from and why. Maybe these investigators will use Google to find the answer? I know I would.

Stephen Arnold, February 3, 2008

Search Frustration: 1980 and 2008

February 2, 2008

I have received two telephone calls and several emails about user satisfaction with search. The people reaching out to me did not disagree that users are often frustrated with these systems. I read the contacts as amplifications of just how complex "getting search right" is.

Instead of falling back on bell curves, standard deviations, and more exotic ways to think about populations, let’s go back in time. I want to then jump back to the present, offer some general observations, and then conclude with several of my opinions expressed as “observations”. I don’t mind push back. My purpose is to set forth facts as I understand them and stimulate discussion.

I’m quite a fan of Thucydides. If you have dipped into his sometimes stream-of-consciousness approach to history, you know that after a few hundred pages the hapless protagonists and antagonists just keep repeating their mistakes. Finally, after decades of running around the hamster wheel, resolution is achieved by exhaustion.

My hope is that with regard to search we arrive at a solution without slumping into torpor.

The Past: 1980

A database named ABI / INFORM (pronounced as three separate letters ay-bee-eye followed by the word inform) was a great online success. Its salad days are gone, but for one brief shining moment, it was white hot.

The idea for ABI (abstracted business information) originated at a university business school, maybe Wisconsin but I can’t recall. It was purchased by my friend Dennis Auld and his partner Greg Payne. There was another fellow involved early on, but I can’t dredge his name up this morning.

The database summarized and indexed journals containing information about business and management. Human SMEs (subject matter experts) read each article and wrote a 125-word synopsis. The SMEs paid particular attention to making the abstract meaty; that is, a person could read the abstract and get the gist of the argument and garner the two or three key “facts” in the source article. (Systems today perform automatic summarization, so the SMEs are out of a job.)

ABI / INFORM was designed to allow a busy person to ingest the contents of a particular journal like the Harvard Business Review quickly, or collect some abstracts on a topic such as ESOPs (Employee Stock Ownership Plans) and learn quickly what was in the "literature" (a fancy word for current management thinking and research on a subject).

Our SMEs would write their abstracts on special forms that looked a lot like a 5″ by 8″ note card (about the amount of text on a single IBM mainframe green screen input form). SMEs would also enter the name of the author or authors, the title of the article, the source journal, and the standard bibliographic data taught in the 7th grade.

SMEs would also consult a printed list of controlled terms. A sample of a controlled term list appears below. Today, these controlled term lists are often called knowledge bases. For anyone my age, a list of words is pretty much a list of words. Flashy terminology doesn’t always make points easier to understand, which will be a theme of this essay.

Early in the production cycle, the index and abstract for each article would be typed twice: once by an SME on a typewriter and then by a data entry operator into a dumb terminal. This type of information manufacturing reflected the crude, expensive systems available a quarter century ago. Once the data had been keyed into a computer system, it was in digital form, proofed, and sent via eight-track tape to a timesharing company. We generated revenue by distributing the ABI / INFORM records via Dialog Information Services, SDC Orbit, BRS, and other systems. (Perhaps I will go into more detail about these early online "players" in another post.) Our customers used the timesharing service to "search" ABI / INFORM. We split the money with the timesharing company and generally ended up with the short end of the stick.

Below is an example of the ABI / INFORM controlled vocabulary:

[Image: snippet of the ABI / INFORM controlled vocabulary]

There were about 15,000 terms in the vocabulary. If you look closely, you will see that some terms are marked "rt" and "uf". These are "related terms" and "use for" terms. The idea was that a person assigning index terms would be able to select a general term like "market shares" and see that the related terms "competition" and "market erosion" would provide pertinent information. The "uf" or "use for" reminded the indexer that "share of market" was not the preferred index term. Our vocabulary could also be used by a customer or user, whom we called a "searcher" in 1980.
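If you want a concrete picture of what those "rt" and "uf" links amount to, here is a minimal sketch in Python. The handful of terms and the data structure are my own illustrative assumptions, not the actual ABI / INFORM file, but the mechanics of mapping a variant to its preferred term and surfacing related terms are the same.

```python
# Illustrative sketch only: a tiny, invented slice of a controlled vocabulary
# showing how "use for" (uf) and "related term" (rt) links can be modeled.

VOCABULARY = {
    "market shares": {
        "rt": ["competition", "market erosion"],  # related terms worth checking
        "uf": ["share of market"],                # variants mapped to this term
    },
    "competition": {"rt": ["market shares"], "uf": []},
}

# Build a lookup so a non-preferred variant resolves to the preferred term.
USE_FOR = {
    variant: preferred
    for preferred, entry in VOCABULARY.items()
    for variant in entry["uf"]
}

def normalize(term: str) -> str:
    """Map a candidate index term to its preferred controlled form."""
    return USE_FOR.get(term, term)

def related(term: str) -> list:
    """Return related terms an indexer or searcher might also consider."""
    return VOCABULARY.get(normalize(term), {}).get("rt", [])

print(normalize("share of market"))  # -> market shares
print(related("share of market"))    # -> ['competition', 'market erosion']
```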

A person searching for information in the ABI / INFORM file (database) of business abstracts could use these terms to locate precisely the information desired. You may have heard the terms precision and recall used by search engine and content processing vendors. The idea originated with the need to allow users (then called searchers) to narrow results; that is, make them more precise. There was also a need to allow a user (searcher) to get more results if the first result set contained too few hits or did not have the information the user (searcher) wanted.
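For readers who have not run into precision and recall outside vendor brochures, the arithmetic is simple. Here is a toy example; the document identifiers are invented, and the numbers are only there to show how the two measures are computed.

```python
# Toy illustration of precision and recall; identifiers are invented.

retrieved = {"doc1", "doc2", "doc3", "doc4"}  # what the query returned
relevant = {"doc2", "doc4", "doc7", "doc9"}   # what the searcher actually needed

hits = retrieved & relevant
precision = len(hits) / len(retrieved)  # how much of the result list is on point
recall = len(hits) / len(relevant)      # how much of the pertinent material was found

print(f"precision = {precision:.2f}, recall = {recall:.2f}")  # 0.50 and 0.50
```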

To address this problem, we created classification codes and assigned these to the ABI / INFORM records as well. As a point of fact, ABI / INFORM was one of the first, if not the first, commercial database to reindex every record in its database, manually assigning six to eight index terms and classification codes to each record as part of a quality assurance project.

When we undertook this time-consuming and expensive job, we had to use SMEs. The business terminology proved to be so slippery that our primitive automatic indexing and search-and-replace programs introduced too many indexing red herrings. My early experience with machine indexing, and my having to turn financial cartwheels to pay for the manual rework, have made me suspicious of vendors pushing automated systems, especially for business content. Business content indexing remains challenging, eclipsed only by processing email and Web log entries. Scientific, technical, and medical content is tricky but quite a bit less complicated than general business content. (Again, that's a subject for another Web log posting.)

Our solution to broadening a query was to make it possible for the SME indexing business abstracts to use a numerical code to indicate a general area of business (for example, marketing) and then use specific values to indicate a slightly narrower sub-category. The idea was that the controlled vocabulary was precise and narrow, while the classification codes were broader and sub-divided into useful sub-categories. A snippet of the ABI / INFORM classification codes appears below:

[Image: snippet of the ABI / INFORM classification codes for 7000 Marketing]

If you look at these entries for the classification code 7000 Marketing, you will see terms such as "sn". That's a scope note, and it tells the indexer and the user (searcher) specific information about the code. You also see the "cd". That means "code description". A "code description" provides specific guidance on when and how to use the classification code, in this case "7000 Marketing".

Notice too that the code "7100 Market research" is a sub-category of 7000 Marketing. The idea is that while 7000 Marketing is broad and appropriate for general articles about marketing, the sub-category allows the indexer or user to identify articles about "Market research." While "Market research" is still fairly broad, it sits in a useful middle ground between the very broad classification code 7000 Marketing and the very specific terminology of the controlled vocabulary. We also had controlled term lists for geography (what today is called "geo spatial coding"), document type codes, and other specialized index categories. These are important facets of the overall indexing scheme, but not germane to the point I want to make about user satisfaction with search and content processing systems.
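Before moving on, here is a small sketch of how those broad and narrow codes can drive retrieval. The codes 7000 and 7100 come from the discussion above; the sample records and the prefix-matching logic are assumptions I have made purely for illustration (they also hint at why the cc=71? style of query mentioned later in this essay works).

```python
# Sketch of broad vs. narrow retrieval with classification codes.
# Codes 7000/7100 come from the text; records and logic are invented.

CLASSIFICATION = {
    "7000": "Marketing",
    "7100": "Market research",
}

RECORDS = [
    {"id": 1, "title": "Branding trends", "codes": ["7000"]},
    {"id": 2, "title": "Survey design for buyer panels", "codes": ["7100"]},
    {"id": 3, "title": "ESOPs and employee morale", "codes": ["6400"]},
]

def search_by_code(prefix: str):
    """Return records whose classification codes start with the given prefix.
    A short prefix ("7") behaves like the broad Marketing family; a longer
    prefix ("71") narrows the set to Market research."""
    return [r for r in RECORDS if any(c.startswith(prefix) for c in r["codes"])]

print([r["id"] for r in search_by_code("7")])   # broad query -> [1, 2]
print([r["id"] for r in search_by_code("71")])  # narrower query -> [2]
```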

Let's step back. Humans created abstracts of journal articles. Humans then completed bibliographic entries for each selected article. Then an SME would index the abstracts, selecting terms according to his or her judgment and the editorial policy inherent in the controlled term lists. These index terms became the building blocks for locating a specific article among hundreds of thousands, or for identifying the subset of ABI / INFORM articles directly on point to the topic on which the user wanted information.

The ABI / INFORM controlled vocabulary was used at commercial organizations to index internal documents or what we would today call “behind-the-firewall content.” One customer was IBM. Another was the Royal Bank of Canada. The need for a controlled vocabulary such as ABI / INFORM’s is rooted in the nature of business terminology. When business people speak, jargon creeps into almost every message. On top of that, new terms are coined for old concepts. For example, you don’t participate in a buzz group today. You participate in a focus group. Now you know why I am such a critic of the baloney used by search and content processing vendors. Making up words (neologisms) or misappropriating a word with a specific meaning (semantic, for example) and then gluing that word with another word with a reasonably clear meaning (processing, for example) creates the jargon semantic processing. Now I ask you, “Who knows what the heck that means?” I don’t, and that’s the core problem of business information. The language is slippery, fast moving, jargon-riddled, and fuzzy.

Appreciate that creating the ABI / INFORM controlled vocabulary, capturing the editorial policy in those lists, and then applying them consistently to what was then the world’s largest index to business and management thought was a big job. Everyone working on the project was exhausted after two years of researching, analyzing, and discussing. What made me particularly proud of the entire Courier-Journal team (organized by the time we finished into a separate database unit called Data Courier) was that library and information science courses used ABI / INFORM as a reference document. At Catholic University in Washington, DC, the entire vocabulary was used as a text book for an advanced information science class. Even today, ABI / INFORM’s controlled vocabulary stands as an example of:

  1. The complexity of creating useful, meaningful knowledge bases
  2. Proof that it is possible to index content so that it can be sliced and diced with few "false drops" or what we today call "irrelevant hits"
  3. Evidence that a difficult domain such as business can be organized and made more accessible via good indexing

Now here’s the kicker, actually a knife in the heart to me and the entire ABI / INFORM team. We did user satisfaction surveys on our customers before the reindexing job and then after the reindexing job. But our users (searchers) did not use the controlled terms. Users (searchers) keyed one or two terms, hit the Enter key, and used what the system spit out.

Before the work, two-thirds of the people we polled who were known users of ABI / INFORM said our indexing was unsatisfactory. After the work, two-thirds of the people we polled who were known users of ABI / INFORM said our indexing was unsatisfactory. In short, bad indexing sucked, and better indexing sucked. User behavior was responsible for the dissatisfaction, and even today, who dares tell a user (searcher) that he or she can't search worth a darn?

Every so often over the past 28 years, I've thought about these two benchmark studies performed by the Courier-Journal. Here's what I have concluded:

  1. Inherent in the search and retrieval business is frustration with finding the information a particular user needs. This is neither a flaw in the human nor a flaw in the indexing. Users come to a database looking for information. Most of the time — two thirds to be exact — the experience disappoints.
  2. Investing person years of effort in constructing an almost-perfect epistemological construct in the form of controlled vocabularies is a great intellectual exercise. It just doesn't pay huge dividends. Users (searchers) flounder around and get "good enough" information, which results in the general dissatisfaction with search.
  3. As long as humans are involved, it is unlikely that the satisfaction scores will improve dramatically. Users (searchers) don’t want to work hard to formulate queries or don’t know how to formulate queries that deliver what’s needed. Humans aren’t going to change at least in my lifetime or what’s left of it.

What’s this mean?

Simply stated, algorithmic processes and the use of sophisticated mathematical procedures will deliver better results.

The Present: 2008

In my new study Beyond Search, I have not included much history. The reason is that today most procurement teams looking to improve an existing search system or replace one system with another want to know what’s available and what works.

The vendors of search and content processing systems have mastered the basics of key word indexing. Many have integrated entity extraction and classification functions into their content processing engines. Some have developed processes that look at documents, paragraphs, sentences, and phrases for clues to the meaning of a document.

Armed with these metatags (what I call index terms), the vendors can display the content in point-and-click interfaces. A query returns a result list, and the system also displays Use For references or what vendors call facets, hooks, or adjacent terms. The naked “search box” is surrounded with “rich interfaces”.
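To make the mechanics concrete, here is a minimal sketch of how index terms attached to processed documents can be rolled up into the facet counts shown beside a result list. The field names and sample documents are my own assumptions, not any vendor's schema.

```python
# Sketch: rolling metatags (index terms) up into facet counts for display.
# Field names and documents are invented for illustration.

from collections import Counter

RESULTS = [
    {"title": "Q3 sales review", "metatags": ["marketing", "forecasting"]},
    {"title": "Buyer panel notes", "metatags": ["marketing", "market research"]},
    {"title": "Plant safety memo", "metatags": ["operations"]},
]

def facet_counts(results):
    """Count how many hits carry each index term; shown as clickable facets."""
    counts = Counter()
    for doc in results:
        counts.update(doc["metatags"])
    return counts

for term, count in facet_counts(RESULTS).most_common():
    print(f"{term} ({count})")  # e.g. marketing (2), forecasting (1), ...
```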

You know what?

Survey the users and you will find two-thirds of the users dissatisfied with the system to some degree. Users overestimate their ability and expertise in finding information. Many managers are too lazy to dig into results to find the most germane information. Search has become a “good enough” process for most users.

Rigorous search is still practiced by specialists like pharmaceutical company researchers and lawyers paid to turn over every stone in hopes of getting the client off the legal hook. But for most online users in commercial organizations, search is not practiced with diligence and thoroughness.

In May 2007, I mentioned in a talk at an iBreakfast seminar that Google had an invention called "I'm feeling doubly lucky." The idea is that Google can look at a user's profile (compiled automatically by the Googleplex), monitor the user's location and movement via a geo spatial function in the user's mobile device, and automatically formulate a query to retrieve information that may be needed by the user. So, if the user is known to be a business traveler and the geo spatial data plot his course toward La Guardia Airport, then the Google system will push to the user's phone information about which parking lot has space and whether the user's flight is late. The key point is that the user doesn't have to do anything but go on about his / her life. This is "I'm feeling doubly lucky" because it raises the convenience level of the "I'm feeling lucky" button on Google pages today. Press I'm Feeling Lucky and the system shows you the one best hit as defined by Google's algorithmic factory. Some details of this invention appear in my September 2007 study, Google Version 2.0.

I’m convinced that automatic, implicit searching is the direction that search must go. Bear in mind that I really believe in controlled vocabularies, carefully crafted queries, and comprehensive review of results lists. But I’m a realist. Systems have to do most of the work for a user. When users have to do the searches themselves or at least most of the work, their level of dissatisfaction will remain high. The dissatisfaction is not with the controlled vocabulary, the indexing, or the particular search system. The dissatisfaction is with the work associated with finding and using the information. I think that most users are happy with the first page or first two or three results. These are good enough or at least assuage the user’s conscience sufficiently to make a decision.

The future, therefore, is going to be dominated by systems that automate, analyze, and predict what the mythical “average” user wants. These results will then be automatically refined based on what the system knows about a particular user’s wants and needs. The user profile becomes the “narrowing” function for a necessarily broad set of results.

Systems can automatically "push" information to users or at least keep it in a cache ready for near-zero latency delivery. In an enterprise, search must be hooked into work flow. The searches must be run for the user and the results displayed to the user. If not displayed automatically, the user need only click a hot link and the needed information appears. A user can override an automatic system, but I'm not sure most users would do it or care, if the override were like a knob on a hotel's air conditioner. You feel better turning the knob. You feel a loss of control if you can't turn the knob.
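Here is a minimal sketch of the "profile as narrowing function" idea described above: a deliberately broad result set is re-scored against what the system knows about one user. The topics, profile weights, and scoring rule are assumptions made up for the example, not any vendor's actual method.

```python
# Sketch: personalizing a broad result set with a user profile.
# Topics, weights, and scoring rule are invented for illustration.

BROAD_RESULTS = [
    {"title": "Airport parking status", "topics": {"travel"}},
    {"title": "Flight delay bulletin", "topics": {"travel", "alerts"}},
    {"title": "Local restaurant reviews", "topics": {"dining"}},
]

USER_PROFILE = {"travel": 0.9, "alerts": 0.7, "dining": 0.1}  # learned interests

def personalize(results, profile):
    """Order broad results by how well their topics match the user's interests."""
    def score(doc):
        return sum(profile.get(topic, 0.0) for topic in doc["topics"])
    return sorted(results, key=score, reverse=True)

for doc in personalize(BROAD_RESULTS, USER_PROFILE):
    print(doc["title"])  # travel items float to the top for this user
```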

Observations

Let me offer several observations after this journey back in time and a look at the future of search and content processing. If you are easily upset, grab your antacid, because here we go:

  1. The razzle-dazzle about taxonomies, ontologies, and company-specific controlled term lists hides the fact that specific terms have to be identified and used to automatically index documents and information objects found in behind-the-firewall search systems. Today, these terms can be generated by processing a representative sample of existing documents produced by the organization. The key is a good-enough term list, not doing what was done 25 years ago. Keep in mind the phrase "good enough." There are companies that offer software systems to make this list generation easier. You can read about some vendors in Beyond Search, or you can do a search on Google, Live.com, or Yahoo.
  2. Users will never be satisfied. So before you dump your existing search system because of user dissatisfaction, you may want to get some other ammunition, preferably cost and uptime data. “Opinion” data are almost useless because no system will test better than another in my experience.
  3. Don't believe the business jargon thrown at you by vendors. Inherent in business itself is a tendency to create a foggy understanding. I think the tendency to throw baloney has been around since the first caveman offered to trade a super-sharp flint for a tasty banana. The flint is not just sharp; it's like a Gillette four-track razor. The banana is not just good; it is mouth-watering, by implication a great banana. You have to invest time, effort, energy, and money in figuring out which search or content processing system is appropriate for your organization. This means head-to-head bake-offs. Few do this, and the results are clear. Most people are unhappy with their vendor, with search, and with the "information problem".
  4. Background processes, agent-based automatic searching, and mechanisms that watch your information needs and actions will make search better. You can enter ss cc=71? AND ud=9999 to get recent material about market research, but most people don't and won't. A sketch of the agent-based alternative appears after this list.
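Here is that sketch: a background agent that formulates the old Dialog-style search statement on the user's behalf. The command string mirrors the ss cc=71? AND ud=9999 example above; the profile structure, scheduling, and delivery details are assumptions invented for illustration.

```python
# Sketch: an agent that builds and runs standing searches for a user.
# The query syntax echoes the example above; everything else is invented.

def build_query(code_prefix: str, update_date: str) -> str:
    """Compose a search statement: classification code prefix plus update date."""
    return f"ss cc={code_prefix}? AND ud={update_date}"

def run_for_user(profile: dict) -> None:
    """Run the user's standing searches in the background and queue the results."""
    for interest in profile["interests"]:
        query = build_query(interest["cc"], profile["freshness"])
        print(f"Running for {profile['name']}: {query}")
        # ...submit to the search system, cache the hits, push a hot link...

run_for_user({
    "name": "example user",
    "freshness": "9999",          # "most recent" in the old syntax
    "interests": [{"cc": "71"}],  # market research, per the codes above
})
```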

In closing, keep these observations in mind when trying to figure out what vendors are really squabbling about. I'm not sure they themselves know. When you listen to a sales pitch, are the vendors saying the same thing? The answer is, "Yes." You have to rise to the occasion and figure out the differences between systems yourself. I guarantee the vendors don't know, and if they do know, they sure won't tell you.

Stephen Arnold, February 2, 2008

Search Saber Rattling

February 1, 2008

The Washington Post, January 31, 2008, ran a story “Google Slams Autonomy over Enterprise Search Claims.” The subtitle was, “Google Says Autonomy’s White Paper Contains ‘Significant Inaccuracies’ about its Search Appliance.”

The gist of the story, as I understand it, is that Autonomy wrote a white paper. The white paper contains assertions that the Google Search Appliance is not as good as Autonomy’s search engine. The Autonomy white paper is here. Google’s response is here.

What’s a White Paper?

For those of you not familiar with the lingo of high-tech marketing, a white paper is an essay, usually three or four pages to 50 pages or more. The author, usually an “expert”, opines on a particular topic, including facts, assertions, data from “objective tests”, and other sources. The idea is that a white paper presents information that supports an argument. If you want to immerse yourself in white papers, navigate to Bitpipe, and sign up. The young founders created a treasure trove of these documents after a stint at the Thomson Corporation. White papers, based on my experience, are among the favorite reads of graduate students in far-off places. Bitpipe reports heavy usage of their repository. My test a couple of years ago revealed zero substantive leads from a white paper about behind-the-firewall search. My hunch is that these documents occupy 20-something public relations experts, their superiors, and, of course, the hundreds of graduate students looking for information. Maybe some legitimate buyers order up several million dollars worth of computer gear after reading a white paper, but I think the white papers’ impact might be more limited; for example, competitors seem to read one another’s white papers. I scan them, but mostly I focus on the specifications (if any are included) and the technical diagrams (also rare as hen’s teeth).

I keep a collection of white papers published by the 52 search and content processing companies I track. I don’t want to dig into the Autonomy white papers or the mind-numbing complexities of the Google essays here.

The majority of white papers are like sonnets in the 16th century. There's a convention, and vendors follow the convention. The individual white papers are formulaic. Most argue that the sponsor's or author's product is not just good but really very good.

About half the white papers take implicit or explicit swipes at competitors’ products. I’m not sure these swipes are harmless, but a white paper is not going to have the impact of a story by Walt Mossberg in Rupert Murdoch’s Wall Street Journal. Furthermore, the writing of a white paper is not going to get anyone a Ph.D. or even a high grade in a first-year writing class.

The objective of a white paper is to make a sale or help a fence-sitting prospect to make the “right” decision. The sponsor or author of the white paper wants to paint a clear picture of one product. The competitors’ products are so-so. White papers are usually free, but to download one, you may have to register. You become a sales lead.

Why the Fuss?

I understand the frustration of search vendors who find their product or service criticized. Search systems are terribly complex, generally not well understood by their licensees, and almost always deemed “disappointing” by their users. Marketers can suggest and imply remarkable features of their employers’ search systems. Hyperbole sells in some situations.

I've made reference to a major study we conducted in 2007. The data suggested that two-thirds of a behind-the-firewall search system's users were dissatisfied or somewhat dissatisfied with their search engine. It didn't seem to make much difference whose system the respondent had in mind. The negative brush painted a broad swath across the best and brightest vendors in the search marketplace. In December, I learned from a colleague in Paris, France, that she found similar results in her studies of search system satisfaction.

To lay my cards on the table, I don’t like any search system all that much. You can read more about search and content processing warts in my new study, Beyond Search, available in April 2008. Task-master Frank Gilbane has me on schedule. I assume that’s one reason his firm has a good reputation. In Beyond Search, I discuss what makes people unhappy when they use commercial search systems, and I also offer some new information about fixing a broken system.

Dust Up: Not the World Wrestling Federation

The recent dust up reported by the Washington Post and dozens of other outlets is that Autonomy and Google are shaking their PR swords at one another. I find that amusing because almost no one outside a handful of specialists has the foggiest idea what makes each company's system work. I recall a fierce argument about Spenser's Faerie Queene. I don't think anyone knew what the distinguished combatants were talking about.

Autonomy

IDOL stands for Integrated Data Operating Layer. The Autonomy approach is to put a "framework" for information applications into an organization. The licensee uses the IDOL framework to acquire, process, and make available information. You can run a search, and you can process video, identify data anomalies, and output visual reports. The system, when properly configured and resourced, is amazing. Autonomy has thousands of customers, and based on the open source intelligence available to me, most are happy. You can read more about the Autonomy IDOL system at www.autonomy.com. There's a long discussion of the IDOL framework in all four editions of the Enterprise Search Report; I authored the report from 2003 to 2006, and some of my thoughts linger in the 4th edition.

Google Search Appliance

The GSA is a server or servers with a Google search system pre-installed. This is a “search toaster,” purpose built for quick deployment. You can also use a GSA for a Web site search, but the Google Custom Search Engine can do that job for free. The GSA comes with an API called the “One Box API”. In my research for the first three editions of the Enterprise Search Report, I kept readers up to date on the evolution of the Google Search Appliance. My assessment in the first edition of ESR was that GSA was outstanding for Web site search and acceptable for certain types of behind-the-firewall requirements. Then in editions two and three of ESR, I reported on the improvements Google was making to the GSA. The Google wasn’t churning out new versions every few months, but it was making both incremental and significant improvements. Import filters improved. The GSA became more adept with behind-the-firewall security. With each upgrade to the GSA, it was evident that Google was making improvements.

When the One box API came along maybe a year or two ago, the GSA morphed from an okay solution into a search stallion for the savvy licensee. Today’s GSA and One box API duo are hampered by a Googley, non-directive sales plan. Google, in a sense, is not in a hurry. Competitors are.

Differences between Autonomy IDOL and GSA

Autonomy has a solid track record of knowing what the “next big thing in search” will be. The company’s top management seem to be psychic. There was “portal in a box”. Very successful. Great timing. There was Kenjin (remember that?), a desktop search application. Industry-leading and ahead of its time. And there was IDOL itself. Autonomy invented the notion of an information operating platform. Other vendors like Fast Search & Transfer jumped on the Autonomy idea with enthusiasm. Now most search vendors offer a “platform” or a “framework”.

Let’s look at some differences in the two competitors’ systems:

  • Platform. Autonomy IDOL: on-premises installation using the licensee's infrastructure. GSA (One Box API): servers available in different configurations.
  • Deployment. Autonomy IDOL: custom installation. GSA: toaster approach; plug in the boxes.
  • Features. Autonomy IDOL: myriad, mostly snap in with mild customization. GSA: code your own via the One Box API.
  • Support. Autonomy IDOL: direct, partners, and third parties not affiliated with Autonomy. GSA: about 40 partners provide support, customization, etc.

Autonomy IDOL is a “classic” approach to enterprise systems. A licensee can mix and match features, customize almost every facet of the system, and build new applications on IDOL. IDOL runs on a range of operating systems. IDOL includes snazzy visualization and report services. The licensee has one responsibility — ensuring that the resources required by the system are appropriate. With appropriate resources, Autonomy is a very good content processing and search system.

Google's approach is quite different. The GSA is a plug-and-play solution. A licensee can do some customization via style sheets and the GSA's administrative utility. But for really interesting implementations, the licensee or one of the three dozen partners has to roll up their sleeves and write code. Google wants its customers "to get it."

Neither IDOL nor GSA is objectively better than the other; both can do an outstanding job of search. Both can deliver disappointing results. Remember, none of the hundreds of search and content processing systems is able to please most of the users most of the time.

Full Circle

This brings me back to the white paper dust up. I don’t think that dueling news releases will do much to change who buys what. What I think is going on is pretty obvious. Let me make my view clear:

Google's GSA is gaining traction in the behind-the-firewall search segment. I heard that the enterprise unit had tallied more than 8,500 GSA sales as of December 31, 2007. Information about the One Box API is beginning to diffuse among the technical crowd. Google's own PR and marketing is — how can I phrase it? — non-directive. Google is not in-your-face when it comes to sales. Customers have to chase Google. But this white paper affair suggests that Google may have to change. Autonomy knows how to toss white paper grenades at the Google. The Google has some grenades to toss at Autonomy. For example, you can dip into this Google white paper: Gereon Frahling, "Algorithm für dynamische geometrische Datenströme" (Algorithms for Dynamic Geometric Data Streams), Ausgezeichnete Informatikdissertationen 2006.

What the Saber Rattling Does Reveal

Google, despite being Googley, is no longer the elephant in the search engine playhouse that no one sees or talks about. I also learned that:

  1. The battle for search mind share is no longer between and among the traditional on-premises players like Endeca, Exalead, and ISYS Search Software. Autonomy has certified Google as the competitor for 2008.
  2. Google, judging by its response, is beginning to realize it needs some additional marketing chutzpah.
  3. The media will continue to toss logs on the fire in the best tradition of "If it bleeds, it leads" journalism.
  4. Prospects now have reason to equate Autonomy and Google.

Let me be 100 percent clear. No search system is perfect. None of the marketing is particularly useful to potential licensees, who remain confused about which system does what. Some vendors still refuse to make a price list available. (Now that's a great idea when selling to the U.S. government.) And none of the currently shipping systems will deliver the customer satisfaction scores Rolls-Royce enjoys.

Search is difficult. Its value proposition is fuzzy and hard to demonstrate. The resources available to a licensee make more difference than the specific system deployed. A so-so search system can deliver great results when the users' requirements are met, the system doesn't time out, and the needed content has been indexed. Any commercial search engine will flat out fail when the licensee doesn't have the expertise, money, or patience to resource the search system.

Want to know which vendor has the "best" system? Get 300 identical servers, load the same content on each, and let your users run queries. Ask the users which system is "best". Know what? No one does this. Search is, therefore, a he-said, she-said business. The fix? Do head-to-head bake-offs. Decide for yourself. Don't let vendors do your thinking for you.

Stephen Arnold, February 1, 2008
