Search: The Three Curves of Despair

March 27, 2008

For my 2005 seminar series “Search: How to Deliver Useful Results within Budget”, I created a series of three line charts. One of the well-kept secrets about behind-the-firewall search is that costs are difficult, if not impossible, to control. That presentation is not available on my Web site archive, and I’m not sure I have a copy of the PowerPoint deck at hand. I did locate the Excel sheet for the chart which appears below. I thought it might be useful to discuss the data briefly and admittedly in an incomplete way. (I sell information for a living, so I instinctively hold some back to keep the wolves from my log cabin’s door here in rural Kentucky.)

Let me be direct: Well-dressed MBAs and sallow financial mavens simply don’t believe my search cost data.

At my age, I’m used to this type of uninformed skepticism or derisory denial. The information technology professionals attending my lectures usually smirk the way I once did as a callow nerd. Their reaction is understandable. And I support myself by my wits. When these superstars lose their jobs, my flabby self is unscathed. My children are grown. The domicile is safe from creditors. I’m offering information, not re-jigging inflated egos.

Now scan these three curves.

[Figure: "The Search Curves" — three line charts showing precision / recall, complexity, and cost over time]

© Stephen E. Arnold, 2002-2008.

You see a gray line. That is the precision / recall curve. Precision is a method of determining whether a query returns results germane to the user's need; recall is a method of figuring out how much germane information the search system missed. Search and a categorical affirmative such as "all" do not make happy bedfellows. Most folks don't know what a search system does not include. Could that be one reason why the "curves of despair" evoke snickers of disbelief?
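To make the gray line concrete, here is a minimal sketch of how precision and recall are computed for a single query. The document sets below are invented for illustration; only the two ratios reflect the standard definitions.

```python
# Toy precision / recall calculation for one query.
# "returned" is what the search system delivered; "relevant" is the set
# a human judged germane. Both sets are hypothetical examples.

def precision_recall(returned: set, relevant: set) -> tuple[float, float]:
    """Precision: share of returned results that are germane.
    Recall: share of germane documents the system actually found."""
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

returned = {"doc1", "doc2", "doc3", "doc4", "doc5"}
relevant = {"doc2", "doc4", "doc6", "doc7"}

p, r = precision_recall(returned, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.40 recall=0.50
```

Note what recall exposes: two germane documents (doc6, doc7) were never returned at all, and the user has no way of knowing that from the result list.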

There is a complexity curve. As more search and content processing functions are stuffed into an information retrieval system, the system becomes more complicated. Complex, in the sense in which I use the word, embraces two ideas. First, today's search system is not a single entity. The search system is a fruit basket containing different digital goodies. Vendors provide you with a crawler, a transformation component, an entity extractor, and dozens of other digital functions. You don't have to be a math wizard to realize that when several systems interact, numerous points of failure exist. Add another function such as part-of-speech extraction, and the search system bumps up the complexity curve. Toss in a third-party tool, and the search system noses upwards. Second, when the search system is made available to users, life gets a great deal more complicated, and quickly. A research demo is very simple and, therefore, not representative of the real world. You know about human-generated complexity. Think about the number of actions required to get your kid from home to soccer practice or to the school dance.
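The "numerous points of failure" remark can be made arithmetic. A rough sketch, under the simplifying assumption that each component works independently with the same probability, shows how quickly a pipeline's overall reliability erodes as components are added (the 0.99 figure is invented for illustration):

```python
# Hypothetical illustration: if each component of a search pipeline works
# independently with probability p, the whole pipeline works with
# probability p ** n. Adding components drags reliability down fast.

def pipeline_reliability(p_component: float, n_components: int) -> float:
    return p_component ** n_components

# Three components (crawler, transformer, extractor), then richer stacks.
for n in (3, 6, 12):
    print(n, round(pipeline_reliability(0.99, n), 3))
# 3 0.97
# 6 0.941
# 12 0.886
```

Even with each piece 99 percent reliable, a dozen interacting pieces fail more than one time in ten. Real components are neither independent nor equally reliable, so this sketch, if anything, understates the problem.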

The third curve is the thermometer for the financial "heat" generated as the other two curves move through time. Notice that I have used the x-axis (the horizontal line, or, as I tell the Boards of Directors with whom I speak, "the line that goes from your left to your right") as a series of dates. A glance at the time marks on the x-axis and at the three curves reveals that costs go up over time. Notice that the gray line, the precision / recall curve, does not go up. In fact, this line reflects data from my research showing that precision / recall has stalled in the 80 to 90 percent range. Without some major work, it's unlikely that precision / recall will get much better if vendors continue to layer processes on key word systems. (No, I won't identify the way out of this dead end in this essay. I will tell you that this curve makes clear why search will be a big pay day as we move forward in time.)

Note that the cost curve rises more sharply than the complexity curve. Logically, if a system gets more complex, costs should rise in step with that complexity. Wrong. Costs actually rise more rapidly even though precision / recall doesn’t improve. This is one of the most important yet frequently ignored facts in the behind-the-firewall search business.
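The relationship among the three curves can be sketched as a toy model. The growth rates below are invented placeholders, not Arnold's data; the point is only the shape: complexity grows roughly linearly, cost compounds, and precision / recall flatlines in the 80 to 90 percent band.

```python
# Toy model of the three curves of despair. All numbers are illustrative
# assumptions, not measured data.

for i, year in enumerate(range(2002, 2009)):
    complexity = 1.0 + 0.5 * i                      # linear: components added yearly
    cost = 1.4 ** i                                 # superlinear: integration and support compound
    pr = min(0.80 + 0.02 * i, 0.90)                 # precision / recall stalls
    print(f"{year}: complexity={complexity:.1f} cost={cost:.1f} P/R={pr:.2f}")
```

A budget planner who extrapolates the complexity line will be surprised by the cost line every single year, which is the chart's whole message.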

The Curves of Despair

These curves are bad news for anyone working out next year's search-and-retrieval budget using simple linear assumptions. Search costs can exponentiate, often without warning. My data suggest that you will not have sufficient money if you rely on "rules of thumb" for budgeting. Let me hit the four major reasons, and beg you to keep in mind that I have provided more detail in Beyond Search, my new study to be published by the Gilbane Group in early April. This is a Web log, not the whole data enchilada.

First, the complexity of today's search systems translates to expensive troubleshooting when the system goes off the rails. How often does this happen? More often than most budgets anticipate. The system crashes, and those involved spend money like there is no tomorrow. Ah, but there is a tomorrow.

Second, a noisy user, maybe marketing or a dour senior vice president, demands a feature. No problem, as long as someone knows how to implement that feature without triggering an unexpected side effect such as a crash or failed index updates.

Third, the IT department’s “We have the hardware and expertise to handle the load” proves to be baloney. IT people are not usually search people. If they are search people and really good, these folks are going to be working at Google and not your trade association, government agency, or mid-market insurance company. Armed only with ignorance of what search requires, you suddenly run out of something–storage, bandwidth, computational cycles, or RAM. The fix is to call people like me, pay the vendor, or get a hardware vendor to help you throw hardware at the problem. Sounds familiar, doesn’t it?

Finally, you or one of your colleagues has learned about computers and software via experimentation. Electrical engineers and math majors are successful because they hack. These gals don't read manuals about systems. These folks solve problems by tinkering, relying on their innate brilliance to back out of a problem, and trying new attacks. The problem is that fiddling with a complicated system like search and content processing can kill the system. Search and content processing vendors have slapped graphical interfaces on their administrative systems. Google has locked down the administrative options for its search appliance. Sadly, an EE or math wonk can easily interact with a search system via hex editors and APIs. Once you've killed a search system, you may be back at ground zero, which means reinstalling the system, reindexing the content, and resetting the custom controls.

The curves of despair make evident the fact that you have to be a savvy manager to get your search and content processing system up and running, stable, and delivering useful results when the deck is stacked against you. The vendor can snooker you. Your IT department can cut you off at the knees. Your colleagues, even the cute person in marketing, can turn into banshees.

Each of these triggers causes money to teleport itself from your budget into the bank accounts of others.

Wrap Up

Here’s what I hear frequently. “You are telling me things that I never heard before. Our IT department is outstanding. Our vendors are first rate. Our users have embraced our collegial culture. Furthermore, I have more than a decade of experience managing complex projects. Frankly, I don’t believe you.”

You have my permission to ignore the curves of despair. If you are a vendor, you can ignore these curves and pooh-pooh them if an ambitious customer references them. If you are an IT person, you know much more than I do, so the curves are not worthy of comment. If you are the person running the search project, you can muddle forward, knowing, like Nero, what awaits when the private secretary shows up with the knife. Nero, a lousy manager, said as he died, "Qualis artifex pereo!"

In my work, two-thirds of the enterprise search systems I encounter lack comprehensive expense data. One implication is that money is spent as needed. A routine quarter-end budget report shows a spike in computer and system expenses. A clerk works backward to understand why expenses for consulting, software licenses, overtime, and data center services jumped. A cursory examination shows that the behind-the-firewall search system is the epicenter of the budget overrun. No one knew, of course, because "everyone" assumed that the license fee was the cost, the existing hardware was adequate, and the system wouldn't need babysitting, often with unscheduled triage, every two or three days.

Is there a way to deploy a search and retrieval system that stays within budget? The answer is, “Yes.” Is there a way to use content processing for business intelligence without falling into a river of red ink? The answer is, “Yes.” Will you tell me how to achieve this end in this Web log? The answer is, “No.”

In closing, the purpose of this essay is to set forth some basic insights into the problem of cost control as it relates to search and content processing. My objective is to educate, not convince.

Stephen Arnold, March 27, 2008
