Search Fundamentals: Cost

August 10, 2008

Set aside the fancy buzzwords like taxonomies, natural language processing, and automatic classification. I want to relate one anecdote from a real-life conversation last week and then review five search fundamentals.

Anecdote

I’m sitting in a fancy conference room near Tyson’s Corner. The subject is large-scale information systems, not search. But search was assumed to be a function that would be available to the larger online system. And that’s where the problem with search fundamentals became a time bomb. The people in the room assumed that search was not a problem. One could send an email to one of the 300 vendors in the search and content processing market, negotiate a licensing deal, install the software, and move on to more important activities. After all, search was a mud flap on a very exotic sports car. Who gets excited about mud flaps?

The situation is becoming more and more common. I think it is a consequence of Googling. Most of the people with whom I meet in North America use Google for general Web search. The company’s name has become a verb, and the use of Google grows more pervasive each day. If I open Firefox, I have a Google search box available at all times.

If Google works, how hard can search be?

Five Fundamentals

I have created a table that lists five search fundamentals. Feel free to scan it, even recycle it in your search procurement background write-ups. I want to make a few comments about each fundamental and then wrap up this essay with what seems to me an obvious caution. The table appears below.

Assumption: Search is one “thing”
Comment: Search is a basket containing many different functions

Assumption: Search functions are pretty simple
Comment: Search subsystems are complex even in basic form, like the free desktop search from Google, Microsoft, or X1

Assumption: Today’s servers are fast enough to handle any search operation
Comment: Multiple servers are required for enterprise search when there is a lot of content to process quickly, when the number of simultaneous users spikes, and when content grows and changes

Assumption: Search costs are easy to control
Comment: Search costs are difficult to control because of the system’s demand for resources, unexpected demand spikes, and the ongoing need to customize, tune, and optimize the search system and the systems associated with it

Assumption: Semantic technology is ready for prime time
Comment: Semantic technology is ready for prime time if you have sufficient computing horsepower, appropriate technical capabilities, and a mechanism to clean up and format content for the semantic subsystem

First, search is not a thing. Search is a number of different functions. Some of these are straightforward; others give computer scientists headaches. The perception that search is a single application is widespread. Ignorance of the number of moving parts under the hood of a search engine makes it easy to create false assumptions, inflate expectations, and make decisions that have long-term cost complications. The third edition of the Enterprise Search Report ran to almost 600 pages as a direct consequence of feedback from purchasers of the first two editions I wrote. Readers wanted more explanation. Even the simplest function, such as synonym expansion, effloresces into different technical options that are themselves tough to understand in terms of cost, computational efficiency, preparatory work, and interaction with other parts of the search system. The reaction is, “Now that I know what this means, please, tell me how to decide what to do in my specific application.” Even a cursory look at such details requires information that most people involved in search procurements do not have, lack the time to dig out, and lack the training to assess for cost implications.
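
To make the synonym expansion point concrete, here is a minimal sketch of one of those technical options, query-time expansion. The synonym table, tokenizer, and function names are illustrative assumptions, not part of any particular product; the alternative, expanding terms at index time, trades a larger index for faster queries, which is exactly the kind of cost decision discussed above.

```python
# Hypothetical sketch of query-time synonym expansion. The synonym map
# is an assumption; a real system must also decide whether to expand at
# index time, how to weight expanded terms, and how to keep the map current.

SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "invoice": ["bill", "statement"],
}

def expand_query(query: str) -> list[str]:
    """Return the query terms plus any synonyms for each term."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(SYNONYMS.get(term, []))
    return expanded

print(expand_query("car invoice"))
# ['car', 'automobile', 'vehicle', 'invoice', 'bill', 'statement']
```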

Second, search functions are pretty simple. This is not true. Even the most basic search function, such as looking for the name “John Smith” in an email archive, is tough to deliver. If you want to see the problems, fire up a copy of Outlook Express, any version of Outlook, or Thunderbird and try to find “John Smith”. Chances are you have to go through a number of steps to specify where you want the system to look. Even then you have to wait while the search system grunts through the email looking for a match. At the most brain-dead level of search, the system is sucking resources, writing temporary files, and gearing up to output a list of possible matches. Now try to discover the people related to “John Smith” in five gigabytes of email obtained through the legal discovery process. You get the idea. When you try to scale search, you push the boundaries of systems engineering. Google has spent a decade figuring out how to scale, and it still experiences system failures such as the one that killed Google Apps in the first week of August. When a company thinks that search is “pretty simple”, I refuse to work with that organization. The grim reaper is putting on his pads and thinking, “This will be fun.”
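
As an illustration of why even the “John Smith” example gets expensive, here is a hedged sketch contrasting a brute-force scan with an inverted index. The messages, field names, and tokenization are invented for the example and gloss over phrase handling, stemming, and relevance ranking.

```python
# Hypothetical illustration: why "find John Smith in the email archive"
# is not as cheap as it looks. A linear scan touches every message;
# an inverted index trades up-front processing for fast lookups.

from collections import defaultdict

emails = [
    {"id": 1, "body": "Meeting notes from John Smith about the merger"},
    {"id": 2, "body": "Lunch order for the team"},
    {"id": 3, "body": "John Smith forwarded the discovery documents"},
]

# Brute force: cost grows with the total text per query -- fine for
# three messages, painful for five gigabytes of discovery email.
def scan(term: str) -> list[int]:
    return [m["id"] for m in emails if term.lower() in m["body"].lower()]

# Inverted index: pay the indexing cost once, then each query is a
# dictionary lookup plus an intersection of posting lists.
index: dict[str, set[int]] = defaultdict(set)
for m in emails:
    for token in m["body"].lower().split():
        index[token].add(m["id"])

def lookup(phrase: str) -> set[int]:
    postings = [index[t] for t in phrase.lower().split()]
    return set.intersection(*postings) if postings else set()

print(scan("John Smith"))     # [1, 3]
print(lookup("john smith"))   # {1, 3}
```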

Third, today’s servers are super fast and can handle any search task thrown at them. Wrong. The computational demands of search are not fully appreciated until your $3 million search system dies without warning. Fast CPUs are not enough. The CPUs have to be lashed to tasks. The tasks have to be orchestrated to keep the processing pipelines moving. A dearth of resources in any one of the complex subsystems can kill the system. The big guys like Amazon and Microsoft spend hundreds of millions of dollars to keep their systems from choking and dying. Most search licensees lack the resources to stay ahead of the search system’s continuing demand for resources. Engineering and hardware must be applied intelligently and as an ongoing activity. A single super fast server won’t be able to index one document if its supporting infrastructure is flawed. A fast CPU or 20 won’t be sufficient. Think money for infrastructure, engineering, hardware, and optimization for the life of the search system.
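
A rough, back-of-the-envelope sizing sketch shows why a single fast box rarely survives contact with enterprise content volumes. Every figure below is an invented assumption for illustration; real capacity planning depends on the specific engine, content mix, and query patterns.

```python
# Toy capacity estimate. All numbers are assumptions chosen for
# illustration; the point is that content volume and peak query load
# each push the hardware requirement up independently.

docs_total      = 50_000_000   # documents in the collection
index_rate_node = 40           # documents/second one node can process
peak_qps        = 120          # queries/second at the worst spike
qps_per_node    = 30           # queries/second one node can sustain

seconds_per_day = 86_400
full_index_days = docs_total / (index_rate_node * seconds_per_day)
query_nodes     = -(-peak_qps // qps_per_node)   # ceiling division

print(f"Initial full index on one node: {full_index_days:.1f} days")  # ~14.5
print(f"Nodes needed to absorb the query spike: {query_nodes}")       # 4
```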

Fourth, search costs are not easy to control. There are four reasons, not usually understood until a person has been through a search deployment and confronted the budget issues. A toy sketch of how these costs feed on one another follows the list.

  1. Failures. These must be addressed immediately. Fixing a problem is often open-ended, so there is no way to control the costs short of leaving the system broken.
  2. Hardware and infrastructure. Most search slowdowns can only be remediated by throwing hardware at the problem. The hardware fix is only temporary, so when the next slowdown occurs, you get to buy more hardware. Because hardware is not available at a moment’s notice, expensive workarounds are needed. Costs increase quickly.
  3. People. Some search problems are puzzles. If one person can’t solve the puzzle, other people must be paid to tackle the problem. Solving an open-ended problem related to search and content processing can consume significant time. Time equals money. The amount of time needed to solve the puzzle is not known, so the money required is not known either.
  4. Features. A function doesn’t do what you want done. The fix is to create a new function. If you can do this yourself, that’s a sunk or displaced cost. If you have to pay someone to create the function, you pay. If there is a problem with the function, you are back in item three’s cost sinkhole. But the killer with new functions is unexpected dependencies. What the new function does may cause an unexpected event elsewhere in the system, throwing you back into item two or item three. This is a cost feedback loop, which is tough to dampen.
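
The feedback loop in item four can be illustrated with a toy simulation. Every figure is invented; the only point is that unplanned events in one category drag in spending from the others, so the actual spend drifts well above the plan.

```python
# Toy simulation of the cost feedback loop described above. All figures
# are invented assumptions; nothing here models a real procurement.

import random

random.seed(7)

budgeted_monthly = 20_000  # planned operating cost in dollars
surprise_cost = {
    "failure": 15_000,
    "hardware": 30_000,
    "people": 25_000,
    "feature": 10_000,
}

total = 0
for month in range(12):
    total += budgeted_monthly
    # Assume roughly one unplanned event per quarter.
    if random.random() < 0.25:
        event = random.choice(list(surprise_cost))
        total += surprise_cost[event]
        # Failures and new features usually drag in extra people time.
        if event in ("failure", "feature"):
            total += surprise_cost["people"] // 2

print(f"Planned 12-month cost:   ${budgeted_monthly * 12:,}")
print(f"Simulated 12-month cost: ${total:,}")
```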

Semantic technology is ready for prime time. Absolutely. You can make it work if you have the appropriate resources. This means [a] infrastructure, [b] the right people with the right expertise, and [c] a tight specification for what you want the system to do plus the right semantic vendor to deliver what you need. You can see that the specter of issues one through four haunts you even with zippy new search and content processing technology.
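
The “clean up and format content” requirement is easy to underestimate. Below is a hedged sketch of the kind of normalization pass a semantic subsystem typically assumes; the cleanup rules and the sample document are illustrative, and real pipelines also deal with encodings, boilerplate removal, duplicates, and source-specific quirks.

```python
# Hypothetical content cleanup before handing text to a semantic
# subsystem. The rules below are illustrative, not a complete pipeline.

import re
import unicodedata

def normalize_for_semantics(raw: str) -> str:
    """Prepare one document for downstream semantic processing."""
    text = unicodedata.normalize("NFKC", raw)   # unify odd character forms
    text = re.sub(r"<[^>]+>", " ", text)        # strip stray markup
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

raw_doc = "<p>Q2   report:\u00a0revenue up 12%</p>"
print(normalize_for_semantics(raw_doc))
# Q2 report: revenue up 12%
```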

Net Net

It is possible to deploy a functional, reliable, fiscally responsible search and content processing system. You need to follow some simple “rules of the road” to make this happen:

  1. Avoid watching Star Trek and assuming what you see in reruns can be done in your organization.
  2. Keep the functionality basic, and when the basics are working, add new features over time.
  3. Create a tight specification and contract with vendors to deliver to that specification. If the vendor doesn’t meet the specification, call your attorney and get your money back.
  4. Manage expectations and keep those expectations realistic.
  5. No single system is perfect. That goes for Google, IBM, Microsoft, Oracle, and 296 other companies. Expect to work at search over a period of time.

Stephen Arnold, August 12, 2008

Comments


  1. Yaser Bishr on August 10th, 2008 4:42 pm

    Nice post, Stephen.

    I suggest another assumption: BI and search are different.

    BI and enterprise search are becoming two sides of the same coin. I predict that search will become as common as email in the enterprise once there is some understanding of the business value this convergence brings. See SAP’s acquisition of Business Objects, which earlier acquired Inxight. Executives now understand that more knowledge sharing across the enterprise will lead to better operational performance. Developing a corporate strategy that encompasses both is no longer optional. Doing so will help drive performance management and make BI information more accessible closer to the base of the organizational pyramid.

  2. Stephen E. Arnold on August 10th, 2008 8:37 pm

    Yaser Bishr,

    Thanks for taking the time to comment. I agree but wish to suggest that we are not dealing with “two sides”; the dimensionality is more complex.

    Stephen Arnold, August 10, 2008

