Time: Your Search Infrastructure’s Deadliest Enemy

April 8, 2008

Look at these two diagrams.

The first diagram–the one with a single red arrow–represents the volume of content your search and content processing system must process. Unless you work in a very unusual organization, you and your colleagues produce digital information. Let’s confine our discussion to text, but keep in mind that your organization will also produce digital images, audio, and video. At some point your search or content processing system will have to deal with these types of content as well.

The story in the chart with the single arrow is simple: each day the amount of content increases. The content varies, and it falls into one of three broad categories. Some content is original; that is, there is no previous version or draft. You or one of your colleagues generates an original document and stores it on her computer. If your colleague is a teleworker, she might upload the document to your company’s server.

The second type of content is information that comes into your organization from a third party or an outside source. Some organizations use “free” information found on public Web sites. The information can be copied to a hard drive so someone in the marketing department can keep track of a competitor’s actions or follow a news story. Most people assume that if there is a story or document available “on the Internet”, it’s okay to use it for research. College students learn to gather information, read it, select bits and pieces, and prepare a footnote to tell the reader where the information came from.

The third type of content is pre-existing information that exists in different versions. Let me give you an example. When I worked for a large, publicly-traded company, we issued a quarterly news release. Instead of writing an original news release every 12 weeks, we took a previous news release, inserted the new information, and made sure the headline and date were in sync with the present quarter. We did not “create” an original document; we created a new document based on a previous document. We can argue that this is a new document, but for our purposes, let’s consider this a version of an earlier document.

A search or content processing system usually recognizes a “new” or “changed” document using some basic techniques. In fact, few people think much about these techniques because the function is so basic, so deeply embedded in the search or content processing system, that no one gives it a second thought. One technique is to look at the file date and time stamp. When the search or content processing system identifies a document with today’s date, it indexes that document. Pretty simple.
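As a rough illustration of the time stamp technique, here is a minimal Python sketch. It assumes the content lives in an ordinary file system and that the system remembers when it last indexed; the function name find_changed_files and its parameters are hypothetical, not taken from any particular search product.

```python
import os


def find_changed_files(root, last_index_time):
    """Return paths under root whose modification time is later than
    last_index_time (seconds since the epoch).

    Hypothetical helper: real systems may also compare checksums or
    consult a change log, which this sketch does not do.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # A file touched after the last indexing pass counts as new or changed.
            if os.path.getmtime(path) > last_index_time:
                changed.append(path)
    return sorted(changed)
```

A time stamp check is cheap, but it treats a re-saved, otherwise identical file as changed, so the indexing workload can grow faster than the amount of genuinely new content.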

Now look at the second chart. There’s a blue arrow, and it’s labeled system capacity. Unlike the red arrow, this blue arrow starts out rising and then intersects with the red arrow and begins to head south. This simple chart tells us that any search infrastructure degrades over time. There are many possible reasons for a search and content processing system to lose steam.

I want to focus on a broader issue; namely, any search and content processing system will lose steam. In fact, even if you have an upgrade plan and implement it, your existing search and content processing system will be unable to maintain the content processing throughput and query response time you experienced when the system and you were on your honeymoon.

Meet Your Enemy–Time

Let’s look at this assertion more closely. Some pundits rarely acknowledge performance. These mavens point to the nifty features of a system, never passing within shouting distance of the “time nemesis”. Your IT department almost certainly ignores time-centric degradation. Those with “grease on their hands” from working with search and content processing systems understand how the whole suffers no matter how much attention and love we lavish on the parts of the search or content processing system.

Let’s look at the reasons:

First, crashes and corrupt indexes are inevitable. When one occurs, the search system must be rebuilt. The best case scenario is that downtime is limited because a system restore works. The newly-revived system needs to play catch up, maybe for an hour or two, possibly a day. But what happens when the crashes occur after a system has been operating for a year and the restores no longer work? To get back online, you reindex and reprocess the content. Unless you have a digital genie at your elbow, the longer a system is in operation, the greater the time cost of these crashes. At some point, you cannot restore in the time available, and you have confronted the reality of an aging system.
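The arithmetic behind that last point can be sketched in a few lines of Python. Everything here is an assumption for illustration: the function names are hypothetical, the growth is treated as a fixed number of documents per day, and the indexing rate is treated as constant, which a real system is unlikely to honor.

```python
def rebuild_hours(doc_count, docs_per_hour):
    """Hours needed to reindex doc_count documents at a constant rate."""
    return doc_count / docs_per_hour


def days_until_rebuild_exceeds(window_hours, start_docs,
                               docs_added_per_day, docs_per_hour):
    """Count the days until a full rebuild no longer fits in window_hours.

    Assumes linear content growth and a fixed indexing rate; both are
    simplifications chosen to keep the sketch short.
    """
    if docs_added_per_day <= 0:
        raise ValueError("docs_added_per_day must be positive")
    docs = start_docs
    days = 0
    while rebuild_hours(docs, docs_per_hour) <= window_hours:
        docs += docs_added_per_day
        days += 1
    return days


# Illustrative numbers only: a 48-hour outage window, one million documents
# on day zero, 10,000 new documents a day, 50,000 documents indexed per hour.
print(days_until_rebuild_exceeds(48, 1_000_000, 10_000, 50_000))
```

The particular numbers do not matter. What matters is that the rebuild time climbs with every day of operation while the outage your users will tolerate stays the same.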

Second, when the volume of content increases, the system has more work to do. Because hot spots can occur without any indication of a cause, you must invest time in finding, troubleshooting, and fixing the problem. The volume of content itself may be the trigger, and you have no choice short of throwing hardware at the problem to keep pace. Once you tack on software fixes and throw hardware into the mix, you have increased the likelihood of unexpected problems. At this point, you revisit the crashes and corrupt indexes problem.

Third, here are three everyday situations that affect performance:

The number of users of your search and content processing system increases, and you experience performance degradation.
The size of the individual documents you process rises; for example, the new XML file formats are often beefier than plain text created in BBEdit or Notepad.
The number of queries your system processes goes up, possibly because users actually exercise the system or because standing queries (alerts) or calls from other enterprise systems keep your search bunnies hopping.

Let’s assume you do some software tuning and throw hardware at a performance problem. Any one or combination of these three factors means that your system will still degrade. In my experience, the intervals between good performance and severe degradation become shorter over time. Eventually, you reach a point where you don’t have the money or staff to get the system back on track.

Exogenous Factors

There are two factors over which neither you, the vendor, nor your IT team has any control. Let’s look at each of these. First, users want more features. Some may be of great urgency; for example, setting up alerts to track a new regulation that affects your company’s ability to launch new products. You have to make your search and content processing system acquire, filter, and package new information streams. Your knowledge of these streams, by definition, is limited. You have not worked with them before, and it is difficult to understand certain issues until you are “up and running.” Therefore, you add new features, add content, and set up some type of alert service. None of these is rocket science, but three occurring in a short span of time almost guarantees that search and content processing performance will degrade.

Second, vendors enhance their search and content processing systems. Keep in mind that what the vendor tells you about an upgrade or a new version may not be what you need to know for your installation. A vendor can implement some whizzy new content processing feature, and your system simply chokes. A call to the vendor evokes, “Well, we haven’t encountered that issue before. I can arrange for one of our engineers to contact you.” Alternatively, you may hear from your IT department, “Okay, we’ll look at the problem, but we may not be able to do anything about it right away.”

Now you are in the middle of the time arc. Each hour, day, or week that you are not running the system, you are falling further behind in your basic indexing function. At some point, two or more issues coincide, and you are unable to restore the system to its earlier performance levels.

What’s the Fix?

Surprisingly, most of the organizations with this problem choose to do nothing about performance. The reason is that “good enough” response time may be measured in spans of a minute, maybe two or more, from the time the query is launched until the results come back. When the system slows down that much, users find other ways to get answers. When usage drops off, system performance improves. You and I both know that senior management has no clue about search and content processing, so it’s unlikely that a senior vice president pulls a deus ex machina. You live with what you have.

Other organizations are sufficiently addled by a million dollar system that doesn’t work after a year of fiddling that the “rip and replace” approach becomes the only way out. How often does this severe, expensive, and ultimately temporary fix get used? I would suggest that when a major search vendor announces a major contract win, you are dealing with a knee jerk, rip it out, and start over situation. Little wonder that some vendors share half their customers with their competitors. When it’s not possible to “rip and replace”, some organizations simply pay for two systems and operate them until it’s time to look for a third option. No wonder most Fortune 500 companies have five or more “enterprise search” systems.

The most mesmerizing approach is to watch an organization “fix” an existing system. Sometimes the IT department–actually one or two people who “studied” search at university and an intern–tackles this job. The system that pops out of the IT “oven” is usually pretty much like the system before it became inoperable. “Fixes” are much like the Havana, Cuba approach to auto repair: do what you need to do to get that 1954 Ford running. Some of these home brews never work, so the organization jumps to a rip and replace approach or looks the other way while a single department buys its own search solution, maybe a Google Search Appliance.

What to Do?

I certainly can’t tell you what to do. I have no idea what type of search and content processing swamp you have in your backyard. The main point is to recognize that this search system degradation plagues any system.

Let me offer two or three ideas, and you can get more if you have a copy of Beyond Search: What to Do When Your Search System Won’t Work?

First, plan for degradation. Yes, sit down and make a list of actions that must occur every month once you get your system installed. Some of the items on this timetable are [a] study system logs for signs of trouble; [b] monitor the volume, size, and number of documents the system cannot process; and [c] install upgrades and enhancements on a staging server, test the modified system, and then upgrade the production system.
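Item [b] lends itself to a simple monthly check. The Python sketch below is one possible shape for it, not a feature of any specific product: the MonthlyStats record, the months_to_review function, and the 2 percent threshold are all assumptions made for the example, and the counts would have to come from whatever logs or reports your own system provides.

```python
from dataclasses import dataclass


@dataclass
class MonthlyStats:
    """Hypothetical monthly tally pulled from your system's own logs."""
    month: str
    docs_received: int
    docs_failed: int


def failure_rate(stats):
    """Fraction of received documents the system could not process."""
    if stats.docs_received == 0:
        return 0.0
    return stats.docs_failed / stats.docs_received


def months_to_review(history, max_failure_rate=0.02):
    """Return the months whose failure rate exceeds the chosen threshold.

    The 2 percent default is an arbitrary placeholder, not a benchmark.
    """
    return [s.month for s in history if failure_rate(s) > max_failure_rate]


history = [
    MonthlyStats("2008-01", 10_000, 50),
    MonthlyStats("2008-02", 12_000, 300),
]
print(months_to_review(history))
```

A check this small is easy to run every month, which is the point: the trend in unprocessed documents is visible long before users start complaining.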

Second, make sure you know how to [a] restore the search and content processing system to a known good state before you have to use the restored version and [b] reindex previously processed content quickly. Hint: think repository.

Third, deal with reality. Understand that it is human nature for your IT department to assert its vast knowledge and expertise. Trust but verify. Your vendor’s first line of defense will be a person who attended a sales seminar and learned, “Be happy. Be optimistic. Don’t be negative.” No vendor willingly admits its software sucks unless you are married to the developer or have the wizard in thumb screws.

You can deploy a reliable, well-performing system. To achieve this goal, you have to use common sense, constrain customization and certain content, and invest time on an ongoing basis to keep your infrastructure up and running.

Now you see that time is your search engine’s deadliest enemy. Time will kill any search system. I’m not pessimistic about search, just pragmatic based on my real life experiences.

Stephen Arnold, April 8, 2008

Written by Stephen E. Arnold · Filed Under Database, Enterprise, Feature

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.