Search System Bottlenecks

February 21, 2008

In a conference call yesterday (February 19, 2008), one of the well-informed participants asked, “What’s with the performance slow downs in these behind-the-firewall search systems?”

I asked, “Is it a specific vendor’s system?”

The answer, “No, it seems like a more general problem. Have you heard anything about search slow downs on Intranet systems?”

I do hear quite a bit about behind-the-firewall search systems. People find my name on the Internet and ask me questions. Others get a referral to me. I listen to their question or comment and try to pass those with legitimate issues to someone who can help out. I’m not too keen on traveling to a big city, poking into the innards of a system, and trying to figure out what went off track. That’s a job for younger, less jaded folks.

But yesterday’s question got me thinking. I dug around in my files and discovered a dated, but still useful diagram of the major components of a behind-the-firewall search system. Here’s the diagram, which I know is difficult to read, but I want to call your attention to the seven principal components of the diagram and then talk briefly about hot spots. I will address each specific hot spot in a separate Web log post to keep the length manageable.

This essay, then, takes a broad look at the places I have learned to examine first when trying to address a system slow down. I will try to keep the technical jargon and level of detail at a reasonable level. My purpose is to provide you with an orientation to hot spots before you begin your remediation effort.

The Bird’s Eye View of a Typical Search System

Keep in mind that each vendor implements the search sub systems in a way appropriate for their engineering. In general, if you segment the sub systems, you will see a horizontal area in the middle of this diagram surrounded by four key subsystems, the content, and, of course, the user. The search system exists for the user, which many vendors and procurement teams happily overlook.

This diagram has been used in my talks at public events for more than five years. You may use this for your personal use or in educational activities without restrictions. If you want to use it in a publication, please, provide contact me for permission.

Let’s run through this diagram and then identify the hot spots. You see some arrows. These are designed to show the pipeline through which content, queries, and results flow. In several places, you see arrows pointing different directions in close proximity. It is obvious that in these interfaces, a glitch of any type will create a slowdown. Now let’s identify the main features.

In the upper left hand corner is a blue sphere that represents content. For our purpose, let’s just assume that the content resides behind the firewall, and it is the general collection of Word documents, email, and PowerPoints that make up much of an organization’s information. Pundits calculate that 80 percent of an organization’s information is unstructured. My research suggests that the ratio of structured to unstructured data varies sharply by type of organization. For now, let’s just deal with generalized “content”. In the upper right hand corner, you see the user. The user, like the content, can be generalized for our purposes. We will assume that the user navigates to a Web page, sees a search box, or a list of hot links, and enters a query in some way. I don’t want to de emphasize the user’s role in this system, but I want to set aside her needs, the hassle of designing an interface, and other user-centric considerations such as personalization.

Backbone or Framework

Now, let’s look at the horizontal area in the center of the diagram show below:

You can see that there are specific sub systems within this sub system labeled storage clusters. This is the first key notion to keep in mind when thinking about performance of a search system. The problem that manifests itself at an interface may be caused by a sub component in a sub system. Until there’s a problem, you may not have thought about your system as a series of nested boxes. What you want to keep in mind is that until you have a performance bottleneck is that the many complex moving parts were working pretty well. Don’t criticize your system vendor without appreciating how complicated a search system is. These puppies are far from trivial — including the free one you download to index documents on your Mac or PC.

In this rectangle are “spaces” — a single drive or clusters of servers — that hold content returned from the crawling sub system (described below), the outputs of the document processing sub system (described below), the index or indexes, the system that holds the “content” in some proprietary or other representation, and a component to house the “metrics” for the system. Please keep in mind that running analytics is a big job, and you will want to make sure that you have a way to store, process, and manipulate system logs. No system logs — well, then, you are pretty much lost in space when it comes to trouble shooting. One major Federal agency could not process its logs; therefore, usage data and system performance information did not exist. Not good. Not good at all.

The components in this sub system handle content acquisition, usually called crawling or spidering. I want to point out that the content acquisition sub system can be a separate server or cluster of servers. Also, keep in mind that keeping the content acquisition sub system on track requires that you fiddle with rules. Some systems like Google’s search appliance reduce this to a point-and-click exercise. Other systems require command line editing of configuration files. Rules may be check boxes or separate scripts / programs. Yes, you have to write these or pay someone to do the rule fiddling. When the volume of content grows, this sub system can choke. The result is not a slow down, but you may find that some users say, “I put the content in the folder for indexing, and I can’t find the document.” No, the user can’t. It may be snagged in an over burdened content acquisition sub system.

Document Processing / Document Transformation

Let me define what I mean by document processing. I am using this term to mean content normalization and transformation. In Beyond Search, I use the word transformation to stream line the text. In this sub system, I am not discussing indexing the content. I want to move a Word file from its native Word format to a form that can be easily ingested by the indexing sub system described in the next section of this essay.

This sub system pulls or accepts the information acquired by the spidering sub system. Each file is transformed into an representation that the indexed sub system (described below) can understand. Transformation is now a key part of many behind-the-firewall systems. The fruit cake of different document types are normalized; that is, made standard. If a document cannot be manipulated by the system, then that document cannot be indexed. An increasing number of document transformation sub systems store the outputs in an XML format. Some vendors include an XML data base or data management system with their search system. Others use a data base system and keep it buried in the “guts” of their system. This notion of transformation means that disc writes will occur. The use of a data base system “under the hood” may impose some performance penalties on the document processing sub system. Traditional data base management systems can be input – output bound. A bottle neck related to an “under the hood” third-party, proprietary, or open source data base can be difficult to speed up if resources like money for hardware are scarce.

Indexing

Most vendors spend significant time explaining the features and functions of their systems’ indexing. You will hear about semantic indexing, latent semantic indexing, linguistics, and statistical processes. There are very real differences between vendors’ systems. Keep in mind that any indexing sub system is a complicate beastie. Here’s a blow up from the generalized schematic above:

In this diagram, you see knowledge bases, statistical functions, “advanced processes” (linguistics / semantics), and a reference to an indexing infrastructure. Indexing performs much of the “heavy lifting” for a search system, and it is absolutely essential that the indexing sub system be properly resourced. This means bandwidth, CPU cycles, storage, and random access memory. If the indexing sub system cannot keep pace with the amount of information to be indexed and the number of queries passed against the indexes, a number of symptoms become evident to users and the system administrator. I will return to the problems of an overloaded indexing subsystem in a separate essay in a day or two. Note that I have included “manual tagging” in the list of fancy processes. The notion of a fully automatic system, in my experience, is a goal, not a reality. Most indexing systems require over sight by a subject matter expert or indexing specialist. Both statistical and linguistic systems can get “lost in space.” There are many reasons such as language drift, neologisms, and exogenous shifts. The only reliable way to get these indexing glitches resolved is to have a human make the changes to the rules, the knowledge bases, or the actual terms assigned to individual records. Few vendors like to discuss these expensive, yet essential, interventions. Little wonder that many licensees feel snookered when “surprises” related to the indexing sub system become evident and then continue to crop up like dandelion.

Query Processing

Query processing is a variant of indexing. Queries have to be passed against the indexes. In effect, a user’s query is “indexed”. The query is matched or passed against the index, and the results pulled out, formatted, and pushed to the user. I’m not going to talk about stored queries or what used to be called SDI (selective dissemination of information), saved searches, or filters. Let’s just talk about a key word query.

The query processing sub system consists of some pre – and post – processing functions. A heavily-used system requires a robust query processing “front end.” The more users sending queries at the same time, the more important it is to be able to process those queries and get results back in an acceptable time. My tests show that a user of a behind-the-firewall system will wait as much as 15 seconds before complaining. In my tests on systems in 2007, I found an average query response time in the 20 second range, which explains in large part why employees are dissatisfied with their incumbent search system. The dissatisfaction is a result of an inadequate infrastructure for the search system itself. Dissatisfaction, in fact, does not single out a specific vendor. The vendors are equally dissatisfying. The vendors, obviously, can make their systems run faster, but the licensee has the responsibility to provide a suitable infrastructure on which to run the search system. In short, the “dissatisfaction” is a result of poor response time. Only licensees can “fix” this infrastructure problem. Blaming a search vendor for lousy performance is often a false claim. Notice that the functions performed within the query processing sub system are complex; for example, “on the fly” clustering, relevance ranking, and formatting. Some systems include work flow components that shape queries and results to meet the needs of particular employees or tasks. The work flow component then generates the display appropriate for the work task. Some systems “inject” search results into a third-party application so the employee has the needed information on a screen display related to the work task; for instance, a person’s investments or prior contact history.

Back to Hot Spots

Let me reiterate — I am using an older, generalized diagram. I want to identify the complexities within a representative behind-the-firewall search system. The purpose of this exercise is to allow me to comment on some general hot spots as a precursor to a quick look in a subsequent essay about specific bottle necks in subsystems.

The high level points about search system slow downs are:

A slow down in one part of the system may be caused be a deeper issue. In many cases, the problem could be buried deep within a particular component in a sub system. Glitches in search systems can, therefore, take some time to troubleshoot. In some cases, there may be no “fix”. The engineers will have to “work around” the problem which may mean writing code. Switching to a hosted service or a search appliance may be the easiest way to avoid this problem.
The slow down may be outside the vendor’s span of control. If you have an inadequate search system infrastructure, the vendor can advise you on what to change. But you will need the capital resources to make the change. Most slow downs in search systems are a result of the licensee’s errors in calculating CPU cycles, storage, bandwidth, and RAM. The cause of this problem is ignorance of the computational burden search systems place on their infrastructure. The fast CPUs are wonderful, but you may need clusters of servers, not one or two servers. The fix is to get outside verification of the infrastructure demands. If you can’t afford the plumbing, shift to a hosted solution or license an appliance.
A surge in either the amount of content to index or the numbers of queries to process can individually bring a system to a half. When the two coincide, the system will choke, often failing. If you don’t have log data and you don’t review it, you will not know where to begin looking for a problem. The logs are often orphans, and their data are voluminous, hard to process, and cryptic. Get over it. Most organizations have a steady increase in content to be processed and more users sending queries to the search system despite their dissatisfaction with its performance. In this case, you will have a system that will fail and then fail again. The fix is to buckle down, manage the logs, study what’s going on in the sub systems, and act in an anticipatory way. What’s this mean? You will have to continue to build out your system when performance is acceptable. If you wait until something goes wrong, you will be in a very precarious position.,

To wrap up this discussion, you may be reeling from the ill-tasting medicine I have prescribed. Slow downs and hot spots are a fact of life with complex systems such as search. Furthermore, the complexity of the search systems in general and their sub systems in particular are essentially not fully understood by most licensees, their IT colleagues, or their management. In the first three editions of the Enterprise Search Report, I discussed this problem at length. I touch upon it briefly in Beyond Search because it is critical to the success of any search or content processing initiative. If you have different experiences from mind, please, share them via the comments function on this Web log.

I will address specific hot spots in the next day or two.

Stephen Arnold, February 21, 2008

Written by Stephen E. Arnold · Filed Under Cost, Enterprise, Search

Comments

Comments are closed.

Search the site
Subscribe to Beyond Search
Feature archive
News archive

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.