The Content Acquisition Hot Spots
February 21, 2008
I want to take a closer look at behind-the-firewall search system bottle necks. This essay talks about the content acquisition hot spot. I want to provide some information, but I will not go into the detail that appears in Beyond Search.
Content acquisition is a core function of a search system. “Classic” search systems are designed to pull content from a server where the document resides to the storage device the spider uses to hold the new or changed content. Please, keep in mind that you will make a copy of a source document, move it over the Intranet to the spider, and store that content object on the storage device for new or changed content. The terms crawling or spidering have been used since 1993 to describe the processes for:
- Finding new or changed information on a server or in a folder
- Copying that information back to the search system or the crawler sub system
- Writing information about the crawlers operation to the crawler log file.
On the surface, crawling seems simple. It’s not. Crawlers or spiders require configuration. Most vendors provide a browser-based administrative “tool” that makes is relatively easy to configure the most common settings. For example, you will want to specify how often the content acquisition sub system checks for new or changed content. You also have to “tell” the crawling sub system what servers, computers, directories, and files to acquire. In fact, the crawling sub system has a wide range of settings. Many systems allow you to provide create “rules” or special scripts to handle certain types of content; for example, you can set a specific schedule for spidering for certain servers or folders.
In the last three or four years, more search systems have made it easier for the content acquisition system to receive “pushed” content. The way “push” works is that you write a script or use the snippet of code provided by the search vendor to take certain content and copy it to a specific location on the storage device where the spider’s content resides. I can’t cover the specifics of each vendor’s “push” options, but you will find the details in the Help files, API documentation, or the FAQs for your search system.
Pull
Pull works pretty well when you have a modest amount of new or changed content every time slice. You determine the time interval between spider runs. You can make the spider aggressive and launch the sub system every 60 seconds. You can relax the schedule and check for changed content every seven days. In most organizations, running crawlers every minute can suck up available network bandwidth and exceed the capacity of the server or servers running the crawler sub system.
You now have an important insight into the reason the content acquisition sub system can become a hot spot. You can run out of machine resources, so you will have to make the crawler less aggressive. Alternatively, you can saturate the network and the crawler sub system by bringing back more content than your infrastructure can handle. Some search systems bring back content that exceeds available storage space. Your choices are stark — limit the number of servers and folders the crawling sub system indexes.
When you operate a behind-the-firewall search system, you don’t have the luxury a public Web indexing engine has. These systems can easily skip a server that times out or not revisit a server until the next spidering cycle. In an organization, you have to know what much be indexed immediately or as close to immediately as you can get. You have to acquire content from servers that may time out.
The easy fixes for crawler sub system problems are likely to create some problems for users. Users don’t understand why a document may not be findable in the search system. The reason may be that the crawler subsystem was not able to get the document back to the search system for many different reasons. Believe me, users don’t care.
The key to avoiding problems with traditional spidering boils down to knowing how much new and changed content your crawler sub system must handle at peak loads. You also must know the rate of growth for new and changed content. You need the first piece of information to specify the hardware, bandwidth, storage, and RAM you need for the server or servers handling content acquisition. The second data point gives you the information you need to upgrade your content acquisition system. You have to keep the content acquisition system sufficiently robust to handle the ever-larger amount of information generated in organizations today.
The cause of a hot spot in content acquisition is due to:
- Insufficient resources
- Failure to balance crawler aggressiveness with machine resources
- Improper handling of high-latency response from certain systems whose content must be brought back to the search storage sub system for indexing.
The best fix is to do the up front work accurately and thoroughly. To prevent problems from happening, a proactive upgrade path must be designed and implemented. Routine maintenance and tuning must be routine operations, not “we’ll do it later” procedures.
Push
Push is another way to reduce the need for the content acquisition sub system to “hit” the network at inopportune times. The idea is simple, and it is designed to operate in a way directly opposite from the “pull” service that gave content “pull” a bad reputation. PointCast “pulled” content indiscriminately, causing network congestion.
The type of “pull” I am discussing is a fall out of the document inventory conducted before you deploy the first spider. You want to identify those content objects that can be copied from their host location to the content acquisition storage sub system using a crontab file or a script that triggers the transfer when [a] the new or changed data are available and [b] at off-peak times.
The idea is to keep the spiders from identifying certain content objects and then moving those files from their host location to the crawler storage device at inopportune moments.
In order to make “push” work, you need to know which content is a candidate for routine movement. You have to set up the content acquisition system to receive “pushed” content, which is usually handled via the graphical administrative interface. You need to create the script or customize the vendor-provided function to “wake up” when new of changed content arrives in a specific folder on the machine hosting the content. Then the script consults the rules for starting the “push”. The transfer occurs and the script should verify in some way that the “pushed” file was received without errors.
Many vendors of behind-the-firewall search systems support “push”. If your system does not, you can use the API to create this feature. While not trivial, a custom “push” function is a better solution than trying to get a crashed content acquisition sub system back online. You run the risk of having to reacquire the content, which can trigger another crash or saturate the network bandwidth despite your best efforts to prevent another failure.
Why You Want to Use Both Push and Pull
The optimal content acquisition sub system will use both “push” and “pull” techniques. Push can be very effective for high-priority content that must be indexed without waiting for the crawler to run a CRC, time stamp, or file size check on content.
The only way to make the most efficient use of your available resources is to designate certain content as “pull” and other content as “push”. You cannot guess. You must have accurate baseline data and update those data by consulting the crawler logs.
You will want to develop schedules for obtaining new and changed content via “push” and “pull”. You may want to take a look at the essay on this Web log about “hit boosting”, a variation on “push” content with some added zip to ensure that certain information appears in the context you want it to show up.
Where Are the Hot Spots?
If you have a single server and your content acquisition function chokes, you know the main hot spot — available hardware. You should place the crawler sub system on a separate server or servers.
The second hot spot may be the network bandwidth or lack of it when you are running the crawlers and pushing data to the content acquisition sub system. If you run out of bandwidth, you face some specific choices. No choice is completely good or bad. The choices are shades of gray; that is, you must make trade offs. I will hightly three, and you can work through the others yourself.
First, you can acquire less content less frequently. This reduces network saturation, but it increases the likelihood that users will not find the needed information. How can they? The information has not yet been brought to the search system for document processing.
Second, you can shift to “push”, de emphasizing “pull” or traditional crawling. The upside is that you can control how much content you move and when. The downside is that you may inadvertently saturate the network when you are “pushing”. Also, you will have to do the research to know what to “push” and then you have to set up or code, configure, test, debug, and deploy the system. If people have to move the content to the folder the “push” script uses, you will need to do some “human engineering”. It’s better to automate the “push” function in so far as possible.
Third, you have to set up a re-crawl schedule. Skipping servers may not be an option in your organization. Of course, if no one notices missing content, you can take your chances. I suggest knuckling down and doing the job correctly the first time. Unfortunately, short cuts and outright mistakes are very common in the content acquisition piece of the puzzle.
In short, hot spots can crop up in the crawler sub system. The causes may be human, configuration, infrastructure, or a combination of causes.
Is This Really a Big Deal?
Vendors will definitely tell you the content acquisition sub system is no big deal. You may be told, “Look we have optimized our crawler to avoid these problems” or “Look, we have made push a point-and-click option. Even my mom can set this up.”
Feel free to believe these assurances. Let me close with an anecdote. Judge for yourself about the importance of staying on top of the content acquisition sub system.
The setting is a large US government agency. The users of the system were sending search requests to an Intranet Web server. The server would ingest the request and output a list of results. No one noticed that the results were incomplete. An audit revealed that the content acquisition sub system was not correctly identifying changed content. The error caused more than four million reports to be incorrect. Remediation cost more than $10 million. Upon analyzing the problem, facts came to light that the crawler was incorrectly configured when the system was first installed, almost 18 months before the audit. In addition to the money lost, certain staff were laterally arabesqued. Few in Federal employ get fired.
Pretty exciting for a high-profile vendor, a major US agency, and the “professionals” who created this massive problem.
Now, how important is your search system’s content acquisition sub system to you?
Search System Bottlenecks
February 21, 2008
In a conference call yesterday (February 19, 2008), one of the well-informed participants asked, “What’s with the performance slow downs in these behind-the-firewall search systems?”
I asked, “Is it a specific vendor’s system?”
The answer, “No, it seems like a more general problem. Have you heard anything about search slow downs on Intranet systems?”
I do hear quite a bit about behind-the-firewall search systems. People find my name on the Internet and ask me questions. Others get a referral to me. I listen to their question or comment and try to pass those with legitimate issues to someone who can help out. I’m not too keen on traveling to a big city, poking into the innards of a system, and trying to figure out what went off track. That’s a job for younger, less jaded folks.
But yesterday’s question got me thinking. I dug around in my files and discovered a dated, but still useful diagram of the major components of a behind-the-firewall search system. Here’s the diagram, which I know is difficult to read, but I want to call your attention to the seven principal components of the diagram and then talk briefly about hot spots. I will address each specific hot spot in a separate Web log post to keep the length manageable.
This essay, then, takes a broad look at the places I have learned to examine first when trying to address a system slow down. I will try to keep the technical jargon and level of detail at a reasonable level. My purpose is to provide you with an orientation to hot spots before you begin your remediation effort.
The Bird’s Eye View of a Typical Search System
Keep in mind that each vendor implements the search sub systems in a way appropriate for their engineering. In general, if you segment the sub systems, you will see a horizontal area in the middle of this diagram surrounded by four key subsystems, the content, and, of course, the user. The search system exists for the user, which many vendors and procurement teams happily overlook.
This diagram has been used in my talks at public events for more than five years. You may use this for your personal use or in educational activities without restrictions. If you want to use it in a publication, please, provide contact me for permission.
Let’s run through this diagram and then identify the hot spots. You see some arrows. These are designed to show the pipeline through which content, queries, and results flow. In several places, you see arrows pointing different directions in close proximity. It is obvious that in these interfaces, a glitch of any type will create a slowdown. Now let’s identify the main features.
In the upper left hand corner is a blue sphere that represents content. For our purpose, let’s just assume that the content resides behind the firewall, and it is the general collection of Word documents, email, and PowerPoints that make up much of an organization’s information. Pundits calculate that 80 percent of an organization’s information is unstructured. My research suggests that the ratio of structured to unstructured data varies sharply by type of organization. For now, let’s just deal with generalized “content”. In the upper right hand corner, you see the user. The user, like the content, can be generalized for our purposes. We will assume that the user navigates to a Web page, sees a search box, or a list of hot links, and enters a query in some way. I don’t want to de emphasize the user’s role in this system, but I want to set aside her needs, the hassle of designing an interface, and other user-centric considerations such as personalization.
Backbone or Framework
Now, let’s look at the horizontal area in the center of the diagram show below:
You can see that there are specific sub systems within this sub system labeled storage clusters. This is the first key notion to keep in mind when thinking about performance of a search system. The problem that manifests itself at an interface may be caused by a sub component in a sub system. Until there’s a problem, you may not have thought about your system as a series of nested boxes. What you want to keep in mind is that until you have a performance bottleneck is that the many complex moving parts were working pretty well. Don’t criticize your system vendor without appreciating how complicated a search system is. These puppies are far from trivial — including the free one you download to index documents on your Mac or PC.
In this rectangle are “spaces” — a single drive or clusters of servers — that hold content returned from the crawling sub system (described below), the outputs of the document processing sub system (described below), the index or indexes, the system that holds the “content” in some proprietary or other representation, and a component to house the “metrics” for the system. Please keep in mind that running analytics is a big job, and you will want to make sure that you have a way to store, process, and manipulate system logs. No system logs — well, then, you are pretty much lost in space when it comes to trouble shooting. One major Federal agency could not process its logs; therefore, usage data and system performance information did not exist. Not good. Not good at all.
The components in this sub system handle content acquisition, usually called crawling or spidering. I want to point out that the content acquisition sub system can be a separate server or cluster of servers. Also, keep in mind that keeping the content acquisition sub system on track requires that you fiddle with rules. Some systems like Google’s search appliance reduce this to a point-and-click exercise. Other systems require command line editing of configuration files. Rules may be check boxes or separate scripts / programs. Yes, you have to write these or pay someone to do the rule fiddling. When the volume of content grows, this sub system can choke. The result is not a slow down, but you may find that some users say, “I put the content in the folder for indexing, and I can’t find the document.” No, the user can’t. It may be snagged in an over burdened content acquisition sub system.
Document Processing / Document Transformation
Let me define what I mean by document processing. I am using this term to mean content normalization and transformation. In Beyond Search, I use the word transformation to stream line the text. In this sub system, I am not discussing indexing the content. I want to move a Word file from its native Word format to a form that can be easily ingested by the indexing sub system described in the next section of this essay.
This sub system pulls or accepts the information acquired by the spidering sub system. Each file is transformed into an representation that the indexed sub system (described below) can understand. Transformation is now a key part of many behind-the-firewall systems. The fruit cake of different document types are normalized; that is, made standard. If a document cannot be manipulated by the system, then that document cannot be indexed. An increasing number of document transformation sub systems store the outputs in an XML format. Some vendors include an XML data base or data management system with their search system. Others use a data base system and keep it buried in the “guts” of their system. This notion of transformation means that disc writes will occur. The use of a data base system “under the hood” may impose some performance penalties on the document processing sub system. Traditional data base management systems can be input – output bound. A bottle neck related to an “under the hood” third-party, proprietary, or open source data base can be difficult to speed up if resources like money for hardware are scarce.
Indexing
Most vendors spend significant time explaining the features and functions of their systems’ indexing. You will hear about semantic indexing, latent semantic indexing, linguistics, and statistical processes. There are very real differences between vendors’ systems. Keep in mind that any indexing sub system is a complicate beastie. Here’s a blow up from the generalized schematic above:
In this diagram, you see knowledge bases, statistical functions, “advanced processes” (linguistics / semantics), and a reference to an indexing infrastructure. Indexing performs much of the “heavy lifting” for a search system, and it is absolutely essential that the indexing sub system be properly resourced. This means bandwidth, CPU cycles, storage, and random access memory. If the indexing sub system cannot keep pace with the amount of information to be indexed and the number of queries passed against the indexes, a number of symptoms become evident to users and the system administrator. I will return to the problems of an overloaded indexing subsystem in a separate essay in a day or two. Note that I have included “manual tagging” in the list of fancy processes. The notion of a fully automatic system, in my experience, is a goal, not a reality. Most indexing systems require over sight by a subject matter expert or indexing specialist. Both statistical and linguistic systems can get “lost in space.” There are many reasons such as language drift, neologisms, and exogenous shifts. The only reliable way to get these indexing glitches resolved is to have a human make the changes to the rules, the knowledge bases, or the actual terms assigned to individual records. Few vendors like to discuss these expensive, yet essential, interventions. Little wonder that many licensees feel snookered when “surprises” related to the indexing sub system become evident and then continue to crop up like dandelion.
Query Processing
Query processing is a variant of indexing. Queries have to be passed against the indexes. In effect, a user’s query is “indexed”. The query is matched or passed against the index, and the results pulled out, formatted, and pushed to the user. I’m not going to talk about stored queries or what used to be called SDI (selective dissemination of information), saved searches, or filters. Let’s just talk about a key word query.
The query processing sub system consists of some pre – and post – processing functions. A heavily-used system requires a robust query processing “front end.” The more users sending queries at the same time, the more important it is to be able to process those queries and get results back in an acceptable time. My tests show that a user of a behind-the-firewall system will wait as much as 15 seconds before complaining. In my tests on systems in 2007, I found an average query response time in the 20 second range, which explains in large part why employees are dissatisfied with their incumbent search system. The dissatisfaction is a result of an inadequate infrastructure for the search system itself. Dissatisfaction, in fact, does not single out a specific vendor. The vendors are equally dissatisfying. The vendors, obviously, can make their systems run faster, but the licensee has the responsibility to provide a suitable infrastructure on which to run the search system. In short, the “dissatisfaction” is a result of poor response time. Only licensees can “fix” this infrastructure problem. Blaming a search vendor for lousy performance is often a false claim. Notice that the functions performed within the query processing sub system are complex; for example, “on the fly” clustering, relevance ranking, and formatting. Some systems include work flow components that shape queries and results to meet the needs of particular employees or tasks. The work flow component then generates the display appropriate for the work task. Some systems “inject” search results into a third-party application so the employee has the needed information on a screen display related to the work task; for instance, a person’s investments or prior contact history.
Back to Hot Spots
Let me reiterate — I am using an older, generalized diagram. I want to identify the complexities within a representative behind-the-firewall search system. The purpose of this exercise is to allow me to comment on some general hot spots as a precursor to a quick look in a subsequent essay about specific bottle necks in subsystems.
The high level points about search system slow downs are:
- A slow down in one part of the system may be caused be a deeper issue. In many cases, the problem could be buried deep within a particular component in a sub system. Glitches in search systems can, therefore, take some time to troubleshoot. In some cases, there may be no “fix”. The engineers will have to “work around” the problem which may mean writing code. Switching to a hosted service or a search appliance may be the easiest way to avoid this problem.
- The slow down may be outside the vendor’s span of control. If you have an inadequate search system infrastructure, the vendor can advise you on what to change. But you will need the capital resources to make the change. Most slow downs in search systems are a result of the licensee’s errors in calculating CPU cycles, storage, bandwidth, and RAM. The cause of this problem is ignorance of the computational burden search systems place on their infrastructure. The fast CPUs are wonderful, but you may need clusters of servers, not one or two servers. The fix is to get outside verification of the infrastructure demands. If you can’t afford the plumbing, shift to a hosted solution or license an appliance.
- A surge in either the amount of content to index or the numbers of queries to process can individually bring a system to a half. When the two coincide, the system will choke, often failing. If you don’t have log data and you don’t review it, you will not know where to begin looking for a problem. The logs are often orphans, and their data are voluminous, hard to process, and cryptic. Get over it. Most organizations have a steady increase in content to be processed and more users sending queries to the search system despite their dissatisfaction with its performance. In this case, you will have a system that will fail and then fail again. The fix is to buckle down, manage the logs, study what’s going on in the sub systems, and act in an anticipatory way. What’s this mean? You will have to continue to build out your system when performance is acceptable. If you wait until something goes wrong, you will be in a very precarious position.,
To wrap up this discussion, you may be reeling from the ill-tasting medicine I have prescribed. Slow downs and hot spots are a fact of life with complex systems such as search. Furthermore, the complexity of the search systems in general and their sub systems in particular are essentially not fully understood by most licensees, their IT colleagues, or their management. In the first three editions of the Enterprise Search Report, I discussed this problem at length. I touch upon it briefly in Beyond Search because it is critical to the success of any search or content processing initiative. If you have different experiences from mind, please, share them via the comments function on this Web log.
I will address specific hot spots in the next day or two.
Stephen Arnold, February 21, 2008