The Content Acquisition Hot Spots

February 21, 2008

I want to take a closer look at behind-the-firewall search system bottle necks. This essay talks about the content acquisition hot spot. I want to provide some information, but I will not go into the detail that appears in Beyond Search.

Content acquisition is a core function of a search system. “Classic” search systems are designed to pull content from a server where the document resides to the storage device the spider uses to hold the new or changed content. Please, keep in mind that you will make a copy of a source document, move it over the Intranet to the spider, and store that content object on the storage device for new or changed content. The terms crawling or spidering have been used since 1993 to describe the processes for:

  • Finding new or changed information on a server or in a folder
  • Copying that information back to the search system or the crawler sub system
  • Writing information about the crawlers operation to the crawler log file.

On the surface, crawling seems simple. It’s not. Crawlers or spiders require configuration. Most vendors provide a browser-based administrative “tool” that makes is relatively easy to configure the most common settings. For example, you will want to specify how often the content acquisition sub system checks for new or changed content. You also have to “tell” the crawling sub system what servers, computers, directories, and files to acquire. In fact, the crawling sub system has a wide range of settings. Many systems allow you to provide create “rules” or special scripts to handle certain types of content; for example, you can set a specific schedule for spidering for certain servers or folders.

In the last three or four years, more search systems have made it easier for the content acquisition system to receive “pushed” content. The way “push” works is that you write a script or use the snippet of code provided by the search vendor to take certain content and copy it to a specific location on the storage device where the spider’s content resides. I can’t cover the specifics of each vendor’s “push” options, but you will find the details in the Help files, API documentation, or the FAQs for your search system.

Pull

Pull works pretty well when you have a modest amount of new or changed content every time slice. You determine the time interval between spider runs. You can make the spider aggressive and launch the sub system every 60 seconds. You can relax the schedule and check for changed content every seven days. In most organizations, running crawlers every minute can suck up available network bandwidth and exceed the capacity of the server or servers running the crawler sub system.

You now have an important insight into the reason the content acquisition sub system can become a hot spot. You can run out of machine resources, so you will have to make the crawler less aggressive. Alternatively, you can saturate the network and the crawler sub system by bringing back more content than your infrastructure can handle. Some search systems bring back content that exceeds available storage space. Your choices are stark — limit the number of servers and folders the crawling sub system indexes.

When you operate a behind-the-firewall search system, you don’t have the luxury a public Web indexing engine has. These systems can easily skip a server that times out or not revisit a server until the next spidering cycle. In an organization, you have to know what much be indexed immediately or as close to immediately as you can get. You have to acquire content from servers that may time out.

The easy fixes for crawler sub system problems are likely to create some problems for users. Users don’t understand why a document may not be findable in the search system. The reason may be that the crawler subsystem was not able to get the document back to the search system for many different reasons. Believe me, users don’t care.

The key to avoiding problems with traditional spidering boils down to knowing how much new and changed content your crawler sub system must handle at peak loads. You also must know the rate of growth for new and changed content. You need the first piece of information to specify the hardware, bandwidth, storage, and RAM you need for the server or servers handling content acquisition. The second data point gives you the information you need to upgrade your content acquisition system. You have to keep the content acquisition system sufficiently robust to handle the ever-larger amount of information generated in organizations today.

The cause of a hot spot in content acquisition is due to:

  • Insufficient resources
  • Failure to balance crawler aggressiveness with machine resources
  • Improper handling of high-latency response from certain systems whose content must be brought back to the search storage sub system for indexing.

The best fix is to do the up front work accurately and thoroughly. To prevent problems from happening, a proactive upgrade path must be designed and implemented. Routine maintenance and tuning must be routine operations, not “we’ll do it later” procedures.

Push

Push is another way to reduce the need for the content acquisition sub system to “hit” the network at inopportune times. The idea is simple, and it is designed to operate in a way directly opposite from the “pull” service that gave content “pull” a bad reputation. PointCast “pulled” content indiscriminately, causing network congestion.

The type of “pull” I am discussing is a fall out of the document inventory conducted before you deploy the first spider. You want to identify those content objects that can be copied from their host location to the content acquisition storage sub system using a crontab file or a script that triggers the transfer when [a] the new or changed data are available and [b] at off-peak times.

The idea is to keep the spiders from identifying certain content objects and then moving those files from their host location to the crawler storage device at inopportune moments.

In order to make “push” work, you need to know which content is a candidate for routine movement. You have to set up the content acquisition system to receive “pushed” content, which is usually handled via the graphical administrative interface. You need to create the script or customize the vendor-provided function to “wake up” when new of changed content arrives in a specific folder on the machine hosting the content. Then the script consults the rules for starting the “push”. The transfer occurs and the script should verify in some way that the “pushed” file was received without errors.

Many vendors of behind-the-firewall search systems support “push”. If your system does not, you can use the API to create this feature. While not trivial, a custom “push” function is a better solution than trying to get a crashed content acquisition sub system back online. You run the risk of having to reacquire the content, which can trigger another crash or saturate the network bandwidth despite your best efforts to prevent another failure.

Why You Want to Use Both Push and Pull

The optimal content acquisition sub system will use both “push” and “pull” techniques. Push can be very effective for high-priority content that must be indexed without waiting for the crawler to run a CRC, time stamp, or file size check on content.

The only way to make the most efficient use of your available resources is to designate certain content as “pull” and other content as “push”. You cannot guess. You must have accurate baseline data and update those data by consulting the crawler logs.

You will want to develop schedules for obtaining new and changed content via “push” and “pull”. You may want to take a look at the essay on this Web log about “hit boosting”, a variation on “push” content with some added zip to ensure that certain information appears in the context you want it to show up.

Where Are the Hot Spots?

If you have a single server and your content acquisition function chokes, you know the main hot spot — available hardware. You should place the crawler sub system on a separate server or servers.

The second hot spot may be the network bandwidth or lack of it when you are running the crawlers and pushing data to the content acquisition sub system. If you run out of bandwidth, you face some specific choices. No choice is completely good or bad. The choices are shades of gray; that is, you must make trade offs. I will hightly three, and you can work through the others yourself.

First, you can acquire less content less frequently. This reduces network saturation, but it increases the likelihood that users will not find the needed information. How can they? The information has not yet been brought to the search system for document processing.

Second, you can shift to “push”, de emphasizing “pull” or traditional crawling. The upside is that you can control how much content you move and when. The downside is that you may inadvertently saturate the network when you are “pushing”. Also, you will have to do the research to know what to “push” and then you have to set up or code, configure, test, debug, and deploy the system. If people have to move the content to the folder the “push” script uses, you will need to do some “human engineering”. It’s better to automate the “push” function in so far as possible.

Third, you have to set up a re-crawl schedule. Skipping servers may not be an option in your organization. Of course, if no one notices missing content, you can take your chances. I suggest knuckling down and doing the job correctly the first time. Unfortunately, short cuts and outright mistakes are very common in the content acquisition piece of the puzzle.

In short, hot spots can crop up in the crawler sub system. The causes may be human, configuration, infrastructure, or a combination of causes.

Is This Really a Big Deal?

Vendors will definitely tell you the content acquisition sub system is no big deal. You may be told, “Look we have optimized our crawler to avoid these problems” or “Look, we have made push a point-and-click option. Even my mom can set this up.”

Feel free to believe these assurances. Let me close with an anecdote. Judge for yourself about the importance of staying on top of the content acquisition sub system.

The setting is a large US government agency. The users of the system were sending search requests to an Intranet Web server. The server would ingest the request and output a list of results. No one noticed that the results were incomplete. An audit revealed that the content acquisition sub system was not correctly identifying changed content. The error caused more than four million reports to be incorrect. Remediation cost more than $10 million. Upon analyzing the problem, facts came to light that the crawler was incorrectly configured when the system was first installed, almost 18 months before the audit. In addition to the money lost, certain staff were laterally arabesqued. Few in Federal employ get fired.

Pretty exciting for a high-profile vendor, a major US agency, and the “professionals” who created this massive problem.

Now, how important is your search system’s content acquisition sub system to you?

Comments

Comments are closed.

  • Archives

  • Recent Posts

  • Meta