Indexing Hot Spots
February 29, 2008
Introduction
This is the third essay in a series on cost hot spots in behind-the-firewall search. It does not duplicate the information in Beyond Search, my new study for the Gilbane Group. This document highlights several functions or operations in an indexing subsystem that can cause slowdowns or bottlenecks. No specific vendors’ systems are referenced in this essay. I see no value in finger pointing because no indexing subsystem is without potential for performance degradation in a real-world installation. – Stephen Arnold, February 29, 2008
Indexing: Often a Mysterious Series of Multiple Operations
One of the most misunderstood parts of a behind-the-firewall search system is indexing. The term indexing itself is the problem. For most people, an index is the keyword listing that appears at the back of a book. For those hip to the ways of online, indexing means metatagging, usually in the form of a series of words or phrases assigned to a Web page or an element in a document; for example, an image and its associated text. The actual index in your search system may not be one data table. The index may be multiple tables or numeric values that “float” mysteriously within the larger search system. The “index” may not even be in one system. Parts of the index may live in different places, updated in a series of processes that cannot be easily recreated after a crash, software glitch, or other corruption.
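To make the abstraction a bit more concrete, here is a minimal, hypothetical sketch of the simplest structure a search index can take: an inverted index mapping each term to the documents and positions where it occurs. The document names are invented for illustration; production systems split this structure across many tables, shards, and servers.

```python
# Minimal sketch of an inverted index: each term maps to the documents
# (and word positions) in which it occurs. Illustrative only; real systems
# spread this structure across multiple tables and machines.
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc_id -> text. Returns term -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            index[token][doc_id].append(position)
    return index

if __name__ == "__main__":
    docs = {
        "memo-1": "index freshness matters",
        "memo-2": "the index may be multiple tables",
    }
    index = build_inverted_index(docs)
    print(dict(index["index"]))   # {'memo-1': [0], 'memo-2': [1]}
```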
Centuries ago, people lucky enough to have a “book” learned that some sort of system was needed to find a scroll, stored in a leather or clay tube, sometimes chained to the wall to keep the source document from wandering off. In the so-called Dark Ages, information was not free, nor did it flow freely. Information was something special and of high value. Today, we talk about information as a flood, a tidal wave, a problem. It is ubiquitous, without provenance, and digital. Information wants to be free, fluid, moving around, unstable, and dynamic.

For indexing to work, you have a specific object at a point in time to process; otherwise, the index is useless. The index must also be “fresh”. Fresh means that the most recent information is in the system and therefore available to users. With lots of new and changed information, you have to determine how fresh is fresh enough. Real-time data also poses a challenge. If your system can index 100 megabytes a minute but must keep up with larger volumes of new and changed data, something’s got to give. You may have to prioritize what you index: handle high-priority documents first, then shift to lower-priority documents until new higher-priority documents arrive. This triage affects the freshness of the index. The alternative is to throw more hardware at the system, increasing capital investment and operational cost.

Index freshness is important. A person in a professional setting cannot do “work” unless the digital information can be located. Once located, the information must be the “right” information. Freshness matters, but so do versions of documents. These are indexing challenges and can require considerable intellectual effort to resolve. You have to get freshness right for a search system to be useful to your colleagues. In general, the more involved your indexing, the more important the architecture and engineering of the “moving parts” in your search system’s indexing subsystem become.

Why is indexing a cost hot spot? Let’s look at some hot spots I have encountered in the last nine months.
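Before turning to those hot spots, here is a minimal sketch of the triage idea described above. The priority scheme and per-cycle capacity figure are illustrative assumptions, not features of any particular vendor’s product.

```python
# A hedged sketch of indexing triage: when the indexing subsystem cannot keep
# up, high-priority documents are processed first and lower-priority items
# wait. Priority values and capacity are illustrative only.
import heapq
import time

class IndexingQueue:
    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker so equal priorities stay first-in, first-out

    def submit(self, doc_id, priority):
        """Lower number = more urgent (e.g. 0 = press release, 9 = archive)."""
        heapq.heappush(self._heap, (priority, self._counter, time.time(), doc_id))
        self._counter += 1

    def next_batch(self, capacity_docs_per_cycle):
        """Return the most urgent documents that fit into this indexing cycle."""
        batch = []
        while self._heap and len(batch) < capacity_docs_per_cycle:
            priority, _, submitted_at, doc_id = heapq.heappop(self._heap)
            batch.append((doc_id, priority, time.time() - submitted_at))
        return batch

queue = IndexingQueue()
queue.submit("earnings_release.doc", priority=0)
queue.submit("old_newsletter.pdf", priority=5)
queue.submit("crisis_memo.msg", priority=0)
for doc_id, priority, waited in queue.next_batch(capacity_docs_per_cycle=2):
    print(f"indexing {doc_id} (priority {priority}, waited {waited:.3f}s)")
```

The trade-off the sketch makes visible is the one described above: anything that does not fit the current cycle waits, so lower-priority content grows stale unless capacity grows.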
Remediating Indiscriminate Indexing
When you deploy your behind-the-firewall search or content processing system, you have to tell the system how to process the content. You can operate an advanced system in default mode, but you may want to select certain features and levels of stringency, and make sure you are familiar with the various controls available to you. Investing time in testing prior to deployment pays off when troubleshooting. The first cost hot spot is disc thrashing or long indexing times. You come in one morning, check the logs, and learn no content was processed. In Beyond Search I talk about some steps you can take to troubleshoot this condition. If you can’t remediate the situation by rebooting the indexing subsystem, then you will have to work through the vendor’s technical support group, restore the system to a known good state, or – in some cases – reinstall the system. When you reinstall, some systems cannot use the backup index files. If you find that your backups won’t work or deliver erratic results on test queries, then you may have to rebuild the index. In a small two-person business, the time and cost are trivial. In an organization with hundreds of servers, the process can consume significant resources.
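“Erratic results on test queries” hints at a check worth scripting. The sketch below is hypothetical: it assumes a search callable that returns a hit count for a query, and a baseline of expected counts captured while the index was known to be good.

```python
# A hypothetical smoke test for a restored index: run a handful of known
# queries and compare hit counts against a baseline captured when the index
# was last known to be good. The `search` callable is an assumed stand-in for
# whatever query API your vendor exposes.

def verify_restored_index(search, baseline, tolerance=0.05):
    """search(query) -> hit count; baseline: dict of query -> expected count.
    Returns the queries whose counts drifted beyond the tolerance."""
    suspect = []
    for query, expected in baseline.items():
        actual = search(query)
        if expected == 0:
            drifted = actual != 0
        else:
            drifted = abs(actual - expected) / expected > tolerance
        if drifted:
            suspect.append((query, expected, actual))
    return suspect

if __name__ == "__main__":
    # Stand-in search function; a real test would query the search engine itself.
    fake_index = {"annual report": 120, "travel policy": 0, "press release": 315}
    baseline = {"annual report": 118, "travel policy": 42, "press release": 310}
    for query, expected, actual in verify_restored_index(fake_index.get, baseline):
        print(f"'{query}': expected ~{expected}, got {actual} -- consider a rebuild")
```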
Updating the Index or Indexes
Your search or content processing system allows you to specify how frequently the index updates. When your system has robust resources, you can specify indexing to occur as soon as content becomes available. Some vendors describe their systems as “real-time” indexing engines. If you find that your indexing engine starts to slow down, you may have encountered a “big document” problem. Indexing systems make short work of HTML pages, short PDFs, and emails. But when document size grows, the indexing subsystem needs more time to process long documents. I have encountered situations in which a Word document includes embedded objects that are large, forcing the indexing subsystem to grind away on a monster file. If you hit a patch characterized by a large number of big documents, the indexing subsystem will appear to be busy, but its output falls sharply.

Let’s assume you build your roll-out index based on a thorough document analysis. You have verified security and access controls so the “right” people see the information to which they have appropriate access. You know that the majority of the documents your system processes are in the 600-kilobyte range over the first three months of indexing subsystem operation. Suddenly the typical document size leaps to six megabytes, and big documents make up more than 20 percent of the document throughput. You may learn that the set up of your indexing subsystem or the resources available to it are hot spots.

Another situation concerns different versions of documents. Some search and content processing systems identify duplicates using date and time stamps. Other systems include algorithms to identify duplicate content and remove it or tag it so the duplicates may or may not be displayed under certain conditions. A surge in duplicates may occur when an organization is preparing for a trade show. Emails with different versions of a PowerPoint may proliferate rapidly. Obviously indexing every six-megabyte PowerPoint makes sense if each PowerPoint is different. How your indexing subsystem handles duplicates is important. A hot spot occurs when a surge of files with the same name and different date and time stamps is fed into the indexing system. The hot spot may be remediated by identifying the problem files and deleting them manually or via your system’s administrative controls. Versions of documents can become an issue under certain circumstances, such as a legal matter. Unexpected indexing subsystem behavior may be related to a duplicate file situation.

Depending on your system, you will have some fiddling to do in order to handle different versions of documents in a way that makes sense to your users. You also have to set up a de-duplication process in order to make it easy for your users to find the specific version of the document needed to perform a work task. These administrative interventions are not difficult when you know where to look for the problem. If you are not able to pinpoint a specific problem, the hunt for the hot spot can become time consuming.
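How a system spots duplicates varies by vendor. As one illustration only, the sketch below fingerprints document bytes with a hash so that renamed copies of the same PowerPoint are flagged rather than indexed again. The file names are invented; real systems typically combine hashing with date/time stamps and fuzzier similarity measures.

```python
# A minimal sketch of duplicate detection by content hash rather than by
# file name or date/time stamp. Illustrative only.
import hashlib

def content_fingerprint(data: bytes) -> str:
    """Hash the bytes of the document body, ignoring name and timestamp."""
    return hashlib.sha256(data).hexdigest()

def split_duplicates(documents):
    """documents: list of (name, bytes). Returns (unique names, duplicate pairs)."""
    seen = {}
    unique, duplicates = [], []
    for name, data in documents:
        fingerprint = content_fingerprint(data)
        if fingerprint in seen:
            duplicates.append((name, seen[fingerprint]))  # duplicate, original
        else:
            seen[fingerprint] = name
            unique.append(name)
    return unique, duplicates

docs = [
    ("tradeshow_deck_v1.ppt", b"slide content A"),
    ("tradeshow_deck_v2.ppt", b"slide content B"),      # genuinely different
    ("tradeshow_deck_final.ppt", b"slide content B"),   # same bytes, new name
]
unique, duplicates = split_duplicates(docs)
print("index these:", unique)
print("flag as duplicates:", duplicates)
```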
Common Operations Become a Problem
Once an index has been constructed – a process often called indexation – incremental updates are generally trouble free. Notice that I said generally. Let’s look at some situations that can arise, albeit infrequently.

Index Rebuild

You have a crash. The restore operation fails. You have to reindex the content. Why is this expensive? You have to plan reindexing and then babysit the update. For reindexing you will need the resources required when you performed the first indexation of your content. In addition, you have to work through the normal verifications for access, duplicates, and content processing each time you update. Whatever caused the index restore operation to fail must be remediated, a backup created when reindexing is completed, and then a test run to make sure the new backup restores correctly.

Indexing New or Changed Content

Let’s assume that you have a system, and you have been performing incremental indexes for six months with no observable problems and no red flags from users. Then users with no prior history of complaining about the search system complain that certain new documents are not in the system. Depending on your search system’s configuration, you may have a hot spot in the incremental indexing update process. The cause may be related to volume, configuration, or an unexpected software glitch. You need to identify the problem and figure out a fix. Some systems maintain separate indexes based on a maximum index size. When the index grows beyond a certain size, the system creates or allows the system administrator to create a second index. Parallelization makes it possible to query the index components with no appreciable increase in system response time; a minimal sketch of this segment-and-query approach appears after the list below. A hot spot can result when a configuration error causes an index to exceed its maximum size, halting the system or corrupting the index itself, although other symptoms may be observable. Again, the key to resolving this hot spot is often configuration and infrastructure.

Value-Added Content Processing

New search and content processing systems incorporate more sophisticated procedures, systems, and methods than systems did a few years ago. Fortunately, faster processors, 64-bit chips, and plummeting prices for memory and storage devices allow indexing systems to pile on the operations and maintain good indexing throughput: easily several megabytes a minute, or five gigabytes of content per hour or more. If you experience slowdowns in index updating, you face some stark choices when you saturate your machine capacity or storage. In my experience, these are:
- Reduce the number of documents processed
- Expand the indexing infrastructure; that is, throw hardware at the problem
- Turn off certain resource-intensive indexing operations; in effect, eliminate some of the processes that use statistical, linguistic, or syntactic functions.
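As promised above, here is a minimal sketch of the segment-and-query idea: when an index segment reaches a maximum size, new content rolls into a fresh segment, and a query fans out across the segments in parallel. The two-document cap and the thread pool are illustrative assumptions, not a description of any vendor’s implementation.

```python
# A hedged sketch of capping index size: when a segment reaches its maximum
# document count, new content rolls into a fresh segment, and queries fan out
# across all segments in parallel. Figures are illustrative only.
from concurrent.futures import ThreadPoolExecutor

class SegmentedIndex:
    def __init__(self, max_docs_per_segment):
        self.max_docs = max_docs_per_segment
        self.segments = [{}]  # each segment: doc_id -> text

    def add(self, doc_id, text):
        if len(self.segments[-1]) >= self.max_docs:
            self.segments.append({})  # roll over instead of overflowing
        self.segments[-1][doc_id] = text

    def _search_segment(self, segment, term):
        return [doc_id for doc_id, text in segment.items() if term in text.lower()]

    def search(self, term):
        # Fan the query out across segments; with parallel hardware the extra
        # segments need not increase response time appreciably.
        with ThreadPoolExecutor() as pool:
            results = pool.map(lambda seg: self._search_segment(seg, term), self.segments)
        return [doc_id for hits in results for doc_id in hits]

index = SegmentedIndex(max_docs_per_segment=2)
for i, text in enumerate(["budget memo", "travel policy", "budget forecast"]):
    index.add(f"doc-{i}", text)
print(len(index.segments), "segments:", index.search("budget"))
```

The hot spot the essay describes corresponds to the rollover step failing: if a misconfiguration lets a segment grow past its cap instead of creating a new one, the system can halt or corrupt the index.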
One of the questions that comes up frequently is, “Why are value-added processing systems more prone to slowdowns?” The answer is that when the number of documents processed goes up or the size of documents rises, the infrastructure cannot handle the load. Indexing subsystems require constant monitoring and routine hardware upgrades. Iterative systems cycle through processes two or more times, and some iterative functions depend on other processes; for example, until the linguistic processes complete, another component – for example, entity extraction – cannot be completed. Many current indexing systems are parallelized. But situations can arise in which indexing slows to a crawl because a software glitch fails to keep the internal pipelines flowing smoothly. If process A slows down, the lack of available data to process means process B waits. Log analysis can be useful in resolving this hot spot.

Crashes: Still Occur

Many modern indexing systems can hiccup and corrupt an index. One way to ride out a corrupted index is to run two systems; when one fails, the other continues to function. But many organizations can’t afford tandem operation and hot failovers. When an index corruption occurs, some organizations restore the index to a prior state. A gap may exist between the point captured in the backup and the state of the index at the time of the failure. Most systems can determine which content must be processed to “catch up”. Checking the rebuilt indexes is a useful step once a crash has occurred and the index has been restored and rebuilt. Keep in mind that backups are not foolproof. Test your system’s backup and restore procedures to make sure you can survive a crash and get the system operational again.
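The “catch up” step after a restore can be illustrated with a small, hypothetical sketch: compare the timestamp of the restored backup with the last-modified times in the content repository, and re-queue anything newer. The file names and dates are invented for the example.

```python
# A hypothetical catch-up pass after restoring an index from backup: any
# document whose source was modified after the backup's timestamp is
# re-queued for indexing to close the gap between the backup and the crash.
from datetime import datetime

def documents_to_reindex(repository, backup_timestamp):
    """repository: dict of doc_id -> last-modified datetime.
    Returns doc_ids changed after the backup was taken."""
    return [doc_id for doc_id, modified in repository.items()
            if modified > backup_timestamp]

backup_timestamp = datetime(2008, 2, 25, 2, 0)   # when the backup was taken
repository = {
    "hr_handbook.doc":   datetime(2008, 1, 14, 9, 30),
    "q1_forecast.xls":   datetime(2008, 2, 27, 16, 5),   # changed after backup
    "board_minutes.pdf": datetime(2008, 2, 28, 11, 45),  # changed after backup
}
print("re-queue for indexing:", documents_to_reindex(repository, backup_timestamp))
```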
Wrap Up
Let’s step back. The hot spots for indexing fall into three categories. First, you have to have adequate infrastructure. Ideally your infrastructure will be engineered to permit pipelined functions to operate rapidly and without latency. Second, you will want to have specific throughput targets so you can handle new and changed content, whether your vendor requires one index or multiple indexes. Third, you will want to understand how to recover from a failure and have procedures in place to restore an index or “roll back” to a known good state and then process content to ensure nothing is lost.

In general, the more value-added content processing you use, the greater your potential for hot spots. Search used to be simpler from an operational point of view. Keyword indexing is very straightforward compared to some of the advanced content processing systems in use today. The performance of any system fluctuates to some extent. As sophisticated as today’s systems are, there is room for innovation in the design, architecture, and administration of indexing subsystems. Keep in mind that more specific information appears in Beyond Search, due out in April 2008.
Stephen Arnold, February 29, 2008