MicroStrategy: TSA Swims through Data with PIMS

July 24, 2008

Government information technology makes me perspire. When a government news item renders in my news reader, I don’t pause. I want to make an exception for this one. MicroStrategy is a very intriguing company, and the fact that the firm has ramped up its services to a law enforcement agency is interesting. MicroStrategy has been working with the TSA since 2004, and the deal signed in 2006 has saved the TSA more than $100 million. The sentence that caught my attention was:

The TSA is a metrics-based organization… We [the TSA] use metrics every day to drive our decision making and quantify security effectiveness, operational efficiency and workforce management.

An example of this metrics focus: since 2004, the TSA has used PIMS to run about one million reports per year. The TSA has roughly 12,000 users of the system, and each user runs about two reports a week. That puts the TSA right in line with the Office of Management & Budget’s guidance that managers make decisions based on hard data, not hunches.

MicroStrategy, as you may know, popped in and out of the news in the 2000-2002 period. One of the items I recalled reading is here. Some former MicroStrategy professionals founded Clarabridge, a company focused on the overlap between business intelligence and content processing. You can find information about that company here.

I want to pay closer attention to MicroStrategy. Companies that can help Federal agencies save $100 million are the taxpayers’ best friends. I am interested in the MicroStrategy – Clarabridge alignment as well. Off to the library in the morning to find what I can find.

Stephen Arnold, July 24, 2008

Googzilla Swallows Telegraph Media Group

July 24, 2008

Traditional media has been my favorite whipping boy for a long time. The Telegraph Media Group may force me to rethink my critical view of companies that write stuff, print it on dead trees, and employ folks at near starvation wages to get the messy artifacts to a declining readership. Silicon.com reported here that the publisher of The Daily Telegraph, Sunday Telegraph and Weekly Telegraph, as well as the telegraph.co.uk Web site, will standardize on Google Apps: word processing, mail, collaboration. The whole shooting match.

My reading of the announcement suggests that TMG did the math and calculated that it could save a bundle. More significantly, TMG lets Google worry about software, presumably so the newspapers can worry about selling adverts. The most interesting statement in the Silicon.com write up is this remark attributed to one of TMG’s managers:

We see the levels of innovation happening in the consumer space…you can actually take advantage of within the enterprise space.

Microsoft, among other traditional software companies, is going to learn firsthand how fissionable material goes critical. A few things happen, then a few more things, and then the game changes. Is Google Apps ready to go critical?

My view: yes.

Stephen Arnold, July 24, 2008

Microsoft’s Vietnam: Search

July 21, 2008

What an interesting idea. ZDNet posted a short item that caught my attention on this 95 degree Sunday in rural Kentucky. Larry Dignan’s “Microsoft’s Search Ambitions Are Its Vietnam” appeared on the ZDNet Web logs on July 18, 2008. I suggest you read the item here. Mr. Dignan has opened a new line of analysis about the Microsoft – Google face off.

The key point in the piece for me was:

The online services business lost $1.23 billion for the fiscal year ending June 30. I [Mr. Dignan] quipped that it’s no wonder that Microsoft is so hot for Yahoo. Something has to save this online business. And what’s startling about that figure is that Microsoft only lost $732 million in 2007. Microsoft’s online services business was actually profitable in 2005.

Mr. Dignan is spot on. One point warrants further comment, however. The cost of catching Google may be beyond Microsoft’s reach. Here’s why:

  • Google continues to press forward, and Microsoft’s efforts to catch Google seem to be focused on where Google was in late 2006 or early 2007. A leap, not a catch-up, is needed.
  • Time is working for Google and against Microsoft. Each month Google continues to increase its lead in Web search. At some point, Google will dominate the market, which means that the race may be over for Web queries and online advertising.
  • Google is “seeping” into the enterprise. Microsoft seems confident that its three big revenue producers will fund the online battle with Google. I think the complexity of products like SharePoint will open the door to newcomers, not necessarily Google, by the way. Any revenue loss increases Microsoft’s vulnerability.

Agree? Disagree? Let me know your thoughts.

Stephen Arnold, July 21, 2008

Microsoft: 1999 to 2008

July 14, 2008

I have written one short post and two longer posts about Microsoft.com’s architecture for its online services. You can read each of these essays by clicking on the titles of the stories:

I want to urge each of my two or three Web log readers to validate my assertions. Not only am I an addled goose, I am an old goose. I make errors, as young wizards delight in reminding me. On Friday, July 11, 2008, two of my engineers filled some gaps in my knowledge about X++, one of Microsoft’s less well-known programming languages.

[Diagram: the perils of complexity]

The diagram shows how complexity increases when systems are designed to support solutions that do not simplify the design. Source: http://www.epmbook.com/complexity.gif

Stepping Back

As I reflected upon the information I reviewed pertaining to Microsoft.com’s online architecture, several thoughts bubbled to the surface of my consciousness:

First, I believe Microsoft’s new data centers and online architecture share DNA with those 1999 data centers. Microsoft is not embracing the systems and methods in use at Amazon, Google, and even the hapless Yahoo. Microsoft is using its own “dog food”. While commendable, the bottlenecks have not been fully resolved. Microsoft uses scale up and scale out to make systems keep pace with user expectations of response time. One engineer who works at a company competing with Microsoft told me: “Run a query on Live.com. The response times in many cases are faster than ours. The reason is that Microsoft caches everything. It works, but it is expensive.”

Second, Microsoft lacks a cohesive, modern code base. With each upgrade, legacy code and baked-in features and functions are dragged along. A good example is SQL Server. Although rewritten since the good old days with Sybase, SQL Server is not the right tool for peta-scale data manipulation chores. Alternatives exist, and Amazon and Yahoo are using them. Microsoft is sticking with its RDBMS engine, and it is very expensive to replicate, cluster, back up with standby hardware, and keep in sync. The performance challenge remains even though the user experience seems as good if not better than the competition’s. In my opinion, the reliance on this particular “dog food” is akin to building a wooden power boat with unseasoned wood.

Third, in each of the essays, Microsoft’s own engineers emphasize the cost of the engineering approaches. There is no emphasis on slashing costs. The emphasis is on spending money to get the job done. In my opinion, spending money to solve problems via the scale up and scale out approach is okay as long as there are barrels of cash to throw at the problem. The better approach, in my opinion, is to engineer solutions that make scaling and performance as economical as possible and to direct investment at finding ways to leapfrog over the well-known, long-standing problems: the Codd database model, inefficient and latency-inducing message passing, and dedicated hardware for specific functions and applications that is then replicated across clusters. And, finally, there is the practice of using more hardware that sits, in effect, like an idle railroad car until needed. What happens when the money for these expensive approaches becomes less available?


The Economics of Dealing with Complex Information

May 24, 2008

Microsoft announced via its Live Search blog that its Live Search Books and Live Search Academic are “taken down”. Google’s book digitization and journal project caused concern to the commercial database vendors. Google, with its generous cash flow and avowed goal of indexing “all the world’s information” seemed to sign the death warrants of such companies as Dialog, Ebsco, and ProQuest, among others. A flap of the wings to Techmeme for its related links.

The economics of doing anything significant with complex information are not taught in the ivory towers at Harvard, Stanford, and Yale. Google–indifferent to the brutal economics that hobble commercial database publishers–has the cash to figure out how to use software to do tasks usually done by humans. For example, Google has figured out how to scan a book, have software determine what should be converted to ASCII, and generate a reasonably clean, searchable text file. The page images are mostly linked to the corresponding text references. Not so for most database producers. These decisions still require humans, often working in exotic locations where labor is less expensive than in Ann Arbor, Boston, and Denver.

Google also has figured out how to take content, apply structure to it, create a variety of additional index terms (metadata), and convert the whole shebang into easily manipulated numerical representations. Not so with the mainstream commercial database publishers. Tagging, cross referencing, and content clean up still take expensive humans.

Manipulating the information in books and journals is, for commercial database producers, very expensive. Many costs are difficult to reduce. Google, on the other hand, has invested over the last decade to find software solutions to these intractable cost problems. Fortunately for the commercial database publishers, Google so far has been content to process books and journals. Google finds access to weighty tomes useful for a variety of purposes. I haven’t heard that these motive forces are related to revenue. Google appears to be casual about the cost of its books and journals project. If you aren’t familiar with Google Books, navigate to http://books.google.com. For Google Scholar, go to http://scholar.google.com.

Enter Microsoft. The company jumped to index books and journals. Now it is climbing out of the swamp of costs. Unlike Google, Microsoft faces–maybe for the first time in the company’s history–a need to focus its technical and financial resources. Google keeps on scanning and indexing documents about hyperbolic geometry. Microsoft can’t and no longer will.

For me the most telling statement in the announcement is:

Given the evolution of the Web and our strategy, we believe the next generation of search is about the development of an underlying, sustainable business model for the search engine, consumer, and content partner. For example, this past Wednesday we announced our strategy to focus on verticals with high commercial intent, such as travel, and offer users cash back on their purchases from our advertisers. With Live Search Books and Live Search Academic, we digitized 750,000 books and indexed 80 million journal articles. Based on our experience, we foresee that the best way for a search engine to make book content available will be by crawling content repositories created by book publishers and libraries. With our investments, the technology to create these repositories is now available at lower costs for those with the commercial interest or public mandate to digitize book content. We will continue to track the evolution of the industry and evaluate future opportunities.

Here’s how I read this. First, the reference to next-generation search is about making money with a business model. In short, next-generation search is not about moving beyond traditional metadata, pushing into data management, and creating new types of user experiences. Search at Microsoft means money.

Second, Microsoft wants to index what’s available. That’s certainly less costly than fiddling with the train schedules that Google has indexed at Oxford University. In my experience, indexing what is already available begs for applications that move beyond what I can do at my local library or with a search engine such as Exalead.com or a metasearch system such as Vivisimo’s Clusty.com.

Third, the notion of tracking and looking for future opportunities does not convince me that Microsoft knows what it will do tomorrow. And whatever the company does, by definition, will be reactive.

Microsoft’s termination of this service means that the status quo in the commercial database world will be subject to pressure from Google. More troubling is that Google’s technical papers and its patent documents reveal that the company is moving beyond key word search at an increasing pace. I think that it is significant that Microsoft is husbanding its resources. Now I want to read in a Microsoft Web log about an innovation path that will permit the company to leap frog over Google. Send me a link to this information, and you will receive a gentle quack.

Stephen Arnold, May 24, 2008

Search: The Three Curves of Despair

March 27, 2008

For my 2005 seminar series “Search: How to Deliver Useful Results within Budget”, I created a series of three line charts. One of the well-kept secrets about behind-the-firewall search is that costs are difficult, if not impossible, to control. That presentation is not available on my Web site archive, and I’m not sure I have a copy of the PowerPoint deck at hand. I did locate the Excel sheet for the chart which appears below. I thought it might be useful to discuss the data briefly and admittedly in an incomplete way. (I sell information for a living, so I instinctively hold some back to keep the wolves from my log cabin’s door here in rural Kentucky.)

Let me be direct: Well-dressed MBAs and sallow financial mavens simply don’t believe my search cost data.

At my age, I’m used to this type of uninformed skepticism or derisory denial. The information technology professionals attending my lectures usually smirk the way I once did as a callow nerd. Their reaction is understandable. And I support myself by my wits. When these superstars lose their jobs, my flabby self is unscathed. My children are grown. The domicile is safe from creditors. I’m offering information, not re-jigging inflated egos.

Now scan these three curves.

[Chart: the three curves of despair]

© Stephen E. Arnold, 2002-2008.

You see a gray line. That is the precision/recall curve. It refers to one method for determining whether a query returns results germane to the user’s query and another method for figuring out how much germane information the search system missed. Search and a categorical affirmative such as “all” do not make happy bedfellows. Most folks don’t know what a search system does not include. Could that be one reason why the “curves of despair” evoke snickers of disbelief?
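For readers who have not run these measurements, here is a minimal sketch of the arithmetic behind that gray line. The result list and the relevance judgments below are invented for illustration; in real life, the relevance judgments are the expensive part to obtain.

```python
# Precision: fraction of returned results that are relevant.
# Recall: fraction of all relevant documents that were returned.
# Document identifiers are invented for illustration.

def precision_recall(returned, relevant):
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# A query returns ten documents; the collection holds eight relevant ones.
returned = ["d01", "d02", "d03", "d04", "d05", "d06", "d07", "d08", "d09", "d10"]
relevant = ["d02", "d04", "d05", "d11", "d12", "d13", "d14", "d15"]

p, r = precision_recall(returned, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.30 recall=0.38
```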

Indexing Hot Spots

February 29, 2008

Introduction

This is the third in a series on cost hot spots in behind-the-firewall search. This essay does not duplicate the information in Beyond Search, my new study for the Gilbane Group. This document is designed to highlight several functions or operations in an indexing subsystem that can cause system slow downs or bottlenecks. No specific vendors’ systems are referenced in this essay. I see no value in finger pointing because no indexing subsystem is without potential for performance degradation in a real world installation. – Stephen Arnold, February 29, 2008

Indexing: Often a Mysterious Series of Multiple Operations

One of the most misunderstood parts of a behind-the-firewall search system is indexing. The term indexing itself is the problem. For most people, an index is the key word listing that appears at the back of a book. For those hip to the ways of online, indexing means metatagging, usually in the form of a series of words or phrases assigned to a Web page or an element in a document; for example, an image and its associated text. The actual index in your search system may not be one data table. The index may be multiple tables or numeric values that “float” mysteriously within the larger search system. The “index” may not even be in one system. Parts of the index are in different places, updated in a series of processes that cannot be easily recreated after a crash, software glitch, or other corruption. This CartoonStock.com image makes clear the impact of a search system crash.

Centuries ago, people lucky enough to have a “book” learned that some sort of system was needed to find a scroll, stored in a leather or clay tube, sometimes chained to the wall to keep the source document from wandering off. In the so-called Dark Ages, information was not free, nor did it flow freely. Information was something special and of high value. Today, we talk about information as a flood, a tidal wave, a problem. It is ubiquitous, without provenance, and digital. Information wants to be free, fluid, moving around, unstable, dynamic. For indexing to work, you have a specific object at a point in time to process; otherwise, the index is useless.

Also, the index must be “fresh”. Fresh means that the most recent information is in the system and therefore available to users. With lots of new and changed information, you have to determine how fresh is fresh enough. Real-time data also presents a challenge. If your system can index 100 megabytes a minute but must keep up with larger volumes of new and changed data, something’s got to give. You may have to prioritize what you index: handle high-priority documents first, then shift to lower-priority documents until new higher-priority documents arrive. This triage affects the freshness of the index. Alternatively, you can throw more hardware at your system, thus increasing capital investment and operational cost.

Index freshness is important. A person in a professional setting cannot do “work” unless the digital information can be located. Once located, the information must be the “right” information. Freshness matters, but there are also issues of versions of documents. These are indexing challenges and can require considerable intellectual effort to resolve. You have to get freshness right for a search system to be useful to your colleagues. In general, the more involved your indexing, the more important the architecture and engineering of the “moving parts” in your search system’s indexing subsystem become.

Why is indexing a cost hot spot? Let’s look at some hot spots I have encountered in the last nine months.
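The triage just described is, in effect, a priority queue: high-priority documents get processed in each indexing cycle, and lower-priority material waits until capacity frees up. A minimal sketch follows; the priorities and the per-cycle capacity are invented for illustration, not taken from any vendor’s system.

```python
import heapq

# Each queued item: (priority, arrival_order, document). Lower number = higher priority.
queue = []
counter = 0

def enqueue(doc, priority):
    global counter
    heapq.heappush(queue, (priority, counter, doc))
    counter += 1

def index_cycle(capacity_docs):
    """Process up to capacity_docs documents, highest priority first."""
    processed = []
    for _ in range(min(capacity_docs, len(queue))):
        _, _, doc = heapq.heappop(queue)
        processed.append(doc)
    return processed

enqueue("press_release.docx", priority=1)    # must be fresh
enqueue("archive_scan_001.pdf", priority=5)  # can wait
enqueue("ceo_memo.msg", priority=1)
print(index_cycle(capacity_docs=2))  # ['press_release.docx', 'ceo_memo.msg']
```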

Remediating Indiscriminate Indexing

When you deploy your behind-the-firewall search or content processing system, you have to tell your system how to process the content. You can operate an advanced system in default mode, but you may want to select certain features and levels of stringency, and make sure that you are familiar with the various controls available to you. Investing time in testing prior to deployment may be useful when troubleshooting. The first cost hot spot is encountering disc thrashing or long indexing times. You come in one morning, check the logs, and learn no content was processed. In Beyond Search I talk about some steps you can take to troubleshoot this condition. If you can’t remediate the situation by rebooting the indexing subsystem, then you will have to work through the vendor’s technical support group, restore the system to a known good state, or – in some cases – reinstall the system. When you reinstall, some systems cannot use the backup index files. If you find that your backups won’t work or deliver erratic results on test queries, then you may have to rebuild the index. In a small two-person business, the time and cost are trivial. In an organization with hundreds of servers, the process can consume significant resources.

Updating the Index or Indexes

Your search or content processing system allows you to specify how frequently the index updates. When your system has robust resources, you can specify indexing to occur as soon as content becomes available. Some vendors talk about their systems as “real time” indexing engines. If you find that your indexing engine starts to slow down, you may have encountered a “big document” problem. Indexing systems make short work of HTML pages, short PDFs, and emails. But when document size grows, the indexing subsystem needs more “time” to process long documents. I have encountered situations in which a Word document includes objects that are large. The Word document requires the indexing subsystem to grind away on this monster file. If you hit a patch characterized by a large number of big documents, the indexing subsystem will appear to be busy, but its output falls sharply.

Let’s assume you build your roll-out index based on a thorough document analysis. You have verified security and access controls so the “right” people see the information to which they have appropriate access. You know that the majority of the documents your system processes are in the 600 kilobyte range over the first three months of indexing subsystem operation. Suddenly the document size leaps to six megabytes and the number of big documents becomes more than 20 percent of the document throughput. You may learn that the setup of your indexing subsystem or the resources available are hot spots.

Another situation concerns different versions of documents. Some search and content processing systems identify duplicates using date and time stamps. Other systems include algorithms to identify duplicate content and remove it or tag it so the duplicates may or may not be displayed under certain conditions. A surge in duplicates may occur when an organization is preparing for a trade show. Emails with different versions of a PowerPoint may proliferate rapidly. Obviously, indexing every six megabyte PowerPoint makes sense if each PowerPoint is different. How your indexing subsystem handles duplicates is important. A hot spot occurs when a surge of files with the same name and different date and time stamps is fed into the indexing system. The hot spot may be remediated by identifying the problem files and deleting them manually or via your system’s administrative controls. Versions of documents can become an issue under certain circumstances such as a legal matter. Unexpected indexing subsystem behavior may be related to a duplicate file situation.

Depending on your system, you will have some fiddling to do in order to handle different versions of documents in a way that makes sense to your users. You also have to set up a de-duplication process in order to make it easy for your users to find the specific version of the document needed to perform a work task. These administrative interventions are not difficult when you know where to look for the problem. If you are not able to pinpoint a specific problem, the hunt for the hot spot can become time consuming.
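The difference between date-and-time-stamp duplicate detection and content-based duplicate detection is easy to see in a few lines. Here is a minimal sketch using invented file records; it is not any vendor’s actual de-duplication method.

```python
import hashlib

def content_key(data: bytes) -> str:
    # Identical bytes produce the same key, regardless of file name or date stamp.
    return hashlib.sha256(data).hexdigest()

# Invented records: the same trade-show PowerPoint mailed around twice,
# plus a genuinely revised version carrying the same file name.
files = [
    ("booth_deck.pptx", "2008-02-20 09:00", b"slides v1"),
    ("booth_deck.pptx", "2008-02-21 14:30", b"slides v1"),  # exact duplicate, new timestamp
    ("booth_deck.pptx", "2008-02-22 08:15", b"slides v2"),  # a real new version
]

seen = {}
for name, stamp, data in files:
    key = content_key(data)
    if key in seen:
        print(f"{name} ({stamp}) duplicates {seen[key]}; skip or tag it")
    else:
        seen[key] = f"{name} ({stamp})"
        print(f"{name} ({stamp}) is new content; index it")
```

A system keying only on name plus date and time stamp would treat all three records as distinct and grind through the same slides twice.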

Common Operations Become a Problem

Once an index has been constructed – a process often called indexation – incremental updates are generally trouble free. Notice that I said generally. Let’s look at some situations that can arise, albeit infrequently.

Index Rebuild

You have a crash. The restore operation fails. You have to reindex the content. Why is this expensive? You have to plan reindexing and then babysit the update. For reindexing you will need the resources required when you performed the first indexation of your content. In addition, you have to work through the normal verifications for access, duplicates, and content processing each time you update. Whatever caused the index restore operation to fail must be remediated, a backup created when reindexing is completed, and then a test run to make sure the new backup restores correctly.

Indexing New or Changed Content

Let’s assume that you have a system, and you have been performing incremental indexing for six months with no observable problems and no red flags from users. Then users with no prior history of complaining about the search system complain that certain new documents are not in the system. Depending on your search system’s configuration, you may have a hot spot in the incremental indexing update process. The cause may be related to volume, configuration, or an unexpected software glitch. You need to identify the problem and figure out a fix. Some systems maintain separate indexes based on a maximum index size. When the index grows beyond a certain size, the system creates or allows the system administrator to create a second index. Parallelization makes it possible to query index components with no appreciable increase in system response time. A hot spot can result when a configuration error causes an index to exceed its maximum size, halting the system or corrupting the index itself, although other symptoms may be observable. Again – the key to resolving this hot spot is often configuration and infrastructure.

Value-Added Content Processing

New search and content processing systems incorporate more sophisticated procedures, systems, and methods than systems did a few years ago. Fortunately, faster processors, 64-bit chips, and plummeting prices for memory and storage devices allow indexing systems to pile on the operations and maintain good indexing throughput, easily several megabytes a minute to five gigabytes of content per hour or more. If you experience slowdowns in index updating, you face some stark choices when you saturate your machine capacity or storage. In my experience, these are:

  • Reduce the number of documents processed
  • Expand the indexing infrastructure; that is, throw hardware at the problem
  • Turn off certain resource intensive indexing operations; in effect, eliminating some of the processes that use statistical, linguistic, or syntactic functions.

One of the questions that comes up frequently is, “Why are value-added processing systems more prone to slowdowns?” The answer is that when the number of documents processed goes up or the size of documents rises, the infrastructure cannot handle the load. Indexing subsystems require constant monitoring and routine hardware upgrades.

Iterative systems cycle through processes two or more times. Some iterative functions are dependent on other processes; for example, until the linguistic processes complete, another component – for example, entity extraction – cannot be completed. Many current indexing systems can be parallelized. But situations can arise in which indexing slows to a crawl because a software glitch fails to keep the internal pipelines flowing smoothly. If process A slows down, the lack of available data to process means process B waits. Log analysis can be useful in resolving this hot spot.

Crashes Still Occur

Many modern indexing systems can hiccup and corrupt an index. The way to fix a corrupt index is to have two systems. When one fails, the other system continues to function. But many organizations can’t afford tandem operation and hot failovers. When an index corruption occurs, some organizations restore the index to a prior state. A gap may exist between the point of the backup and the state of the index at the time of the failure. Most systems can determine which content must be processed to “catch up”. Checking the rebuilt indexes is a useful step to take when a crash has occurred and the index has been restored and rebuilt. Keep in mind that backups are not foolproof. Test your system’s backup and restore procedures to make sure you can survive a crash and get the system operational again.
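The “catch up” step after a restore amounts to finding everything that changed after the backup was taken and re-queuing it for indexing. Here is a minimal sketch, assuming the content sits in an ordinary directory tree and the backup timestamp is known; real systems consult their own crawl or transaction logs, and the repository path below is a placeholder.

```python
import os
import time

def content_to_reprocess(content_root, backup_time):
    """Yield files modified after the index backup was taken."""
    for dirpath, _dirnames, filenames in os.walk(content_root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            if os.path.getmtime(full_path) > backup_time:
                yield full_path

# backup_time would come from your backup job's records; here it is a stand-in.
backup_time = time.time() - 7 * 24 * 3600  # pretend the last good backup is a week old
for path in content_to_reprocess("/data/repository", backup_time):
    print("re-queue for indexing:", path)
```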

Wrap Up

Let’s step back. The hot spots for indexing fall into three categories. First, you have to have adequate infrastructure. Ideally your infrastructure will be engineered to permit pipelined functions to operate rapidly and without latency. Second, you will want to have specific throughput targets so you can handle new and changed content whether your vendor requires one index or multiple indexes. Third, you will want to understand how to recover from a failure and have procedures in place to restore an index or “roll back” to a known good state and then process content to ensure nothing is lost.

In general, the more value-added content processing you use, the greater your potential for hot spots. Search used to be simpler from an operational point of view. Keyword indexing is very straightforward compared to some of the advanced content processing systems in use today. The performance of any system fluctuates to some extent. As sophisticated as today’s systems are, there is room for innovation in the design, architecture, and administration of indexing subsystems. Keep in mind that more specific information appears in Beyond Search, due out in April 2008.

Stephen Arnold, February 29, 2008

Document Processing: Transformation Hot Spots

February 23, 2008

Let’s pick up the thread of sluggish behind-the-firewall search systems. I want to look at one hot spot in the sub system responsible for document processing. Recall that the crawler sub system finds or accepts information. The notion of “find” relates to a crawler or spider able to identify new or changed information. The spider copies the information back to the content processing sub system. For the purposes of our discussion, we will simplify spidering to the find-and-send back approach. The other way to get content to the search system is to push it. The idea is that a small program wakes up when new or changed content is placed in a specific location on a server. The script “pushes” the content — that is, copies the information — to a specific storage area on the content processing sub system. So, we’re dealing with pushing or pulling content. The diagram to which these comments refer is here.

Now what happens?

There are many possible functions a vendor can place in the document processing subsystem. I want to focus on one key function — content transformation. Content transformation takes a file — let’s say a PowerPoint — and creates a version of this document in an XML structure “known” to the search system. The idea is that a number of different file types are found in an organization. These can range from garden variety Word 2003 files to the more exotic XyWrite files still in use at certain US government agencies. (Yes, I know that’s hard to believe because you may not know what XyWrite is.)

Most search system vendors say, “We support more than 200 different file types.” That’s true. Over the years, scripts that convert a source file of one type into an output file of another type have been written. Years ago, there were independent firms doing business as Data Junction and Outside In. These two companies, along with dozens of others, have been acquired. A vendor can license these tools from their owners. Also, there are a number of open source conversion and transformation tools available from SourceForge, shareware repositories, and freeware distributors. However, a number of search system vendors will assert, “We wrote our own filters.” This is usually a way to differentiate their transformation tools from a competitor’s. The reality is that most vendors use a combination of licensed tools, open source tools, and home-grown tools. The key point is the answer to two questions:

  1. How well do these filters or transformation routines work on the specific content you want to have the search system make searchable?
  2. How fast do these systems operate on the specific production machines you will use for content transformation?

The only way to answer these two questions with accuracy is to test the transformation throughput on your content and on the exact machines you will use in production. Any other approach will create a general throughput rate value that your production system may or may not be able to deliver. Isn’t it better to know what you can transform before you start processing content for real?
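A throughput test of this kind does not require anything elaborate. Here is a minimal sketch; convert_to_text is a stand-in for whichever licensed, open source, or home-grown filter you actually use, and sample_documents is a folder of your own representative content sitting on the production machine.

```python
import os
import time

def convert_to_text(path):
    # Placeholder for the real transformation filter; swap in the tool you license.
    with open(path, "rb") as handle:
        return handle.read().decode("utf-8", errors="ignore")

def measure_throughput(sample_dir):
    """Return megabytes per minute for the filter on your own content."""
    total_bytes = 0
    start = time.time()
    for name in os.listdir(sample_dir):
        path = os.path.join(sample_dir, name)
        if os.path.isfile(path):
            convert_to_text(path)
            total_bytes += os.path.getsize(path)
    elapsed_minutes = (time.time() - start) / 60
    return (total_bytes / (1024 * 1024)) / elapsed_minutes if elapsed_minutes else 0.0

print("MB per minute:", measure_throughput("sample_documents"))
```

Run the same test on the vendor’s demo machine and on your production server; the difference between the two numbers is the point.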

I’ve just identified the two reasons for unexpected bottlenecks and, hence, poor document processing performance. First, you have content that the vendor’s filters cannot handle. When a document processing sub system can’t figure out how to transform a file, it writes the file name, date, time, size, and maybe an error code in the document processing log. If you have too many rejected files, you have to intervene, figure out the problem with the files, and then take remedial action. Remedial action may mean rekeying the file or going through some manual process of converting the file from its native format to a neutral format like ASCII, doing some manual touch-up like adding subheads or tags, and then putting the touched-up file into the document processing queue. Talk about a bottleneck. In most organizations, there is neither money nor people to do this work. Fixing the content transformation problems can take days or weeks, or the work may never be done at all. Not surprisingly, a system that can’t process the content cannot make that content available to the system users. This glitch is a trivial problem when you are first deploying a system because you don’t have much knowledge of what will be transformed and what won’t. Imagine the magnitude of the problem when a transformation problem is discovered after the system is up and running. You may find log files overwriting themselves. You may find “out of space” messages in the folder used by the system to write files that can’t be transformed. You may find intermittent errors cascading back through the content acquisition system due to transformation issues. Have you looked at your document processing log files today?
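Looking at the log files pays off quickly. The sketch below assumes a hypothetical comma-separated exception log with file name, date, time, size, and error code on each line; no particular vendor writes exactly this format. The point is simply to see which file types the filters are rejecting most often.

```python
import collections
import csv
import os

def summarize_rejections(log_path):
    """Count rejected files by extension from a hypothetical exception log."""
    by_extension = collections.Counter()
    with open(log_path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) < 5:
                continue  # skip malformed lines
            filename = row[0]
            extension = os.path.splitext(filename)[1].lower() or "(none)"
            by_extension[extension] += 1
    return by_extension

for extension, count in summarize_rejections("docproc_exceptions.log").most_common():
    print(f"{count:6d} rejected {extension} files")
```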

The second problem has to do with document processing hardware. In my experience, exactly zero of the organizations with which I am familiar have run pre-deal tests on the exact hardware that will be used in production document processing. The exceptions are the organizations licensing appliances. The appliance vendors deliver hardware with a known capacity. Appliances, however, comprise less than 15 percent of the installed base of behind-the-firewall search systems. Most organizations’ information technology departments think that vendor estimates are good enough. Furthermore, most information technology groups believe that existing hardware and infrastructure are adequate for a search application. What happens? The system goes into operation and runs along until the volume of content to be processed exceeds available resources. When that happens, the document processing sub system slows to a crawl or hangs.

Performance Erosion

Document processing is not a set-it-and-forget-it sub system. Let’s look at why you need to invest the time engineering, testing, monitoring, and upgrading the document processing sub system. I know, before I summarize the information from my files, that few, if any, readers of this Web log will take these actions. I must admit that indifference to the document processing sub system generates significant revenue for consultants, but so many hassles can be avoided by taking some simple preventive actions. Sigh.

Let’s look at the causes of performance erosion:

  1. The volume of content is increasing. Most organizations whose digital content production volume I have analyzed double their digital content every 12 months. This means that if one employee has five megabytes of new content when you turn on the system, then 12 months later you will have that original five megabytes in the index plus another five megabytes of new content from the same employee, for a total of 10 megabytes. (A back-of-the-envelope projection appears after this list.) No big deal, right? Storage is cheap. It is a big deal when you are working in an organization with constraints on storage, an inability to remove duplicate content from the index, and an indiscriminate content acquisition process. Some organizations can’t “plug in” new storage the way you can on a PC or Mac. Storage must be ordered, installed, and certified. In the meantime, what happens? The document processing system falls behind. Can it catch up? Maybe. Maybe not.
  2. The content is not new. Employees recycle, save different drafts of documents, and merge pieces of boilerplate text to create new documents. Again, if you work on one PowerPoint, indexing that PowerPoint is no problem. But when you have many PowerPoints, each with minor changes, plus email messages like “Take a look at this and send me your changes”, you can index the same content again and again. A results list is then not just filled with irrelevant hits; the basic function of search and retrieval is broken. Does your search system return a results list of what look like the same document with different date, time, and size values? How do you determine which version of the document is the “best and final” one? What are the risks of using the incorrect version of a document? How much does your organization spend on figuring out which version of a document is the “one” the CEO really needs?
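Here is the back-of-the-envelope projection promised in point one. The head count and per-employee figure are invented; the doubling assumption is the one stated above.

```python
# Content volume doubling every 12 months, per the estimate above.
employees = 1000
megabytes_per_employee_at_start = 5

volume_mb = employees * megabytes_per_employee_at_start
print(f"at start: ~{volume_mb / 1024:.1f} GB to acquire, process, and store")
for year in range(1, 4):
    volume_mb *= 2  # new content roughly equals everything already there
    print(f"after year {year}: ~{volume_mb / 1024:.1f} GB")
```

Five megabytes per person sounds trivial; roughly 40 gigabytes of organization-wide content three years later, every byte of which must be acquired, transformed, and indexed, is not.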

As you wrestle with these questions, recall that you are shoving more content through a system which, unless constantly upgraded, will slow to a crawl. You have set the stage for thrashing. The available resources are being consumed processing the same information again and again, not processing the meaningful documents one time and again only when a significant change is made. Ah, you don’t know which documents are meaningful? You are now like the snake eating its tail. Because you don’t have an editorial policy or content acquisition procedures in place, you have found that the slowdown in document processing is nothing more than a consequence of an earlier misstep. So, no matter what you do to “fix” document processing, you won’t be able to get your search system working the way users want it to. Pretty depressing? Furthermore, senior management doesn’t understand why throwing money at a problem in document processing doesn’t have any significant payoff to justify the expense.

XML and Transformation

I’m not sure I can name a search vendor who does not support XML. XML is an incantation. Say it enough times, and I guess it becomes the magic fix for whatever ailments a content processing system has.

Let me give you my view of this XML baloney. First, XML or extensible markup language is not a panacea. XML is, at its core, a programmatic approach to content. How many of you reading this column program anything in any language? Darn few. So the painful truth is, you don’t know how to “fix” or “create” a valid XML instance, but you sure sound great when you chatter about XML.

Second, XML is a simplified version of SGML, which is in turn a descendant of CALS (computer-aided logistics system), spawned by our insightful colleagues in the US government to deal with procurement. Lurking behind a nice Word document in the “new” DOCX format is a DTD, document type definition. But out of sight, out of mind, correct? Unfortunately, no.

Third, XML is like an ancient Roman wall in 25 BCE. The smooth surface conceals a heck of a lot of rubble between some rigid structures made of irregular brick or stone. This means that taking a “flavor” of XML and converting it to the XML that your search system understands is a programmatic process. This notion of converting a source file like a WordPerfect document into an XML version that the search system can use is pretty darn complicated. When it goes wacky, it’s just like debugging any other computer program. Who knows how easy or hard it will be to find and fix the error? Who knows how long it will take? Who knows how much it will cost? I sure don’t.
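Stripped of the mystery, “transformation” means building a record in the XML dialect your search system expects from whatever the filter managed to extract. A minimal sketch appears below; the element names are invented for illustration and do not correspond to any vendor’s schema.

```python
import xml.etree.ElementTree as ET

def to_search_record(doc_id, title, body_text, source_format):
    # Element and attribute names here are illustrative, not a real vendor schema.
    record = ET.Element("record", attrib={"id": doc_id, "source-format": source_format})
    ET.SubElement(record, "title").text = title
    ET.SubElement(record, "body").text = body_text
    return ET.tostring(record, encoding="unicode")

print(to_search_record("doc-0042", "Q1 Budget Memo",
                       "Text extracted from the WordPerfect source file...",
                       "wordperfect"))
```

The hard part is not emitting the XML; it is getting clean text and metadata out of the source file in the first place, which is exactly where the exceptions pile up.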

If we take these three comments and think about them, it’s evident that this document transformation can chew up some computer processing cycles. If a document can’t be transformed, the exception log can grow. Dealing with these exceptions is not something one does in a few spare minutes between meetings.

Nope.

XML is work, which when done properly, greatly increases the functionality of indexing sub systems. When done poorly, XML is just another search system nightmare.

Stepping Back

How can you head off these document processing / transformation challenges?

The first step is knowing about them. If your vendor has educated you, great. If you have learned from the school of hard knocks, that’s probably better. If you have researched search and ingested as much other information as you can, you go to the head of the class.

An increasing number of organizations are solving this throughput problem by: [a] ripping and replacing the incumbent search system. At best, this is a temporary fix; [b] shifting to an appliance model. This works pretty well, but you have to keep adding appliances to keep up with content growth and the procedure and policy issues will surface again unless addressed before the appliance is deployed; [c] shifting to a hosted solution. This is an up-and-coming fix because it outsources the problem and slithers away from the capital investment on-premises installations require.

Notice that I’m not suggesting slapping an adhesive bandage on your incumbent search system. A quick fix is not going to do much more than buy time. In Beyond Search, I go into some depth about vendors who can “wrap” your ailing search system with a life-support system. This approach is much better than a quick fix, but you will have to address the larger policy and procedural issues to make this hybrid solution work over the long term.

You are probably wondering how transforming a bunch of content can become such a headache. You have just learned something about the “hidden secrets” of behind-the-firewall search. You have to dig into a number of murky, complex areas before you make your search system “live.”

I think the following checklist has not been made available without charge before. You may find it useful, and if I have left something out, please, let me know via the comments function on this Web log.

  • How much information in what format must the search system acquire and transform on a monthly and annual basis?
  • What percent of the transformation is for new content? How much for changed content?
  • What percent of content that must be processed exists in what specific file types? Does our vendor’s transformation system handle this source material? What percent of documents cannot be handled?
  • What filters must be modified, tested, and integrated into the search systems?
  • What is the administrative procedure for dealing with [a] exceptions and [b] new file types such as an email with an unrecognized attachment?
  • What is the mechanism for determining what content is a valid version and which content is a duplication? What pre-indexing process must be created to minimize system cycles needed to identify duplicate content; that is, how can I get my colleagues to flag only content that should be indexed before the content is acquired by the document processing system?
  • What is the upgrade plan for the document processing sub system?
  • What content will not be processed if the document processing sub system slows? What is the procedure for processing excluded content when the document processing subsystem again has capacity?
  • What is the financial switch over point from on-premises search to an appliance or a hosted / managed service model?
  • What is the triage procedure when a document processing sub system degrades to an unacceptable level?
  • What’s the XML strategy for this search system? What does the vendor do to fix issues? What are my contingency plans and options when a problem becomes evident?

In another post, I want to look at hot spots in indexing. What’s intriguing is that so far we have brought content to the search system storage devices or had it pushed there. We have normalized content and written that content, in a form the indexing system can understand, to the storage sub system. Is anyone keeping track of how many instances of a document we have in the search system at any one time? We need that number. If we run out of storage, we’re dead in the water.

This behind-the-firewall search is a no-brainer. Believe it or not, a senior technologist at a 10,000-person organization told me in late 2007, “Search is not that complicated.” That’s a guy who really knows his information retrieval limits!

Stephen Arnold, February 23, 2008

The Content Acquisition Hot Spots

February 21, 2008

I want to take a closer look at behind-the-firewall search system bottlenecks. This essay talks about the content acquisition hot spot. I want to provide some information, but I will not go into the detail that appears in Beyond Search.

Content acquisition is a core function of a search system. “Classic” search systems are designed to pull content from a server where the document resides to the storage device the spider uses to hold the new or changed content. Please, keep in mind that you will make a copy of a source document, move it over the Intranet to the spider, and store that content object on the storage device for new or changed content. The terms crawling or spidering have been used since 1993 to describe the processes for:

  • Finding new or changed information on a server or in a folder
  • Copying that information back to the search system or the crawler sub system
  • Writing information about the crawler’s operation to the crawler log file.

On the surface, crawling seems simple. It’s not. Crawlers or spiders require configuration. Most vendors provide a browser-based administrative “tool” that makes it relatively easy to configure the most common settings. For example, you will want to specify how often the content acquisition sub system checks for new or changed content. You also have to “tell” the crawling sub system what servers, computers, directories, and files to acquire. In fact, the crawling sub system has a wide range of settings. Many systems allow you to create “rules” or special scripts to handle certain types of content; for example, you can set a specific schedule for spidering certain servers or folders.
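Whatever the administrative interface looks like, the settings usually reduce to a handful of rules per content source. Here is a minimal sketch of what such a configuration might capture; the field names and paths are invented, not any vendor’s actual schema.

```python
# Illustrative crawler configuration; field names and paths are invented.
crawl_rules = [
    {
        "source": r"\\fileserver01\contracts",
        "schedule_minutes": 60,        # check for new or changed content hourly
        "include": ["*.doc", "*.docx", "*.pdf"],
        "exclude": ["*draft*", "~$*"],
    },
    {
        "source": r"\\fileserver02\press",
        "schedule_minutes": 5,         # high-priority folder, aggressive schedule
        "include": ["*"],
        "exclude": [],
    },
]

for rule in crawl_rules:
    print(f"{rule['source']}: every {rule['schedule_minutes']} minutes, "
          f"include {rule['include']}, exclude {rule['exclude']}")
```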

In the last three or four years, more search systems have made it easier for the content acquisition system to receive “pushed” content. The way “push” works is that you write a script or use the snippet of code provided by the search vendor to take certain content and copy it to a specific location on the storage device where the spider’s content resides. I can’t cover the specifics of each vendor’s “push” options, but you will find the details in the Help files, API documentation, or the FAQs for your search system.

Pull

Pull works pretty well when you have a modest amount of new or changed content every time slice. You determine the time interval between spider runs. You can make the spider aggressive and launch the sub system every 60 seconds. You can relax the schedule and check for changed content every seven days. In most organizations, running crawlers every minute can suck up available network bandwidth and exceed the capacity of the server or servers running the crawler sub system.

You now have an important insight into the reason the content acquisition sub system can become a hot spot. You can run out of machine resources, so you will have to make the crawler less aggressive. Alternatively, you can saturate the network and the crawler sub system by bringing back more content than your infrastructure can handle. Some search systems bring back content that exceeds available storage space. Your choices are stark — limit the number of servers and folders the crawling sub system indexes.

When you operate a behind-the-firewall search system, you don’t have the luxury a public Web indexing engine has. These systems can easily skip a server that times out or not revisit a server until the next spidering cycle. In an organization, you have to know what must be indexed immediately or as close to immediately as you can get. You have to acquire content from servers that may time out.

The easy fixes for crawler sub system problems are likely to create some problems for users. Users don’t understand why a document may not be findable in the search system. The document may not have made it back to the search system for any number of reasons. Believe me, users don’t care.

The key to avoiding problems with traditional spidering boils down to knowing how much new and changed content your crawler sub system must handle at peak loads. You also must know the rate of growth for new and changed content. You need the first piece of information to specify the hardware, bandwidth, storage, and RAM you need for the server or servers handling content acquisition. The second data point gives you the information you need to upgrade your content acquisition system. You have to keep the content acquisition system sufficiently robust to handle the ever-larger amount of information generated in organizations today.
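Those two data points, peak load and growth rate, translate into a simple sizing check you can rerun as the crawler logs accumulate. A rough sketch with invented figures:

```python
# Invented figures; substitute the numbers from your own crawler logs.
peak_new_or_changed_gb_per_day = 40
growth_rate_per_year = 0.6            # 60 percent more new and changed content each year
acquisition_capacity_gb_per_day = 80  # what the current crawler hardware sustains

load = peak_new_or_changed_gb_per_day
year = 0
while load <= acquisition_capacity_gb_per_day:
    year += 1
    load *= 1 + growth_rate_per_year

print(f"current capacity is exceeded in roughly {year} year(s): "
      f"{load:.0f} GB/day peak versus {acquisition_capacity_gb_per_day} GB/day available")
```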

The causes of a hot spot in content acquisition include:

  • Insufficient resources
  • Failure to balance crawler aggressiveness with machine resources
  • Improper handling of high-latency response from certain systems whose content must be brought back to the search storage sub system for indexing.

The best fix is to do the up front work accurately and thoroughly. To prevent problems from happening, a proactive upgrade path must be designed and implemented. Routine maintenance and tuning must be routine operations, not “we’ll do it later” procedures.

Push

Push is another way to reduce the need for the content acquisition sub system to “hit” the network at inopportune times. The idea is simple, and it is designed to operate in a way directly opposite from the indiscriminate “push” service that gave pushed content a bad reputation years ago. PointCast “pushed” content indiscriminately, causing network congestion.

The type of “push” I am discussing is a fallout of the document inventory conducted before you deploy the first spider. You want to identify those content objects that can be copied from their host location to the content acquisition storage sub system using a crontab file or a script that triggers the transfer when [a] new or changed data are available and [b] the time is off-peak.

The idea is to keep the spiders from identifying certain content objects and then moving those files from their host location to the crawler storage device at inopportune moments.

In order to make “push” work, you need to know which content is a candidate for routine movement. You have to set up the content acquisition system to receive “pushed” content, which is usually handled via the graphical administrative interface. You need to create the script or customize the vendor-provided function to “wake up” when new or changed content arrives in a specific folder on the machine hosting the content. Then the script consults the rules for starting the “push”. The transfer occurs, and the script should verify in some way that the “pushed” file was received without errors.
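A minimal sketch of such a push script appears below. The folder names, the off-peak rule, and the checksum verification are illustrative assumptions, not any vendor’s supplied snippet; in production this would run from cron or a scheduler.

```python
import hashlib
import shutil
import time
from pathlib import Path

WATCH_DIR = Path("outgoing")          # where new or changed content is dropped
PUSH_TARGET = Path("crawler_inbox")   # storage area the search system reads
OFF_PEAK_HOURS = range(22, 24)        # illustrative rule: push late in the evening only

def checksum(path):
    return hashlib.md5(path.read_bytes()).hexdigest()

def push_once():
    if time.localtime().tm_hour not in OFF_PEAK_HOURS:
        return  # the rule says wait for the off-peak window
    PUSH_TARGET.mkdir(exist_ok=True)
    for source in WATCH_DIR.glob("*"):
        if not source.is_file():
            continue
        destination = PUSH_TARGET / source.name
        shutil.copy2(source, destination)
        # Verify the transfer before declaring the push complete.
        if checksum(source) == checksum(destination):
            source.unlink()       # remove from the outgoing folder once verified
        else:
            destination.unlink()  # bad copy; leave the original for the next run

push_once()
```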

Many vendors of behind-the-firewall search systems support “push”. If your system does not, you can use the API to create this feature. While not trivial, a custom “push” function is a better solution than trying to get a crashed content acquisition sub system back online. You run the risk of having to reacquire the content, which can trigger another crash or saturate the network bandwidth despite your best efforts to prevent another failure.

Why You Want to Use Both Push and Pull

The optimal content acquisition sub system will use both “push” and “pull” techniques. Push can be very effective for high-priority content that must be indexed without waiting for the crawler to run a CRC, time stamp, or file size check on content.
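On the “pull” side, the change check the crawler runs is typically some combination of file size, time stamp, and a checksum such as CRC32; pushed high-priority content skips that wait. A minimal sketch of the check itself:

```python
import os
import zlib

def fingerprint(path):
    """File size, modification time, and CRC32 of the contents."""
    info = os.stat(path)
    with open(path, "rb") as handle:
        crc = zlib.crc32(handle.read())
    return (info.st_size, int(info.st_mtime), crc)

def has_changed(path, last_seen):
    """last_seen maps path -> fingerprint recorded on the previous crawl."""
    current = fingerprint(path)
    changed = last_seen.get(path) != current
    last_seen[path] = current
    return changed

# The first crawl treats everything as changed; later crawls re-acquire only real changes.
print(has_changed(__file__, {}))  # True on the first look at this file
```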

The only way to make the most efficient use of your available resources is to designate certain content as “pull” and other content as “push”. You cannot guess. You must have accurate baseline data and update those data by consulting the crawler logs.

You will want to develop schedules for obtaining new and changed content via “push” and “pull”. You may want to take a look at the essay on this Web log about “hit boosting”, a variation on “push” content with some added zip to ensure that certain information appears in the context you want it to show up.

Where Are the Hot Spots?

If you have a single server and your content acquisition function chokes, you know the main hot spot — available hardware. You should place the crawler sub system on a separate server or servers.

The second hot spot may be the network bandwidth, or lack of it, when you are running the crawlers and pushing data to the content acquisition sub system. If you run out of bandwidth, you face some specific choices. No choice is completely good or bad. The choices are shades of gray; that is, you must make trade-offs. I will highlight three, and you can work through the others yourself.

First, you can acquire less content less frequently. This reduces network saturation, but it increases the likelihood that users will not find the needed information. How can they? The information has not yet been brought to the search system for document processing.

Second, you can shift to “push”, de-emphasizing “pull” or traditional crawling. The upside is that you can control how much content you move and when. The downside is that you may inadvertently saturate the network when you are “pushing”. Also, you will have to do the research to know what to “push”, and then you have to set up or code, configure, test, debug, and deploy the system. If people have to move the content to the folder the “push” script uses, you will need to do some “human engineering”. It’s better to automate the “push” function insofar as possible.

Third, you have to set up a re-crawl schedule. Skipping servers may not be an option in your organization. Of course, if no one notices missing content, you can take your chances. I suggest knuckling down and doing the job correctly the first time. Unfortunately, short cuts and outright mistakes are very common in the content acquisition piece of the puzzle.

In short, hot spots can crop up in the crawler sub system. The causes may be human, configuration, infrastructure, or a combination of causes.

Is This Really a Big Deal?

Vendors will definitely tell you the content acquisition sub system is no big deal. You may be told, “Look we have optimized our crawler to avoid these problems” or “Look, we have made push a point-and-click option. Even my mom can set this up.”

Feel free to believe these assurances. Let me close with an anecdote. Judge for yourself about the importance of staying on top of the content acquisition sub system.

The setting is a large US government agency. The users of the system were sending search requests to an Intranet Web server. The server would ingest the request and output a list of results. No one noticed that the results were incomplete. An audit revealed that the content acquisition sub system was not correctly identifying changed content. The error caused more than four million reports to be incorrect. Remediation cost more than $10 million. Upon analysis, it came to light that the crawler had been incorrectly configured when the system was first installed, almost 18 months before the audit. In addition to the money lost, certain staff were laterally arabesqued. Few in Federal employ get fired.

Pretty exciting for a high-profile vendor, a major US agency, and the “professionals” who created this massive problem.

Now, how important is your search system’s content acquisition sub system to you?

Search System Bottlenecks

February 21, 2008

In a conference call yesterday (February 19, 2008), one of the well-informed participants asked, “What’s with the performance slow downs in these behind-the-firewall search systems?”

I asked, “Is it a specific vendor’s system?”

The answer, “No, it seems like a more general problem. Have you heard anything about search slow downs on Intranet systems?”

I do hear quite a bit about behind-the-firewall search systems. People find my name on the Internet and ask me questions. Others get a referral to me. I listen to their question or comment and try to pass those with legitimate issues to someone who can help out. I’m not too keen on traveling to a big city, poking into the innards of a system, and trying to figure out what went off track. That’s a job for younger, less jaded folks.

But yesterday’s question got me thinking. I dug around in my files and discovered a dated, but still useful diagram of the major components of a behind-the-firewall search system. Here’s the diagram, which I know is difficult to read, but I want to call your attention to the seven principal components of the diagram and then talk briefly about hot spots. I will address each specific hot spot in a separate Web log post to keep the length manageable.

This essay, then, takes a broad look at the places I have learned to examine first when trying to address a system slowdown. I will try to keep the technical jargon and level of detail at a reasonable level. My purpose is to provide you with an orientation to hot spots before you begin your remediation effort.

The Bird’s Eye View of a Typical Search System

Keep in mind that each vendor implements the search sub systems in a way appropriate to its engineering. In general, if you segment the sub systems, you will see a horizontal area in the middle of this diagram surrounded by four key sub systems, the content, and, of course, the user. The search system exists for the user, a fact many vendors and procurement teams happily overlook.

[Diagram: bird’s-eye view of a typical behind-the-firewall search system]

This diagram has been used in my talks at public events for more than five years. You may use it for personal or educational purposes without restriction. If you want to use it in a publication, please contact me for permission.

Let’s run through this diagram and then identify the hot spots. You see some arrows. These show the pipeline through which content, queries, and results flow. In several places, you see arrows pointing in different directions in close proximity. At these interfaces, a glitch of any type will create a slowdown. Now let’s identify the main features.

In the upper left-hand corner is a blue sphere that represents content. For our purpose, let’s just assume that the content resides behind the firewall, and it is the general collection of Word documents, email, and PowerPoints that makes up much of an organization’s information. Pundits calculate that 80 percent of an organization’s information is unstructured. My research suggests that the ratio of structured to unstructured data varies sharply by type of organization. For now, let’s just deal with generalized “content”. In the upper right-hand corner, you see the user. The user, like the content, can be generalized for our purposes. We will assume that the user navigates to a Web page, sees a search box or a list of hot links, and enters a query in some way. I don’t want to de-emphasize the user’s role in this system, but I want to set aside her needs, the hassle of designing an interface, and other user-centric considerations such as personalization.

Backbone or Framework

Now, let’s look at the horizontal area in the center of the diagram shown below:

[Diagram: backbone / framework of the search system]

You can see that there are specific sub systems within this sub system, labeled storage clusters. This is the first key notion to keep in mind when thinking about the performance of a search system. A problem that manifests itself at an interface may be caused by a sub component in a sub system. Until there’s a problem, you may not have thought about your system as a series of nested boxes. What you want to keep in mind is that, until you hit a performance bottleneck, the many complex moving parts were working pretty well. Don’t criticize your system vendor without appreciating how complicated a search system is. These puppies are far from trivial, including the free one you download to index documents on your Mac or PC.

In this rectangle are “spaces” (a single drive or clusters of servers) that hold content returned from the crawling sub system (described below), the outputs of the document processing sub system (described below), the index or indexes, the system that holds the “content” in some proprietary or other representation, and a component to house the “metrics” for the system. Please keep in mind that running analytics is a big job, and you will want to make sure that you have a way to store, process, and manipulate system logs. No system logs? Then you are pretty much lost in space when it comes to troubleshooting. One major Federal agency could not process its logs; therefore, usage data and system performance information did not exist. Not good. Not good at all.
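If you do keep your logs, even a crude summary beats nothing. The Python sketch below assumes a comma-separated query log with one line per query in the form timestamp, user, query, milliseconds; real systems write their own formats, so treat this purely as an illustration of the kind of metrics worth extracting.

    import csv
    from statistics import mean

    def summarize_query_log(path):
        """Compute basic usage and performance numbers from a query log."""
        response_times = []
        users = set()
        with open(path, newline="") as handle:
            for timestamp, user, query, millis in csv.reader(handle):
                users.add(user)
                response_times.append(float(millis))
        return {
            "queries": len(response_times),
            "unique_users": len(users),
            "average_ms": mean(response_times) if response_times else 0.0,
            "worst_ms": max(response_times, default=0.0),
        }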

Content Acquisition

[Diagram: content acquisition sub system]

The components in this sub system handle content acquisition, usually called crawling or spidering. I want to point out that the content acquisition sub system can be a separate server or cluster of servers. Also, keep in mind that keeping the content acquisition sub system on track requires that you fiddle with rules. Some systems, like Google’s search appliance, reduce this to a point-and-click exercise. Other systems require command line editing of configuration files. Rules may be check boxes or separate scripts and programs. Yes, you have to write these or pay someone to do the rule fiddling. When the volume of content grows, this sub system can choke. The result is not a slowdown; instead, some users may say, “I put the content in the folder for indexing, and I can’t find the document.” No, the user can’t. It may be snagged in an overburdened content acquisition sub system.
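Whether the rules live behind check boxes or in configuration files, they boil down to include and exclude patterns. The Python sketch below is a bare-bones version; the patterns are invented for illustration and do not come from any vendor’s product.

    import re

    # Hypothetical crawl rules; real systems keep these in configuration files or a GUI
    INCLUDE_PATTERNS = [r"^https?://intranet\.example\.local/"]
    EXCLUDE_PATTERNS = [r"\.(zip|exe|iso)$", r"/archive/pre-2000/"]

    def should_crawl(url):
        """Apply the include rules first, then the exclude rules."""
        if not any(re.search(p, url) for p in INCLUDE_PATTERNS):
            return False
        if any(re.search(p, url) for p in EXCLUDE_PATTERNS):
            return False
        return True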

Document Processing / Document Transformation

Let me define what I mean by document processing. I am using this term to mean content normalization and transformation. In Beyond Search, I use the word transformation to streamline the text. In this sub system, I am not discussing indexing the content. I want to move a Word file from its native Word format to a form that can be easily ingested by the indexing sub system described in the next section of this essay.

[Diagram: document processing / transformation sub system]

This sub system pulls or accepts the information acquired by the spidering sub system. Each file is transformed into a representation that the indexing sub system (described below) can understand. Transformation is now a key part of many behind-the-firewall systems. The fruit cake of different document types is normalized; that is, made standard. If a document cannot be manipulated by the system, then that document cannot be indexed. An increasing number of document transformation sub systems store the outputs in an XML format. Some vendors include an XML data base or data management system with their search system. Others use a data base system and keep it buried in the “guts” of their system. This notion of transformation means that disc writes will occur. The use of a data base system “under the hood” may impose some performance penalties on the document processing sub system. Traditional data base management systems can be input/output bound. A bottleneck related to an “under the hood” third-party, proprietary, or open source data base can be difficult to speed up if resources like money for hardware are scarce.
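As a simplified illustration of what normalization means in practice, the Python sketch below wraps already-extracted text in a uniform XML envelope. The element names are my own invention; every vendor defines its own normalized representation, and extracting clean text from a Word file is a separate, harder job.

    from xml.etree import ElementTree as ET

    def to_normalized_xml(doc_id, title, body, source_format):
        """Wrap extracted text in a simple, uniform XML envelope."""
        root = ET.Element("document", attrib={"id": doc_id, "source": source_format})
        ET.SubElement(root, "title").text = title
        ET.SubElement(root, "body").text = body
        return ET.tostring(root, encoding="unicode")

    # Example: to_normalized_xml("doc-001", "Quarterly Report", "Revenue rose...", "msword")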

Indexing

Most vendors spend significant time explaining the features and functions of their systems’ indexing. You will hear about semantic indexing, latent semantic indexing, linguistics, and statistical processes. There are very real differences between vendors’ systems. Keep in mind that any indexing sub system is a complicated beastie. Here’s a blow-up from the generalized schematic above:

[Diagram: indexing sub system]

In this diagram, you see knowledge bases, statistical functions, “advanced processes” (linguistics / semantics), and a reference to an indexing infrastructure. Indexing performs much of the “heavy lifting” for a search system, and it is absolutely essential that the indexing sub system be properly resourced. This means bandwidth, CPU cycles, storage, and random access memory. If the indexing sub system cannot keep pace with the amount of information to be indexed and the number of queries passed against the indexes, a number of symptoms become evident to users and the system administrator. I will return to the problems of an overloaded indexing sub system in a separate essay in a day or two. Note that I have included “manual tagging” in the list of fancy processes. The notion of a fully automatic system, in my experience, is a goal, not a reality. Most indexing systems require oversight by a subject matter expert or indexing specialist. Both statistical and linguistic systems can get “lost in space.” There are many reasons, such as language drift, neologisms, and exogenous shifts. The only reliable way to resolve these indexing glitches is to have a human make the changes to the rules, the knowledge bases, or the actual terms assigned to individual records. Few vendors like to discuss these expensive, yet essential, interventions. Little wonder that many licensees feel snookered when “surprises” related to the indexing sub system become evident and then continue to crop up like dandelions.
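Strip away the linguistics, the statistics, and the knowledge bases, and the core of most indexing sub systems is still an inverted index. The toy Python sketch below shows that core; everything the vendors advertise is layered on top of it.

    from collections import defaultdict

    def build_inverted_index(documents):
        """Build a toy inverted index: term -> set of document ids.

        documents maps a document id to its (already normalized) text.
        """
        index = defaultdict(set)
        for doc_id, text in documents.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index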

Query Processing

Query processing is a variant of indexing. Queries have to be passed against the indexes. In effect, a user’s query is “indexed”. The query is matched or passed against the index, and the results are pulled out, formatted, and pushed to the user. I’m not going to talk about stored queries or what used to be called SDI (selective dissemination of information), saved searches, or filters. Let’s just talk about a key word query.

[Diagram: query processing sub system]

The query processing sub system consists of some pre- and post-processing functions. A heavily used system requires a robust query processing “front end.” The more users sending queries at the same time, the more important it is to be able to process those queries and get results back in an acceptable time. My tests show that a user of a behind-the-firewall system will wait as much as 15 seconds before complaining. In my tests on systems in 2007, I found an average query response time in the 20 second range, which explains in large part why employees are dissatisfied with their incumbent search system. The dissatisfaction is a result of an inadequate infrastructure for the search system itself, and it does not single out a specific vendor; the vendors are equally dissatisfying. The vendors, obviously, can make their systems run faster, but the licensee has the responsibility to provide a suitable infrastructure on which to run the search system. In short, the dissatisfaction is a result of poor response time, and only licensees can fix the infrastructure problem. Blaming a search vendor for lousy performance is often misdirected. Notice that the functions performed within the query processing sub system are complex; for example, “on the fly” clustering, relevance ranking, and formatting. Some systems include work flow components that shape queries and results to meet the needs of particular employees or tasks. The work flow component then generates the display appropriate for the work task. Some systems “inject” search results into a third-party application so the employee has the needed information on a screen display related to the work task; for instance, a person’s investments or prior contact history.
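To connect query processing to the indexing sketch above, here is a toy Python function that passes a key word query against that inverted index, ranks hits by the number of matched terms, and times the whole operation. Real systems add relevance ranking, clustering, and result formatting, which is where the response time goes.

    import time
    from collections import Counter

    def run_query(query, index):
        """Match a key word query against the toy inverted index and rank the hits."""
        started = time.perf_counter()
        hits = Counter()
        for term in query.lower().split():
            for doc_id in index.get(term, set()):
                hits[doc_id] += 1
        elapsed = time.perf_counter() - started
        return hits.most_common(10), elapsed  # top ten documents plus elapsed seconds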

Back to Hot Spots

Let me reiterate: I am using an older, generalized diagram. I want to identify the complexities within a representative behind-the-firewall search system. The purpose of this exercise is to allow me to comment on some general hot spots as a precursor to a quick look, in subsequent essays, at specific bottlenecks in the sub systems.

The high-level points about search system slowdowns are:

  1. A slowdown in one part of the system may be caused by a deeper issue. In many cases, the problem could be buried deep within a particular component in a sub system. Glitches in search systems can, therefore, take some time to troubleshoot. In some cases, there may be no “fix”. The engineers will have to “work around” the problem, which may mean writing code. Switching to a hosted service or a search appliance may be the easiest way to avoid this problem.
  2. The slowdown may be outside the vendor’s span of control. If you have an inadequate search system infrastructure, the vendor can advise you on what to change, but you will need the capital resources to make the change. Most slowdowns in search systems are a result of the licensee’s errors in calculating CPU cycles, storage, bandwidth, and RAM. The cause of this problem is ignorance of the computational burden search systems place on their infrastructure. Fast CPUs are wonderful, but you may need clusters of servers, not one or two servers. The fix is to get outside verification of the infrastructure demands. If you can’t afford the plumbing, shift to a hosted solution or license an appliance.
  3. A surge in either the amount of content to index or the number of queries to process can individually bring a system to a halt. When the two coincide, the system will choke, often failing. If you don’t have log data and you don’t review it, you will not know where to begin looking for a problem. The logs are often orphans, and their data are voluminous, hard to process, and cryptic. Get over it. Most organizations have a steady increase in content to be processed and more users sending queries to the search system despite their dissatisfaction with its performance. In this case, you will have a system that will fail and then fail again. The fix is to buckle down, manage the logs, study what’s going on in the sub systems, and act in an anticipatory way (see the sketch after this list). What’s this mean? You will have to continue to build out your system while performance is acceptable. If you wait until something goes wrong, you will be in a very precarious position.
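Here is the sketch promised in point three: a few lines of Python that compare the latest week’s volume (documents processed or queries served) against a trailing average and flag a surge. The four-week window and the 1.5 multiplier are arbitrary assumptions; the point is to see trouble coming while you can still act.

    def flag_surges(weekly_counts, window=4, threshold=1.5):
        """Flag weeks whose volume jumps well above the recent average.

        weekly_counts is a list of numbers, one per week.
        """
        warnings = []
        for i in range(window, len(weekly_counts)):
            baseline = sum(weekly_counts[i - window:i]) / window
            if baseline and weekly_counts[i] > threshold * baseline:
                warnings.append((i, weekly_counts[i], baseline))
        return warnings

    # Example: flag_surges([1000, 1100, 1050, 1200, 2400]) flags week index 4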

To wrap up this discussion, you may be reeling from the ill-tasting medicine I have prescribed. Slowdowns and hot spots are a fact of life with complex systems such as search. Furthermore, the complexity of search systems in general and their sub systems in particular is not fully understood by most licensees, their IT colleagues, or their management. In the first three editions of the Enterprise Search Report, I discussed this problem at length. I touch upon it briefly in Beyond Search because it is critical to the success of any search or content processing initiative. If you have different experiences from mine, please share them via the comments function on this Web log.

I will address specific hot spots in the next day or two.

Stephen Arnold, February 21, 2008
