Document Processing: Transformation Hot Spots

February 23, 2008

Let’s pick up the thread of sluggish behind-the-firewall search systems. I want to look at one hot spot in the sub system responsible for document processing. Recall that the crawler sub system finds or accepts information. The notion of “find” relates to a crawler or spider able to identify new or changed information. The spider copies the information back to the content processing sub system. For the purposes of our discussion, we will simplify spidering to the find-and-send back approach. The other way to get content to the search system is to push it. The idea is that a small program wakes up when new or changed content is placed in a specific location on a server. The script “pushes” the content — that is, copies the information — to a specific storage area on the content processing sub system. So, we’re dealing with pushing or pulling content. The diagram to which these comments refer is here.

Now what happens?

There are many possible functions a vendor can place in the document processing subsystem. I want to focus on one key function — content transformation. Content transformation takes a file — let’s say a PowerPoint — and creates a version of this document in an XML structure “known” to the search system. The idea is that a number of different file types are found in an organization. These can range from garden variety Word 2003 files to the more exotic XyWrite files still in use at certain US government agencies. (Yes, I know that’s hard to believe because you may not know what XyWrite is.)

Most search system vendors say, “We support more than 200 different file types.” That’s true. Over the years, scripts that convert a source file of one type into an output file of another type have been written. Years ago, there were independent firms doing business as Data Junction and Outside In. These two companies, along with dozens of others, have been acquired. A vendor can license these tools from their owners. Also, there are a number of open source conversion and transformation tools available from Source Forge, shareware repositories, and from freeware distributors. However, a number of search system vendors will assert, “We wrote our own filters.” This is usually a way to differentiate their transformation tools from a competitor. The reality is that most vendors get use a combination of licensed tools, open source tools, and home-grown tools. The key point is the answer to two questions:

How well do these filters or transformation routines work on the specific content you want to have the search system make searchable?
How fast do these systems operate on the specific production machines you will use for content transformation?

The only way to answer these two questions with accuracy is to test the transformation throughput on your content and on the exact machines you will use in production. Any other approach will create a general throughput rate value that your production system may or may not be able to deliver. Isn’t it better to know what you can transform before you start processing content for real?

I’ve just identified the two reasons for unexpected bottlenecks and, hence, poor document processing performance. First, you have content that the vendor’s filters cannot handle. When a document processing sub system can’t figure out how to transform a file, it writes the file name, date, time, size, and maybe an error code in the document processing log. If you have too many rejected files, you have to intervene, figure out the problem with the files, and then take remedial action. Remedial action may mean re keying the file or going through some manual process of converting the file from its native format, to a neutral format like ASCII, doing to manual touch up like adding sub heads or tags, and then putting the touched up file into the document processing queue. Talk about a bottleneck. In most organizations, there is neither money nor people to do this work. Fixing the content transformation problems can take days, week, or never be done at all. Not surprisingly, a system that can’t process the content cannot make that content available to the system users. This glitch is a trivial problem when you are first deploying a system because you don’t have much knowledge of what will be transformed and what won’t. Imagine the magnitude of the problem when a transformation problem is discovered after the system is up and running. You may find log files over writing themselves. You may find “out of space” messages in the folder used by the system to write files that can’t be transformed. You may find intermittent errors cascading back through the content acquisition system due to transformation issues. Have you looked at your document processing log files today?

The second problem has to do with document processing hardware. In my experience exactly zero of the organizations with which I am familiar have run pre-deal tests on the exact hardware that will be used in production document processing. The exception are the organizations licensing appliances. The appliance vendors deliver hardware with a known capacity. Appliances, however, comprise less than 15 percent of the installed base of behind-the-firewall search systems. Most organizations’ information technology departments think that vendor estimates are good enough. Furthermore, most information technology groups believe that existing hardware and infrastructure are adequate for a search application. What happens? The system goes into operation and runs along until the volume of content to be proc3essed exceeds available resources. When that happens, the document processing sub system slows to a crawl or hangs.

Performance Erosion

Document processing is not a set-it and forget-it sub system. Let’s look at why you need to invest the time engineering, testing, monitoring, and upgrading the document processing sub system. I know before I summarize the information from my files that few, if any, readers of this Web log will take these actions. I must admit indifference to the document processing sub system generates significant revenue for consultants, but so many hassles can be avoided by taking some simple preventive actions. Sigh.

Let’s look at the causes of performance erosion:

The volume of content is increasing. Most organizations whose digital content production volume I have analyzed double their digital content every 12 months. This means that if one employee has five megabytes of new content when you turn on the system, that employee will have on her computer, 12 months after you start the search system, you will have the original five megabytes in the index and the new five megabytes for a total of 10 megabytes of content. No big deal, right? Storage is cheap. It is a big deal when you are working in an organization with constraints on storage, an inability to remove duplicate content from the index, and an indiscriminate content acquisition process. Some organizations can’t “plug in” new storage the way you can on a PC or Mac. Storage must be ordered, installed, and certified. In the meantime, what happens? The document processing system falls behind. Can it catch up? Maybe. Maybe not.
The content is not new. Employees recycle, save different drafts of documents, and merge pieces of boiler plate text to create new documents. Again, if you work on one PowerPoint, you can index any PowerPoint. But when you have many PowerPoints each with minor changes and the email messages like “Take a look at this an send me your changes”, you can index the same content again and again. A results list is not just filled with irrelevant hits; the basic function of search and retrieval is broken. Does your search system return a results list of what look like the same document with different date, time, and size values? How do you determine which version of the document is the “best and final” one? What are the risks of using the incorrect version of a document? How much does your organization spend on figuring out which version of a document is the “one” the CEO really needs?

As you wrestle with these questions, recall that you are shoving more content through a system which unless constantly upgraded will slow to a crawl. You have set the stage for thrashing. The available resources are being consumed processing the same information again and again, not processing the meaningful documents one time and again only when a significant change is made. Ah, you don’t know what documents are meaningful? You are now like the snake eating its tail. Because you don’t have an editorial policy or content acquisition procedures in place, you have found the slow down in document processing is nothing more than a consequence of an earlier misstep. So, no matter what you do to “fix” document processing, you won’t be able to get your search system working the way users want it to. Pretty depressing? Furthermore, senior management doesn’t understand why throwing money at a problem in document processing doesn’t have any significant pay off to justify the expense.

XML and Transformation

I’m not sure I can name a search vendor who does not support XML. XML is an incantation. Say it enough times, and I guess it becomes the magic fix to what ever ailments a content processing system has.

Let me give you my view of this XML baloney. First, XML or extensible mark up language is not a panacea. XML is, at its core, a programmatic approach to content. How many of you reading this column program anything in any language? Darn few. So the painful truth is, you don’t know how to “fix” or “create” a valid XML instance, but you sure sound great when you chatter about XML.

Second, XML is a simplified version of SGML which is in turn a decendent of CALS (computer aided logistics system) spawned by our insightful colleagues in the US government to deal with procurement. Lurking behind a nice Word document in the “new” DOCX format is a DTD, document type definition. But out of sight, out of mind, correct? Unfortunately, no.

Third, XML is like an ancient Roman wall in 25 BCE. The smooth surface conceals a heck of a lot of rubble between some rigid structures made of irregular brick or stone. This means that taking a “flavor” of XML and converting it to the XML that your search system understands is a programmatic process. This notion of converting a source file like a WordPerfect document into an XML version that the search system can use is pretty darn complicated., When it goes wacky, it’s just like debugging any other computer program. Who knows how easy or hard it will be to find and fix the error? Who knows how long it will take? Who knows how much it will cost? I sure don’t.

If we take these three comments and think about them, it’s evident that this document transformation can chew up some computer processing cycles. If a document can’t be transformed, the exception log can grow. Dealing with these exceptions is not something one does in a few spare minutes between meetings.

Nope.

XML is work, which when done properly, greatly increases the functionality of indexing sub systems. When done poorly, XML is just another search system nightmare.

Stepping Back

How can you head off these document processing / transformation challenges?

The first step is knowing about them. If your vendor has educated you, great. If you have learned from the school of hard knocks, that’s probably better. If you have researched search and ingested as much other information as you can, you go to the head of the class.

An increasing number of organizations are solving this throughput problem by: [a] ripping and replacing the incumbent search system. At best, this is a temporary fix; [b] shifting to an appliance model. This works pretty well, but you have to keep adding appliances to keep up with content growth and the procedure and policy issues will surface again unless addressed before the appliance is deployed; [c] shifting to a hosted solution. This is an up-and-coming fix because it outsources the problem and slithers away from the capital investment on-premises installations require.

Notice that I’m not suggesting slapping an adhesive bandage on your incumbent search system. A quick fix is not going to do much more than buy time. In Beyond Search, I go into some depth about vendors who can “wrap” your ailing search system with a life-support system. This approach is much better than a quick fix, but you will have to address the larger policy and procedural issues to make this hybrid solution work over the long term.

You are probably wondering how transforming a bunch of content can become such a headache. You have just learned something about the “hidden secrets” of behind-the-firewall search. You have to dig into a number of murky, complex areas before you make your search system “live.”

I think the following checklist has not been made available without charge before. You may find it useful, and if I have left something out, please, let me know via the comments function on this Web log.

How much information in what format must the search system acquire and transform on a monthly and annual basis?
What percent of the transformation is for new content? How much for changed content?
What percent of content that must be processed exists in what specific file types? Does our vendor’s transformation system handle this source material? What percent of documents cannot be handled?
What filters must be modified, tested, and integrated into the search systems?
What is the administrative procedure for dealing with [a] exceptions and [b] new file types such as an email with an unrecognized attachment?
What is the mechanism for determining what content is a valid version and which content is a duplication? What pre-indexing process must be created to minimize system cycles needed to identify duplicate content; that is, how can I get my colleagues to flag only content that should be indexed before the content is acquired by the document processing system?
What is the upgrade plan for the document processing sub system?
What content will not be processed if the document processing sub system slows? What is the procedure for processing excluded content when the document processing subsystem again has capacity?
What is the financial switch over point from on-premises search to an appliance or a hosted / managed service model?
What is the triage procedure when a document processing sub system degrades to an unacceptable level?
What’s the XML strategy for this search system? What does the vendor do to fix issues? What are my contingency plans and options when a problem becomes evident?

In another post, I want to look at hot spots in indexing. What’s intriguing is that so far we have brought or had content pushed to the search system storage devices. We have normalized content and written that content in a form the indexing system can understand to the storage sub system. Is any one keeping track of how many instances of a document we have in the search system at any one time? We need that number. If we run out of storage, we’re dead in the water.

This behind-the-firewall search is a no-brainer. Believe it or not, a senior technologist at a 10,000-person organization told me in late 2007, “Search is not that complicated.” That’s a guy who really knows his information retrieval limits!

Stephen Arnold, February 23, 2008

Written by Stephen E. Arnold · Filed Under Cost, Enterprise, Search

Comments

One Response to “Document Processing: Transformation Hot Spots”

nrksfywj gihrjx on August 11th, 2008 12:44 am

qrpsdkh jvpnxkyfq mfhb sclneft zmpdv knxtocshp ldehrn

Search the site
Subscribe to Beyond Search
Feature archive
News archive

Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.