Transformation: An Emerging “Hot” Niche

January 25, 2008

Transformation is a $5 word that means changing a file from one format to another. The trade magazines and professional writers often use "data integration" or "normalization" to refer to what amounts to taking a Word 97 document with a Dot DOC extension and turning it into a structured document in XML. These big words and phrases refer to a significant gotcha in behind-the-firewall search, content processing, and plain old moving information from one system to another.

Here’s a simple example of the headaches associated with what should be a seamless, invisible process after a half century of computing. The story:

You buy a new computer. Maybe a Windows laptop or a new Mac. You load a copy of Office 2007, write a proposal, save the file, and attach it to an email that says, “I’ve drafted the proposal we have to submit tomorrow before 10 am.” You send the email and go out with some friends.

In the midst of a friendly discussion about the merits of US Democratic presidential contenders, your mobile rings. You hear your boss saying over the background noise, “You sent me a file I can’t open. I need the file. Where are you? In a bar? Do you have your computer so you can resend the file? No? Just get it done now!” Click here to read what ITWorld has to say on this subject. Also, there’s some user vitriol over the Word-to-Word compatibility hassle itself here. A workaround from Tech Addict is here.

Another scenario is to have a powerful new content processing system that churns through, according to the vendor’s technical specification, “more than 200 common file types.” You set up the content processing gizmo, aim it at the marketing department’s server, and click “Index.” You go home. When you arrive the next morning at 8 am, you find that the 60,000 documents in the folders you wanted indexed have become an index with 30,000 documents. Where are the other 30,000 documents? After a bit of fiddling, you discover the exception log and find that half of the documents you wanted indexed were not processed. You look up the error code and learn that it means, “File type not supported.”

The culprit is the inability of one system to recognize and process a file. The reasons for the exceptions are many and often subtle. Let’s troubleshoot the first problem, the boss’s inability to open a Word 2007 file sent as an attachment to an email.

The problem is that the recipient is using an older version of Word. The sender saved the file in the most recent Word’s version of XML. You can recognize these files by their Dot DOCX extension. What the sender should have done is save the proposal as [a] a Dot DOC file in an older “flavor” of Word’s DOC format; [b] a file in the now long-in-the-tooth RTF (rich text format) type; or [c] a file in Dot TXT (ASCII) format. The fix is for the sender to resend the file in a format the recipient can view. But that one file can cost a person credibility points or the company a contract.

The second scenario is more complicated. The marketing department’s server had a combination of Word files, Adobe Portable Document Format files with Dot PDF extensions, some Adobe InDesign files, some QuarkXPress files, some FrameMaker files, and some database files produced on a system no one knows much about except that the files came from a system no longer used by marketing. A bit of manual exploration revealed that the Adobe PDF files were password protected, so the content processing system rejected them. The content processing system lacked import filters to open the proprietary page layout and publishing program files. So it rejected them. The mysterious files from the disused system were data dumps from an IBM CICS system. The content processing system opened them and then found them unreadable, so those were exceptions as well.

Now the nettles, painful nettles:

First, fixing the problem with any one file is disruptive but usually doable. The reputation damage done may or may not be repaired. At the very least, the sender’s evening was ruined. The high-powered vice president was with a gaggle of upper crust types arguing about an election’s impact on trust funds. To “fix” the problem, she had to redo her work. It was time consuming, and it was annoying to leave her friends. The recipient, a senior VP, had to jiggle his plans in order to meet the 10 am deadline. Instead of chilling with The Simpsons TV show, he had to dive into the proposal and shape the numbers under the pressure of the looming deadline.

We can now appreciate a 30,000 file problem. It is a very big problem. There’s probably no way to get the passwords to open some of the PDFs. So, the PDFs’ content may remain unreadable. The weird publishing formats have to be opened in the application that created them and then exported in a file format the content processing system understands. This is a tricky problem, maybe another Web log posting. An alternative is to print out hard copies of the files, scan them, use optical character recognition software to create ASCII versions, and then feed the ASCII versions of the files to the content processing system. (Note: some vendors make paper-to-ASCII systems to handle this type of problem.) Those IBM CICS files can be recovered, but an outside vendor may be needed if the system producing the files is no longer available in house. When the costs are added up, these 30,000 files can represent hundreds of hours of tedious work. Figure $60 per hour and a week’s work if everything goes smoothly, and you can estimate the minimum budget “hit”. No one knows the final cost because transformation is dicey. Cost naivety is the reason my blood pressure spikes when a vendor asserts, “Our system will index all the information in your organization.” That’s baloney. You don’t know what will or won’t be indexed unless you perform a thorough inventory of files and their types and then run tests on a sample of each document type. That just doesn’t happen very often in my experience.

Now you know what transformation is. It is a formal process of converting lead into content gold.

One Google wizard — whose name I will withhold so Google’s legions of super-attorneys don’t flock to rural Kentucky to get the sheriff to lock me up — estimated that up to 30 percent of information technology budgets is consumed by transformation. So for a certain chicken company’s $17 million IT budget, the transformation bill could be in the $5 to $6 million range. That translates to selling a heck of a lot of fried chicken. Let’s assume the wizard is wrong by a factor of two. This means that $2 to $3 million is gnawed by transformation.
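For readers who like to see the arithmetic, here is a minimal back-of-the-envelope sketch in Python. The $17 million budget and the 30 percent share come from the paragraph above; the function name and variable names are my own illustrative choices, not anything the wizard supplied.

```python
# Back-of-the-envelope transformation cost using the figures quoted in the text.
IT_BUDGET = 17_000_000   # the chicken company's IT budget, in dollars
WIZARD_SHARE = 0.30      # "up to 30 percent" consumed by transformation


def transformation_cost(budget, share):
    """Return the dollars of a budget consumed by transformation."""
    return budget * share


# The wizard's upper estimate.
high = transformation_cost(IT_BUDGET, WIZARD_SHARE)
# The same estimate if the wizard is wrong by a factor of two.
low = transformation_cost(IT_BUDGET, WIZARD_SHARE / 2)

print(f"Wizard's estimate: ${high:,.0f}")
print(f"Half that:         ${low:,.0f}")
```

The two results, roughly $5.1 million and $2.55 million, sit inside the "$5 to $6 million" and "$2 to $3 million" ranges mentioned above.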

As organizations generate and absorb more digital information, what happens to transformation costs? The costs will go up. Whether the Google wizard is right or wrong, transformation is an issue that needs experienced hands minding the store.

The trigger for these two examples is a news item that the former president of Fast Search & Transfer, Ali Riaz, has started a new search company. Its USP (unique selling proposition) is data integration plus search and content processing. You can read Information Week’s take on this new company here.

In Beyond Search, I discuss a number of companies and their ability to transform and integrate data. If you haven’t experienced the thrill of a transformation job, a data integration project, or a structured data normalization task — you will. Transformation is going to be a hot niche for the next few years.

Understanding of what can be done with existing digital information is, in general, wide and shallow. Transformation demands narrow and deep understanding of a number of esoteric and almost insanely diabolical issues. Let me identify three from my own personal experience learned at the street academy called Our Lady of Saint Transformation.

First, each publishing system has its own peculiarities about files produced by different versions of itself. InDesign 1.0 and 2.0 cannot open the most recent version’s files. There’s a workaround, but unless you are “into” InDesign, you have to climb this learning curve and fast. I’m not picking on Adobe. The same intra-program incompatibilities plague Quark, PageMaker, the moribund Ventura, FrameMaker, and some high-end professional publishing systems.

Second, data files spit out by mainframe systems can be fun for a 20-something. There are some interesting data formats still in daily use. EBCDIC or Extended Binary-Coded Decimal Interchange Code is something some readers can learn to love. It is either that or figuring out how to fire up an IBM mainframe, reinstalling the application (good luck on that one, 20-somethings), restoring the data from a DASD or flat file backup tapes (another fun task for a recent computer science grad), and then outputting something the zippy new search or content processing system can convert in a meaningful way. (Note: “meaningful way” is important because when a filter gets confused, it produces some interesting metadata. Some glitches can require you to reindex the content if your index restore won’t work.)
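To give a flavor of what "learning to love" EBCDIC involves, here is a small Python sketch that decodes EBCDIC bytes into readable text. It assumes the data uses code page 037 (a common US/Canada EBCDIC variant); a real dump may use a different code page, and the sample bytes and function name are made up for illustration.

```python
# Decode EBCDIC text using one of Python's built-in EBCDIC codecs.
# Assumption: the dump uses code page 037. Other dumps may need a
# different codec, such as "cp500" or "cp1140".

# Five EBCDIC bytes that spell "HELLO" in code page 037 (illustrative sample).
ebcdic_bytes = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])


def ebcdic_to_text(raw, codepage="cp037"):
    """Convert raw EBCDIC bytes into a Python string."""
    return raw.decode(codepage)


text = ebcdic_to_text(ebcdic_bytes)
print(text)

# Feed the same bytes to an ASCII-minded filter and you get the kind of
# "interesting" output mentioned above.
print(ebcdic_bytes.decode("latin-1"))
```

This handles plain character data only. Mainframe records often mix text with binary or packed numeric fields, and those need the original record layout before anything meaningful comes out.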

Third, the Adobe PDFs with their two layers of security can be especially interesting. If you have one level of password, you can open the file and maybe print it and copy some content from it. Or not. If not, you print the PDFs (if printing has not been disabled) and go through the OCR-to-ASCII drill. In my opinion, PDFs are like a digital albatross. These birds hang around one’s neck. Your colleagues want to “search” for the PDFs’ content in their behind-the-firewall system. When asked to produce the needed passwords, I often hear something discomforting from the marketing department. So it is no surprise to learn that some system users are not too happy.
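Before you aim an indexer at a pile of PDFs, it can help to get a rough count of how many are locked. The Python sketch below is a crude heuristic, not a real PDF parser: it assumes that a protected PDF carries an /Encrypt entry somewhere in its bytes, which is how encrypted PDFs normally announce themselves. It cannot tell you which of the two password levels is in play, and it can produce false positives. The function name and the sample byte strings are mine.

```python
# Rough heuristic for spotting password-protected PDFs before indexing.
# Assumption: an encrypted PDF contains an /Encrypt entry; this check does
# not parse the file and may misfire on unusual documents.

def looks_encrypted(pdf_bytes):
    """Return True if the data starts like a PDF and mentions /Encrypt."""
    return pdf_bytes.startswith(b"%PDF-") and b"/Encrypt" in pdf_bytes


# Two tiny made-up fragments standing in for real files.
open_sample = b"%PDF-1.4\ntrailer\n<< /Root 1 0 R >>\n%%EOF"
locked_sample = b"%PDF-1.4\ntrailer\n<< /Root 1 0 R /Encrypt 5 0 R >>\n%%EOF"

print(looks_encrypted(open_sample))
print(looks_encrypted(locked_sample))
```

For real files, read each one in binary mode and pass the bytes to the function. Treat the result as a prompt to go ask marketing for passwords, not as a verdict.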

You may find this post disheartening.

No!

This post is chock full of really good news. It makes clear that companies in the business of transformation are going to have more customers in 2008 and 2009. It’s good news for off-shore conversion shops. Companies that have potent transformation tools are going to have a growing list of prospects. Young college grads get more chances to learn the mainframe’s idiosyncrasies.

The only negative in this rosy scenario is for the individual who:

Fails to audit the file types and the amount of content in those file types
Skips determining which content must be transformed before the new system is activated
Ignores the budget implications of transformation
Assumes that 200 or 300 filters will do the job
Does not understand the implications behind a vendor’s statement along these lines: “Our engineers can create a custom filter for you if you don’t have time to do that scripting yourself.”
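The first item on that list, the file type audit, does not need a fancy tool. Here is a minimal Python sketch that walks a folder tree and tallies files and bytes by extension. The function name and the example path are hypothetical, and an extension is only a hint about what is inside a file, so treat the output as a starting point for the sampling tests, not as the final inventory.

```python
import os
from collections import Counter


def audit_file_types(root):
    """Walk a folder tree and tally file counts and total bytes per extension."""
    counts = Counter()
    sizes = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Normalize so ".DOC" and ".doc" land in the same bucket.
            ext = os.path.splitext(name)[1].lower() or "(none)"
            counts[ext] += 1
            try:
                sizes[ext] += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # Unreadable entries are still counted; their size is skipped.
                pass
    return counts, sizes


if __name__ == "__main__":
    # Hypothetical path; point this at the server share you plan to index.
    counts, sizes = audit_file_types("/path/to/marketing/share")
    for ext, n in counts.most_common():
        print(f"{ext:10} {n:8,} files {sizes[ext]:15,} bytes")
```

Compare the extensions that turn up against the vendor's list of supported file types before you click "Index," not after.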

One final point: those 200 or more file types. Vendors talk about them with gusto. Check to see if the vendor is licensing filters from a third party. In certain situations, the included file type filters don’t support some of the more recent applications’ file formats. Other vendors “roll their own” filters. But filters can vary in efficacy because different people write them at different times with different capabilities. Try as they might, vendors can’t squash some of the filter nits and bugs. When you do some investigating, you may be able to substantiate my data that suggest filters work on about two thirds of the files you feed into the search or content processing system. Your investigation may prove my data incorrect. No problem. When you are processing 250,000 documents, the exception file becomes chunky from the system’s two to three percent rejection rate. A thirty percent rate can be a show stopper.
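To see why even a small rejection rate matters, here is the arithmetic as a short Python sketch. The 250,000 documents and the two, three, and thirty percent rates come from the paragraph above; the function name is my own.

```python
# How big does the exception file get at different rejection rates?
TOTAL_DOCUMENTS = 250_000


def rejected(total, rate_percent):
    """Return how many documents are rejected at a whole-number percent rate."""
    return total * rate_percent // 100


for rate in (2, 3, 30):
    count = rejected(TOTAL_DOCUMENTS, rate)
    print(f"{rate:>2}% rejection: {count:,} documents in the exception file")
```

At two to three percent, that is 5,000 to 7,500 documents someone has to chase down. At thirty percent, it is 75,000.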

Stephen E. Arnold, January 25, 2008


Comments

One Response to “Transformation: An Emerging “Hot” Niche”

David Eddy on January 25th, 2008 12:44 pm

Steve –

Excellent discussion of the complexities of file conversions.

Here are two lists/sites of file extensions:
http://filext.com/index.php
http://whatis.techtarget.com/file-extension-list-A/0,289933,sid9,00.html

Way bigger than 200.

– David


Stephen E. Arnold monitors search, content processing, text mining and related topics from his high-tech nerve center in rural Kentucky. He tries to winnow the goose feathers from the giblets. He works with colleagues worldwide to make this Web log useful to those who want to go "beyond search". Contact him at sa [at] arnoldit.com. His Web site with additional information about search is arnoldit.com.