Content Transformation: A Challenge that Won’t Go Away

May 15, 2008

We live in a world of Web 2.0 and Web 3.0 goodness. At the Where 2.0 conference in Burlingame, California on May 14, 2008, I overheard this snippet of conversation:

We had everything working, but when we imported content, the system crashed. I reinstalled. I checked the config files. It still crashed. I have to open each file, resave it as an RTF, and import them one at a time. Grrrr.

Sound familiar?

I have heard this complaint many times before. In our content-savvy, XML-ized era, moving a source file into a content processing system should be trivial. The content processing system can extract entities. It can metatag. Some can slice, dice, and cook a chicken. But unless the system can ingest content and transform it into something the content processing subsystem understands, the system is dead in the water. Even worse, some text processing systems process only a portion of the source documents. In certain mission critical applications, kicking out documents is a no-no. Not only is the manual manipulation expensive, it is time consuming. In those minutes or hours of fiddling, potentially significant data are not available to the analysts.

What does missing information cost? It depends on your work situation. In the Wall Street world, investment information can turn a win into a loss in a millisecond. In certain military applications, the information may mean the difference between health and harm.

[Image: a square and a circle]

Transforming a square into a circle or a circle into a square looks easy. With a triangle and a compass you can create the two objects. It’s the intermediate steps that become tricky for an artist or a budding mathematician.

What is file or data transformation? In its simplest form, you have a file in Microsoft Word 2007 format, and you want to “transform” or change the file into a format recognized by another system’s import filter. So, one approach would be to open the file in Word 2007, click File > Save As, select RTF (Rich Text Format), and save the file. You can then allow your search or content processing system to suck the file into the conversion subsystem and turn the RTF into whatever target output format the filter generates. In a more sophisticated form, you take an unstructured document or a database table, and you transform it into some file type that your system can process. A more interesting task is to convert a file into a file with a comparable structure; for instance, take an SGML instance and convert it to HTML. Some search system vendors include filters and transformation tools with their systems. Others provide an application programming interface. The idea is that you will write a script to perform whatever conversion you require, handle entities in an appropriate manner, and preserve the information and metadata (if available) throughout the process.
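If you want to automate that Save As step instead of clicking through each file, the Word object model can be scripted. Below is a minimal sketch in Python, assuming a Windows machine with Word installed and the pywin32 package; the folder path is a placeholder, and the value 6 is Word’s object model constant for RTF output.

```python
# Batch-convert Word documents to RTF by driving Word through COM.
# Assumes Windows, an installed copy of Word, and the pywin32 package.
# The C:\incoming folder is a placeholder for your own drop directory.
import glob
import os
import win32com.client

WD_FORMAT_RTF = 6  # Word object model constant for RTF output

word = win32com.client.Dispatch("Word.Application")
word.Visible = False
try:
    for path in glob.glob(r"C:\incoming\*.docx"):
        doc = word.Documents.Open(os.path.abspath(path))
        rtf_path = os.path.splitext(path)[0] + ".rtf"
        doc.SaveAs(rtf_path, WD_FORMAT_RTF)  # positional args: FileName, FileFormat
        doc.Close(False)  # close without saving further changes
finally:
    word.Quit()
```

The content processing system can then ingest the RTF files through its own filter.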

Let’s take a quick look at several transformation challenges and then step back to consider what steps you can follow to minimize these problems. Before jumping into the causes, keep in mind that as much as 30 percent of an information technology department’s budget is consumed by transformation costs. This astounding number surfaced in a presentation given by a Google engineer in 2007. If that number seems high, you can knock it down to a more acceptable 10 or 20 percent. The point is that fiddling with data when moving it from one system and format to another is a common task. Any transformation activity can go off the tracks.

Four Challenges Thrown at Me

First, let’s consider the “standard” for document sharing in many financial institutions—the PDF or Adobe Portable Document Format file. Adobe created a file format that could be opened on Unix, Windows, or Macintosh more than 15 years ago. Today, many applications generate a PDF when the user clicks the ubiquitous PDF icon or chooses File > Save As in the application.

The challenge is that there are different “flavors” of PDFs. When transforming PDFs, the content processing system may be able to index only those PDFs which have been saved with text in them. If the PDF wraps a TIFF (tagged image file format) image, the transformation routine may reject the document. If you are working with investment analyst reports, these often are image files, and you have to deal with sometimes long documents that your users want processed by the text mining system. Casually opening and closing a PDF is not sufficient to reveal whether the PDF is a TIFF image in a PDF wrapper or a PDF with text. One further complication with PDFs is that some authors (including me) slap one or two passwords on a PDF to exert some control over repurposing of the content. I often disable the copy and print functions of the PDF as well, making it somewhat more difficult for a recipient of one of my limited-distribution reports to print out pages, OCR (optical character recognition) the text, and process the raw ASCII. If you are looking for a simple fix for the PDF problem, I don’t have one. For some important PDFs, the document may have to be rekeyed, or a series of manual screen captures performed with the images reconstituted for additional processing. If you find a better fix, let me know.
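One partial mitigation is to triage PDFs before they hit the content processing system. Here is a minimal sketch, assuming the pdftotext utility (part of the xpdf/Poppler tools) is installed and on the PATH; the 50-character threshold and the file name are arbitrary placeholders.

```python
# Triage PDFs before ingestion: if pdftotext extracts almost no text,
# the file is probably a scanned image in a PDF wrapper and needs OCR
# or rekeying. Assumes the pdftotext utility is on the PATH.
import subprocess

def has_extractable_text(pdf_path, min_chars=50):
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" writes the extracted text to stdout
        capture_output=True,
    )
    if result.returncode != 0:
        return False  # encrypted or unreadable; route to a manual queue
    text = result.stdout.decode("utf-8", errors="ignore")
    return len(text.strip()) >= min_chars

if __name__ == "__main__":
    print(has_extractable_text("analyst_report.pdf"))  # placeholder file name
```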

Second, there’s the challenge of Web pages. I’m not talking about Web pages in the “wild” on the public Internet. I’m talking about Web pages generated by departments in your organization. You have to import these Web pages, index them, and provide your colleagues with access to reports that contain content from the internal Web pages. Numerous challenges exist because there is rarely one standard for creating in-house Web pages. Organizations tell me that there is one standard, but I know from looking at the exception files that what a boss tells me doesn’t match the content processing system’s rejected files. Some Web pages can’t be easily transformed for several reasons.

[a] A person unfamiliar with the “regular” system uses the Web page export feature in an application like Microsoft Word or Adobe FrameMaker. These pages will render in most browsers, but the same pages can choke a content processing subsystem. Some versions of Windows ship with the FrontPage Web page authoring tool. A summer intern can generate pages that look like pages defined by the organization’s style manual, but these pages contain errant code that can break the content processing system. The fix for this problem is manual manipulation of rejected pages. If you can identify the specific problems, you can write a script to clean up these files; a sketch of this approach appears below. We use the UltraEdit tool for this purpose, but there are many editors that speed this work.

[b] Your organization uses a system that’s long in the tooth. Broadvision systems still lurk in the US government. Pages output from Broadvision contain distinct URLs each time a page is generated. For a single user, this is no problem. For a content processing system, you can find yourself looking at a stack of rejections with complex file names. You don’t know if the file is a duplicate or if it was rejected for the file name or for some Broadvision tag the content processing system did not understand. Depending on how certain systems are configured, you can find intermittent rejections. Figuring out what happened, why, and how to remediate the problem can be trivial or tough.
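For the Word and FrontPage export problem in [a], the cleanup can often be automated once you know what the proprietary markup looks like. The sketch below uses Python and BeautifulSoup to strip Office-specific tags and mso- style properties; the tag list is illustrative, not exhaustive.

```python
# Scrub HTML exported from Word or FrontPage so the content processing
# filter sees plain markup. A sketch only; extend the tag list to match
# the markup your exception files actually contain.
import re
from bs4 import BeautifulSoup, Comment

def scrub(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(["o:p", "w:sdt"]):
        tag.unwrap()  # keep any text, drop the proprietary wrapper
    for tag in soup.find_all("xml"):
        tag.decompose()  # Office metadata islands: drop entirely
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()  # remove <!--[if ...]--> conditional comments
    for tag in soup.find_all(style=True):
        # drop the mso-* style properties Word sprinkles on elements
        cleaned = re.sub(r"mso-[^;]+;?", "", tag["style"]).strip()
        if cleaned:
            tag["style"] = cleaned
        else:
            del tag["style"]
    return str(soup)
```

For the Broadvision duplicates in [b], hashing the page body rather than the URL before submission is one way to spot identical content hiding behind distinct file names.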

Third, XML files. You know that there is no single XML “flavor”. In an organization you can find business XML, math XML, plain vanilla XML, XML that isn’t recognized as XML, and XML that is filled with errors. Unless you plan for XML transformation, you may find yourself looking at a folder of rejected files. The content processing system’s log file may contain the unhelpful message, “File nnnn rejected. Unrecognized format.” You may find that XML files that worked swimmingly on Tuesday are rejected on Wednesday. The cause for this problem can be a slipstreamed fix that surfaces only when the content processing system rejects the file.
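A pre-flight parse catches many of these rejects before they reach the content processing system and gives you a real diagnostic instead of “Unrecognized format.” Here is a minimal sketch using Python’s standard library; the folder names are placeholders, and the check is well-formedness only, not validation against a DTD or schema.

```python
# Pre-flight check: try to parse each XML file and route the broken ones
# to a reject folder along with the parser's line/column diagnostics.
import pathlib
import shutil
import xml.etree.ElementTree as ET

INBOX = pathlib.Path("inbox")      # placeholder folder names
REJECTS = pathlib.Path("rejects")
REJECTS.mkdir(exist_ok=True)

for path in INBOX.glob("*.xml"):
    try:
        ET.parse(path)  # well-formedness only, not DTD/schema validation
    except ET.ParseError as err:
        print(f"{path.name}: {err}")  # reports line and column of the fault
        shutil.move(str(path), str(REJECTS / path.name))
```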

Let me mention one “bonus” content processing problem. You know that writing a script to extract data from a database works. You submit a query, and you get an output. Now you have to move the output into the content processing subsystem. The file begins to process, and then the system rejects it. You inspect the output file, and you don’t see any garbage characters. You rerun the script, and the content processing system rejects the file again. The fix may be to change the format of the output file. You can run into problems with comma-delimited files, possibly because a database cell contains an internal comma; the content processing system sees the errant comma, starts processing, and then trashes the subsequent data. You will also find issues arising from output file size, dependencies such as pointers to external objects, and other quirks that make database output transformation interesting.
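One defensive habit is to make the export script quote every field explicitly, so an embedded comma cannot shift the fields that follow it. A minimal sketch using Python’s csv module and SQLite; the database, table, and column names are placeholders for your own source.

```python
# Export database rows with explicit quoting so a comma inside a cell
# cannot be mistaken for a field separator downstream.
import csv
import sqlite3

conn = sqlite3.connect("reports.db")  # placeholder database
rows = conn.execute("SELECT id, title, summary FROM documents")  # placeholder query

with open("export.csv", "w", newline="", encoding="utf-8") as fh:
    writer = csv.writer(fh, quoting=csv.QUOTE_ALL)  # quote every field
    writer.writerow(["id", "title", "summary"])
    writer.writerows(rows)

conn.close()
```

Tab-delimited output with the same explicit quoting is another option when the downstream filter mishandles quoted commas.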

Addressing the Problems

Let me offer you some ideas for solving these transformation problems. Before you look at these tips, keep in mind that transformation, file massaging, and format conversion can be intermittent and, therefore, expensive to troubleshoot and fix.

  1. For PDF files, outsource conversion and transformation of rejected PDFs. If you try to manipulate the “flavors” of PDF yourself or with full-time equivalents, you will lose control of costs. Understand that certain PDFs can only be transformed by rekeying. The pragmatic approach is to process only the PDFs that your system recognizes unless you have the budget to achieve near comprehensive transformation.
  2. For in-house Web page variance, you must be pragmatic and politically savvy. A standard system that is locked down to generate a specific type of Web page is the most reliable way to eliminate maverick coding. You can write scripts to deal with the idiosyncrasies of pages from your content management system or systems. If there is no standard, you will face a random transformation job each time a maverick page is submitted to the text processing system.
  3. The XML problem is really tough. A standard document type definition is a great idea in theory, but it is difficult in reality. I have found that a variety of commercial tools (including some that are no longer available commercially) and custom scripts can be used to figure out the problem. If a group of XML documents is rejected, you can set up a pre-processing pipeline for those documents. If the problems are random, erratic, or intermittent, you can [a] set up a transformation procedure in your department, [b] outsource the transformation, or [c] reject the XML documents your system cannot transform. XML is often another file transformation headache, not a cure for some types of conversion problems.
  4. The database problems are common. The good news is that once you know the quirks of the database export file, you can write a script to make the exports compatible with your content processing conversion system. Hooray!

To wrap up, you will find that transformation, the process of moving a file from one format to another, is crucial to your budget and mental health. I’m old school in that I want to lock down variables. I want to do zero scripting. I want zero outsourcing. I’ve been trying to reach this goal for 30 years, and I still find transformation headaches. As soon as I figure out a Web page, some whiz kid generates a Flash interface and forces the CMS to spit out that content object.

With transformation costs able to gobble as much as one-third of your budget, planning and requirements are becoming more important. Now we just have to get that 19-year-old intern to make content the “right way”. It’s good to be an optimist.

Stephen Arnold, May 15, 2008
