Transformation: An Emerging “Hot” Niche

January 25, 2008

Transformation is a five-dollar word that means changing a file from one format to another. The trade magazines and professional writers often use data integration or normalization to refer to what amounts to taking a Word 97 document with a Dot DOC extension and turning it into a structured document in XML. These big words and phrases refer to a significant gotcha in behind-the-firewall search, content processing, and plain old moving information from one system to another.

Here’s a simple example of the headaches associated with what should be a seamless, invisible process after a half century of computing. The story:

You buy a new computer. Maybe a Windows laptop or a new Mac. You load a copy of Office 2007, write a proposal, save the file, and attach it to an email that says, “I’ve drafted the proposal we have to submit tomorrow before 10 am.” You send the email and go out with some friends.

In the midst of a friendly discussion about the merits of US democratic presidential contenders, your mobile rings. You hear your boss saying over the background noise, “You sent me a file I can’t open. I need the file. Where are you? In a bar? Do you have your computer so you can resend the file? No? Just get it done now!” Click here to read what ITWorld has to say on this subject. Also, there’s some user vitriol over the Word-to-Word compatibility hassle itself here. A workaround from Tech Addict is here.

Another scenario: you have a powerful new content processing system that churns through, according to the vendor’s technical specification, “more than 200 common file types.” You set up the content processing gizmo, aim it at the marketing department’s server, and click “Index.” You go home. When you arrive the next morning at 8 am, you find that the 60,000 documents in the folders you wanted indexed have become an index of 30,000 documents. Where are the other 30,000 documents? After a bit of fiddling, you discover the exception log and find that half of the documents you wanted indexed were not processed. You look up the error code and learn that it means, “File type not supported.”
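
If you want to see the damage quickly, a short script can tally the exception log. This is a minimal sketch; the tab-separated log format of (document path, error code) pairs is a hypothetical stand-in, since every vendor’s log looks different:

```python
# Tally rejected documents by error code from a content processing
# exception log. The log format here is hypothetical: one line per
# rejected document, tab-separated as "path<TAB>error code".
from collections import Counter

def tally_exceptions(log_path: str) -> Counter:
    counts: Counter = Counter()
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                counts[fields[1]] += 1  # second field holds the error code
    return counts

for code, total in tally_exceptions("exceptions.log").most_common():
    print(f"{code}: {total:,} documents")
```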

The culprit is the inability of one system to recognize and process a file. The reasons for the exceptions are many and often subtle. Let’s troubleshoot the first problem, the boss’s inability to open a Word 2007 file sent as an attachment to an email.

The problem is that the recipient is using an older version of Word. The sender saved the file in the newest version of Word’s XML format. You can recognize these files by their extension, Dot DOCX. What the sender should have done is save the proposal as [a] a Dot DOC file in an older “flavor” of Word’s DOC format; [b] a file in the now long-in-the-tooth RTF (Rich Text Format) type; or [c] a file in Dot TXT (ASCII) format. The fix is for the sender to resend the file in a format the recipient can view. But that one file can cost a person credibility points or the company a contract.
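
Resaving by hand works, but the chore can be scripted. Here is a minimal sketch that drives LibreOffice in headless mode to produce the three fallback formats; it assumes the soffice binary is installed and on the PATH, and the file name proposal.docx is hypothetical:

```python
# Rewrite a Dot DOCX file as Dot DOC, RTF, and plain text by calling
# LibreOffice in headless mode. Assumes "soffice" is on the PATH.
import subprocess

def convert(source: str, target_format: str, out_dir: str = ".") -> None:
    subprocess.run(
        ["soffice", "--headless", "--convert-to", target_format,
         "--outdir", out_dir, source],
        check=True,  # raise if the conversion fails
    )

for fmt in ("doc", "rtf", "txt"):
    convert("proposal.docx", fmt)
```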

The second scenario is more complicated. The marketing department’s server held a combination of Word files, Adobe Portable Document Format files with Dot PDF extensions, some Adobe InDesign files, some QuarkXPress files, some FrameMaker files, and some database files produced on a system no one knows much about except that the files came from a system no longer used by marketing. A bit of manual exploration revealed that the Adobe PDF files were password protected, so the content processing system rejected them. The content processing system lacked import filters to open the proprietary page layout and publishing program files, so it rejected them too. The mysterious files from the disused system were data dumps from an IBM CICS system. The content processing system opened them, found them unreadable, and threw those out as exceptions as well.
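
A pre-indexing inventory catches most of these surprises before anyone clicks “Index.” Below is a rough sketch: it walks a share, tallies extensions, and flags PDFs whose raw bytes contain an /Encrypt entry. The byte scan is a crude heuristic, not a real PDF parse, and the server path is hypothetical:

```python
# Walk a file share, count file types, and flag PDFs that look
# password protected. The b"/Encrypt" scan is a crude heuristic;
# a proper PDF library is the reliable way to check.
import os
from collections import Counter

def audit(root: str):
    extensions: Counter = Counter()
    locked_pdfs = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            extensions[ext] += 1
            if ext == ".pdf":
                path = os.path.join(dirpath, name)
                with open(path, "rb") as fh:
                    if b"/Encrypt" in fh.read():
                        locked_pdfs.append(path)
    return extensions, locked_pdfs

extensions, locked = audit("/mnt/marketing")  # hypothetical share
print(extensions.most_common())
print(f"{len(locked)} PDFs appear password protected")
```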

Now the nettles, painful nettles:

First, fixing the problem with any one file is disruptive but usually doable. The reputation damage done may or may not be repaired. At the very least, the sender’s evening was ruined, but the high-powered vice president was with a gaggle of upper-crust types arguing about an election’s impact on trust funds. To “fix” the problem, she had to leave her friends and redo her work: time consuming and annoying. The recipient — a senior VP — had to juggle his plans in order to meet the 10 am deadline. Instead of chilling with The Simpsons TV show, he had to dive into the proposal and shape the numbers under the pressure of the looming deadline.

We can now appreciate a 30,000 file problem. It is a very big problem. There’s probably no way to get the passwords to open some of the PDFs. So, the PDFs’ content may remain unreadable. The weird publishing formats have to be opened in the application that created them and then exported in a file format the content processing system understands. This is a tricky problem, maybe worth another Web log posting. An alternative is to print out hard copies of the files, scan them, use optical character recognition software to create ASCII versions, and then feed the ASCII versions of the files to the content processing system. (Note: some vendors make paper-to-ASCII systems to handle this type of problem.) Those IBM CICS files can be recovered, but an outside vendor may be needed if the system producing the files is no longer available in house. When the costs are added up, these 30,000 files can represent hundreds of hours of tedious work. Figure $60 per hour and a week’s work if everything goes smoothly, and you can estimate the minimum budget “hit”. No one knows the final cost because transformation is dicey. Cost naiveté is the reason my blood pressure spikes when a vendor asserts, “Our system will index all the information in your organization.” That’s baloney. You don’t know what will or won’t be indexed unless you perform a thorough inventory of files and their types and then run tests on a sample of each document type. That just doesn’t happen very often in my experience.
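
For the paper-to-ASCII drill mentioned above, the open source route looks roughly like this sketch. It assumes the Tesseract OCR engine plus the pytesseract and Pillow packages are installed; the scans/page_*.png file names are hypothetical:

```python
# OCR a folder of scanned page images and save the result as ASCII.
# Assumes Tesseract, pytesseract, and Pillow are installed.
import glob

import pytesseract
from PIL import Image

pages = []
for image_path in sorted(glob.glob("scans/page_*.png")):
    pages.append(pytesseract.image_to_string(Image.open(image_path)))

# errors="replace" keeps the write from dying on non-ASCII characters
with open("recovered.txt", "w", encoding="ascii", errors="replace") as out:
    out.write("\n".join(pages))
```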

Now you know what transformation is. It is a formal process of converting lead into content gold.

One Google wizard — whose name I will withhold so Google’s legions of super-attorneys don’t flock to rural Kentucky to get the sheriff to lock me up — estimated that up to 30 percent of information technology budgets are consumed by transformation. So for a certain chicken company’s $17 million IT budget, the transformation bill could be in the $5 to $6 million range. That translates to selling a heck of a lot of fried chicken. Let’s assume the wizard is wrong by a factor of two. This still means that $2 to $3 million is gnawed away by transformation.
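
The arithmetic is simple enough to check in a few lines; the 30 percent share and the $17 million budget are the figures quoted above:

```python
# Back-of-the-envelope transformation cost, using the post's figures.
it_budget = 17_000_000   # the chicken company's IT budget
wizard_share = 0.30      # the Google wizard's upper estimate

high = it_budget * wizard_share  # $5.1 million
low = high / 2                   # $2.55 million if the wizard is off by 2x
print(f"transformation bill: ${low:,.0f} to ${high:,.0f}")
```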

As organizations generate and absorb more digital information, what happens to transformation costs? The costs will go up. Whether the Google wizard is right or wrong, transformation is an issue that needs experienced hands minding the store.

The trigger for these two examples is a news item that the former president of Fast Search & Transfer, Ali Riaz, has started a new search company. Its USP (unique selling proposition) is data integration plus search and content processing. You can read Information Week’s take on this new company here.

In Beyond Search, I discuss a number of companies and their ability to transform and integrate data. If you haven’t experienced the thrill of a transformation job, a data integration project, or a structured data normalization task — you will. Transformation is going to be a hot niche for the next few years.

Understanding of what can be done with existing digital information is, in general, wide and shallow. Transformation demands narrow and deep understanding of a number of esoteric and almost insanely diabolical issues. Let me identify three from my own personal experience, learned at the street academy called Our Lady of Saint Transformation.

First, each publishing system has its own peculiarities about files produced by different versions of itself. InDesign 1.0 and 2.0 cannot open the most recent version’s files. There’s a workaround, but unless you are “into” InDesign, you have to climb this learning curve, and fast. I’m not picking on Adobe. The same intra-program incompatibilities plague QuarkXPress, PageMaker, the moribund Ventura, FrameMaker, and some high-end professional publishing systems.

Second, data files spit out by mainframe systems can be fun for a 20-something. There are some interesting data formats still in daily use. EBCDIC, or Extended Binary-Coded Decimal Interchange Code, is something some readers can learn to love. It is either that or figuring out how to fire up an IBM mainframe, reinstalling the application (good luck on that one, 20-somethings), restoring the data from DASD or flat-file backup tapes (another fun task for a recent computer science grad), and then outputting something the zippy new search or content processing system can convert in a meaningful way. (Note: “meaningful way” is important because when a filter gets confused, it produces some interesting metadata. Some glitches can require you to reindex the content if your index restore won’t work.)
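
Python, for what it is worth, ships a codec for the common US EBCDIC code page, so the simplest conversions are one-liners. A tiny illustration; packed-decimal fields and the many EBCDIC variants are the hard part, and they are not shown:

```python
# Decode an EBCDIC byte string into readable text using the cp037
# codec (US/Canada EBCDIC). Real mainframe dumps add fixed-width
# records and packed-decimal (COMP-3) fields, which need more work.
record = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])  # "HELLO" in EBCDIC

print(record.decode("cp037"))  # -> HELLO
```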

Third, the Adobe PDFs with their two layers of security can be especially interesting. If you have one level of password, you can open the file and maybe print it and copy some content from it. Or not. If not, you print the PDFs (if printing has not been disabled) and go through the OCR-to-ASCII drill. In my opinion, PDFs are like a digital albatross. These birds hang around one’s neck. Your colleagues want to “search” for the PDFs’ content in their behind-the-firewall system. When the marketing department is asked to produce the needed passwords, I often hear something discomforting. So it is no surprise to learn that some system users are not too happy.
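
To see which albatross you have, a quick probe helps. Here is a sketch using the pypdf package (one library among several; the file name is hypothetical). An empty user password often opens a file that carries only an owner password, while a real user password keeps the file shut:

```python
# Probe a PDF's password layers. Assumes the pypdf package; behavior
# with exotic encryption schemes varies, so treat this as a triage aid.
from pypdf import PdfReader

reader = PdfReader("mystery.pdf")  # hypothetical file
if not reader.is_encrypted:
    print("No password layer: index away.")
elif reader.decrypt(""):  # truthy when the empty user password works
    print("Owner password only: readable, though printing/copying may be limited.")
else:
    print("User password required: time for the OCR-to-ASCII drill.")
```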

You may find this post disheartening.

No!

This post is chock full of really good news. It makes clear that companies in the business of transformation are going to have more customers in 2008 and 2009. It’s good news for off-shore conversion shops. Companies that have potent transformation tools are going to have a growing list of prospects. Young college grads get more chances to learn the mainframe’s idiosyncrasies.

The only negative in this rosy scenario is for the individual who:

  • Fails to audit the file types and the amount of content in those file types
  • Skips determining which content must be transformed before the new system is activated
  • Ignores the budget implications of transformation
  • Assumes that 200 or 300 filters will do the job
  • Does not understand the implications behind a vendor’s statement along these lines: “Our engineers can create a custom filter for you if you don’t have time to do that scripting yourself.”

One final point about those 200 or more file types: vendors talk about them with gusto. Check to see whether the vendor is licensing filters from a third party. In certain situations, the included file type filters don’t support some of the more recent applications’ file formats. Other vendors “roll their own” filters. But filters can vary in efficacy because different people write them at different times with different capabilities. Try as they might, vendors can’t squash some of the filter nits and bugs. When you do some investigating, you may be able to substantiate my data, which suggest filters work on about two-thirds of the files you feed into the search or content processing system. Your investigation may prove my data incorrect. No problem. When you are processing 250,000 documents, the exception file becomes chunky even at the system’s “normal” two to three percent rejection rate. A thirty percent rate can be a show stopper.
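
To make the rates concrete, here is the arithmetic on the 250,000-document example, using the percentages quoted above:

```python
# Exception counts at different rejection rates for a 250,000-document run.
corpus = 250_000

for label, rate in [
    ("routine two to three percent", 0.025),
    ("one-third (filters handle about two-thirds)", 1 / 3),
]:
    print(f"{label}: about {int(corpus * rate):,} documents rejected")
```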

Stephen E. Arnold, January 25, 2008

Two Visions of the Future from the U.K.

January 17, 2008

Two different news items offered insights about the future of online. My focus is the limitations of key word search. I downloaded both articles, I must admit, eager to see whether my research would be disproved or augmented.

Whitebread

The first report appeared on January 14, 2008, in the (London) Times online in a news story “White Bread for Young Minds, Says University Professor.” In the intervening 72 hours, numerous comments appeared. The catch phrase is the coinage of Tara Brabazon, professor of Media Studies at the University of Brighton. She allegedly prohibits her students from using Google for research. The metaphor captures, in a memorable way, a statement attributed to her in the Times’s article: “Google is filling, but it does not necessarily offer nutritional content.”

The argument strikes a chord with me because [a] I am a dinosaur, preferring warm thoughts about “the way it was” as the snow of time accretes on my shoulders; [b] schools are perceived to be in decline because it seems that some young people don’t read, ignore newspapers except for the sporty pictures that enliven gray pages of newsprint, and can’t do mathematics reliably at take-away shops; and [c] I respond to the charm of a “sky is falling” argument.

Ms. Brabazon’s argument is solid. Libraries seem to be morphing into Starbucks with more free media on offer. Google–the icon of “I’m feeling lucky” research–allows almost anyone to locate information on a topic regardless of its obscurity or commonness. I find myself flipping my dinosaurian tail out of the way to get the telephone number of the local tire shop, check the weather instead of looking out my window, and convert worthless dollars into high-value pounds. Why remember? Google or Live.com or Yahoo are there to do the heavy lifting for me.

Educators are in the business of transmitting certain skills to students. When digital technology seeps into the process, the hegemony begins to erode, so the argument goes. Ms. Brabazon joins Neil Postman (Amusing Ourselves to Death: Public Discourse in the Age of Show Business, 1985) and more recently Andrew Keen (The Cult of the Amateur, 2007) among others in documenting the emergence of what I call the “inattention economy.”

I don’t like the loss of what weird expertise I possessed that allowed me to get good grades the old-fashioned way, but it’s reality. The notion that Google is more than an online service is interesting. I have argued in my two Google studies that Google is indeed much more than a Web search system growing fat on advertisers’ money. My research reveals little about Google as a corrosive effect on a teacher’s ability to get students to do their work using a range of research tools. Who wouldn’t use an online service to locate a journal article or book? I remember my comfortable little study nook in the rat hole in which I lived as a student: slogging through the Illinois winter, dealing with the Easter egg hunt in the library stuffed with physical books that were never shelved in sequence, and manually taking notes or feeding 10-cent coins into a foul-smelling photocopy machine that rarely produced a readable copy. Give me my laptop and a high-speed Internet connection. I’m a dinosaur, and I don’t want to go back to my research roots. I am confident that the professor who shaped my research style–Professor William Gillis, may he rest in peace–neither knew nor cared how I gathered my information, performed my analyses, and assembled the blather that whizzed me through university and graduate school.

If a dinosaur can figure out a better way, Tefloned along by Google, a savvy teen will too. Draw your own conclusions about the “whitebread” argument, but it does reinforce my research that suggests a powerful “pull” exists for search systems that work better, faster, and more intelligently than those today. Where there’s a market pull, there’s change. So, the notion of going back to the days of taking class notes on wax in wooden frames and wandering with a professor under the lemon trees is charming but irrelevant.

The Researcher of the Future

The British Library is a highly regarded, venerable institution. Some of its managers have great confidence that their perception of online in general and Google in particular is informed, substantiated by facts, and well considered. The Library’s Web site offers a summary of a new study called (and I’m not sure of the bibliographic niceties for this title): A Ciber [sic] Briefing Paper. Information Behaviour of the Researcher of the Future, 11 January 2008. My system’s spelling checker is flashing madly regarding the spelling of cyber as ciber, but I’m certainly not intellectually as sharp as the erudite folks at the British Library, living in rural Kentucky and working by the light of burning coal. You can download this 1.67-megabyte, 35-page document here: Researcher of the Future.

The British Library’s Web site article identifies the key point of the study as “research-behaviour traits that are commonly associated with younger users — impatience in search and navigation, and zero tolerance for any delay in satisfying their information needs — are now becoming the norm for all age-groups, from younger pupils and undergraduates through to professors.” The British Library has learned that online is changing research habits. (As I noted in the first section of this essay, an old dinosaur like me figured out that doing research online is faster, easier, and cheaper than playing “Find the Info” in my university’s library.)

My reading of this weirdly formatted document, which looks as if it was a PowerPoint presentation converted to a handout, identified several other important points. Let me share my reading of this unusual study’s findings with you:

  1. The study was a “virtual longitudinal study”. My take on this is that the researchers did the type of work identified as questionable in the “whitebread” argument summarized in the first section of this essay. If the British Library does “Googley research”, I posit that Ms. Brabazon and the other defenders of the “right way” to do research have lost their battle. Score: 1 for Google-Live.com-Yahoo. Nil for Ms. Brabazon and the British Library.
  2. Libraries will be affected by the shift to online, virtualization, pervasive computing, and other impedimentia of the modern world for affluent people. Score 1 for Google-Live.com-Yahoo. Nil for Ms. Brabazon, nil for the British Library, nil for traditional libraries. I bet librarians reading this study will be really surprised to hear that traditional libraries have been affected by the online revolution.
  3. The Google generation is supposedly composed of “expert searchers”. The reader learns, however, that most people are lousy searchers. Companies developing new search systems are working overtime to create smarter search systems because most online users–forget about age, gentle reader–are really terrible searchers and researchers. The “fix” is computational intelligence in the search systems, not in the users. Score 1 more for Google-Live.com-Yahoo and any other search vendor. Nil for the British Library, nil for traditional education. Give Ms. Brabazon a bonus point because she reached her conclusion without spending money for the CIBER researchers to “validate” the change in learning behavior.
  4. The future is “a unified Web culture,” more digital content, eBooks, and the Semantic Web. The word unified stopped my ageing synapses. My research yielded data that suggest the emergence of monopolies in certain functions and increasing fragmentation of information and markets. Unified is not a word I can apply to the online landscape. In my Bear Stearns report published in 2007 as Google’s Semantic Web: The Radical Change Coming to Search and the Profound Implications to Yahoo & Microsoft, I revealed that Google wants to become the Semantic Web.

Wrap Up

I look forward to heated debate about Google’s role in “whitebreading” youth. (Sounds similar to waterboarding, doesn’t it?) I also hunger for more reports from CIBER, the British Library, and folks a heck of a lot smarter than I am. Nevertheless, my Beyond Search study will assert the following:

  1. Search has to get smarter. Most users aren’t progressing as rapidly as young information retrieval experts.
  2. The traditional ways of doing research, meeting people, even conversing are being altered as information flows course through thought and action.
  3. The future is going to be different from what big thinkers posit.

Traditional libraries will be buffeted by bits and bytes and Boards of Directors who favor quill pens and scratching on shards. Publishers want their old monopolies back. Universities want that darned trivium too. These are notions I support but recognize that the odds are indeed long.

Stephen E. Arnold, January 17, 2008

Google 2008 Publishing Output

January 1, 2008

If you had any doubt about Google’s publishing activities, check out “Google Blogging in 2008”. The article by Susan Straccia here provides a rundown of the GOOG’s self-publishing output. Google has more than 120 Web logs. The article crows about the number of unique visitors and tosses in some Googley references to Google fun. Pushing the baloney aside, the message is clear: Google has an effective, global publishing operation focused exclusively on promoting Google. Toss in the Google Channel on YouTube.com, and the GOOG has a communication, promotion, and distribution mechanism that few of its rivals can match. In my opinion, not even a major TV network in the US can reach as many eyeballs as quickly and cheaply as Googzilla. Competitors have to find a way to match this promotional 30mm automatic Boeing M230 chain gun.

Stephen Arnold, January 1, 2008
