Data Intake: Still a Hassle
April 21, 2016
I read “Big Data’s Biggest Problem: It’s Too Hard to Get the Data In.” Here’s a quote I noted:
According to a study by data integration specialist Xplenty, a third of business intelligence professionals spend 50% to 90% of their time cleaning up raw data and preparing to input it into the company’s data platforms. That probably has a lot to do with why only 28% of companies think they are generating strategic value from their data.
My hunch is that with the exciting hyperbole about Big Data, the problem of normalizing, cleaning, and importing data is ignored. The challenge of taking file A in a particular file format and converting to another file type is indeed a hassle. A number of companies offer expensive filters to perform this task. The one I remember is Outside In, which sort of worked. I recall that when oddball characters appeared in the file, there would be some issues. (Does anyone remember XyWrite?) Stellent purchased Outside In in order to move content into that firm’s content management system. Oracle purchased Stellent in 2006. Then Kapow “popped” on the scene. The firm promoted lots of functionality, but I remember it as a vendor who offered software which could take a file in one format and convert it into another format. Kofax (yep, the scanner oriented outfit) bought Kapow to move content from one format into one that Kofax systems could process. Then Lexmark bought Kofax and ended up with Kapow. With that deal, Palantir and other users of the Kapow technology probably had a nervous moment or are now having a nervous moment as Lexmark marches toward a new owner. Entropy, a French outfit, was a file conversion outfit. It sold out to Salesforce. Once again, converting files from Type A to another desired format seems to have been the motivating factor.
Let us not forget the wonderful file conversion tools baked into software. I can save a Word file as an RTF file. I can import a comma separated file into Excel. I can even fire up FrameMaker and save a Dot fm file as RTF. In fact, many programs offer these import and export options. The idea is to lessen the pain of having a file in one format which another system cannot handle. Hey, for fun, try opening a macro filled XyWrite file in FrameMaker or InDesign. Just change the file extension to one the system thinks it recognizes. This is indeed entertaining.
The write up is not interested in the companies which have sold for big bucks because their technology could make file conversion a walk in the Hounz Lane Park. (Watch out for the rats, gentle reader.) The write up points out three developments which will make the file intake issues go away:
- The software performing file conversion “gets better.” Okay, I have been waiting for decades for this happy time to arrive. No joy at the moment.
- “Data preparers become the paralegals of data science.” Now that’s a special idea. I am not clear on what a “data preparer” is, but it sounds like a task that will be outsourced pretty quickly to some country far from the home of NASCAR.
- “Artificial intelligence” will help cleanse data. Excuse me, but smart software has been operative in file conversion methods for quite a while. In my experience, the exception files keep on piling up.
What is the problem with file conversion? I don’t want to convert this free blog post into a lengthy explanation. I can highlight five issues which have plagued me and my work in file conversion for many years:
First, file types change over time. Some of the changes are not announced. Others like the Microsoft Word XML thing are the subject of months-long marketing. The problem is that unless the outfit responsible for the file conversion system creates a fix, the exception files can overrun a system’s capacity to keep track of problems. If someone is asleep at the switch, data in the exception folder can have an adverse impact on some production systems. Loss of data is interesting but trashing the file structure is a carnival. Who does not pay attention? In my experience, vendors, licensees, third parties, and probably most of the people responsible for a routine file conversion task.
Second, the thrill of XML is that it is not particularly consistent. Somewhere along the line, creativity takes precedence over well-formedness. How does one deal with a couple hundred thousand XML files in an exception folder? What do you think about deleting them?
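Before anyone reaches for the delete key, the files can at least be sorted. Here is a minimal Python sketch, assuming the exception files carry an .xml extension and that "well formed" means the standard library parser accepts them. The function name triage_xml is my own invention for illustration, not part of any conversion product.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def triage_xml(folder):
    """Split the *.xml files in `folder` into well-formed and malformed lists.

    Hypothetical helper for illustration only. It checks well-formedness,
    not validity against any schema or vendor dialect.
    """
    well_formed, malformed = [], []
    for path in sorted(Path(folder).glob("*.xml")):
        try:
            ET.parse(str(path))  # raises ParseError if the file is not well formed
            well_formed.append(path.name)
        except ET.ParseError as err:
            # Keep the parser's message so a human can see why the file failed.
            malformed.append((path.name, str(err)))
    return well_formed, malformed
```

The malformed list, with the parser's complaint attached to each file name, gives a person something to review. It does nothing for files that are well formed but use a structure the target system does not expect.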
Third, the file conversion software works as long as the person creating a document does not use Fancy Dan “inserts” in the source document. Problems arise from videos, certain links, macros, and odd ball formatting of the source document. Yep, some folks create text in Excel and wonder why the resulting text is a bit of a mess.
Fourth, workflows get screwed up. A file conversion system is semi smart. If a process creates a file with an unrecognized extension, the file conversion system fills the exception folder. But what if one valid extension is changed to a supported but incorrect extension? Yep, XML users be aware that there are proprietary XML formats. The files converted and made available to a system are “sort of right.” Unfortunately sort of right in mission critical applications can have some interesting consequences.
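The crude version of the wrong-extension problem can be caught by looking at a file's first few bytes instead of trusting its name. What follows is a rough Python sketch, assuming a tiny hand-built table of leading-byte signatures. The SIGNATURES table and the function name extension_matches_content are hypothetical, and the check will not tell one proprietary XML dialect from another.

```python
from pathlib import Path

# Leading bytes for a few common formats. A deliberately tiny, illustrative
# table (an assumption for this sketch), not a complete registry.
SIGNATURES = {
    ".pdf": b"%PDF-",
    ".rtf": b"{\\rtf",
    ".zip": b"PK\x03\x04",
    ".docx": b"PK\x03\x04",  # DOCX is a ZIP container
    ".xml": b"<?xml",        # matches only when an XML declaration leads the file
}


def extension_matches_content(path):
    """Return True or False for known extensions, None for unknown ones."""
    path = Path(path)
    expected = SIGNATURES.get(path.suffix.lower())
    if expected is None:
        return None  # extension not in the table, so no opinion
    with open(path, "rb") as handle:
        head = handle.read(len(expected))
    return head == expected
```

A False result is a reason to route the file to a person rather than to the converter. A True result only says the first bytes look plausible, which is still “sort of right” territory.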
Fifth, attention to detail is often less popular than fiddling with one’s mobile phone or reading Facebook posts. Human inattention can make large scale data conversion fail. I have watched as a person of my acquaintance deleted the folder of exception files. Yo, it is time for lunch.
So what? Smart software makes certain assumptions. At this time, file intake is perceived as a problem which has been solved. My view is that file intake is a core function which needs a little bit more attention. I do not need to be told that smart software will make file intake pain go away.
Stephen E Arnold, April 21, 2016