Data Augmentation: Is a Step Missing or Mislocated?

August 6, 2014

I read “Data Warehouse Augmentation, Part 4.” You can find the write up at http://ibm.co/1obWXDh. There are other sections of the write up, but I want to focus on the diagrams in this fourth chapter/section.

IBM is working overtime to generate additional revenues. Some of the ideas are surprising; for example, positioning Vivisimo’s metasearch function as a Big Data solution or buying Cybertap and then making the quite valuable technology impossible to find unless one is an intelligence procurement official. Then there is Watson, and I am just not up to commenting on this natural language processing system.

To the matter at hand. There is basic information in this write up about specific technical components of a Big Data solution. The words, for the most part, will not surprise anyone who has looked at marketing collateral from any of the Big Data vendors/integrators.

What is fascinating about the write up is the wealth of diagrams in the document. I worked through the text and the diagrams and I noticed that one task is not identified as important; specifically, the conversion of source content into a file type or form that the content processing system can process.

Here’s an example. First the IBM diagram:

[Diagram from the IBM document. Source: IBM, Data Warehouse Augmentation, 2014.]

Notice that after “staging”, there is a function described in time-honored database speak, “ETL.” Now “extract, transform, and load” is a very important step. But is there a step that precedes ETL?

How can one extract from disparate content if a connector is not available or the source system cannot support file transfers, direct access, or reports that reflect in-memory data?

In my experience, there will be different methods of acquiring content to process. There are internal systems. If there is an ancient AS/400, some work is required to generate outputs that provide the data required. Because of the way the AS/400 handles its memory, some care is needed to get the data, and the updates not yet written to disk, without corrupting the in-memory information. We have addressed this “memory fragility” by using a standalone machine that accepts an output from the AS/400 and then disconnects. The indexing system then connects to the standalone machine to pick up the AS/400 outputs. Clunky? You bet. But there are some upsides. To learn about the excitement of direct interaction with an AS/400, just try some real-time data acquisition. Let me know how this works out for you.
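
For the curious, here is a minimal sketch of that clunky arrangement, assuming the AS/400 drops finished export files on the standalone machine and the indexing system polls a directory on that box. The paths, the file naming convention, and the polling approach are my own illustration, not anything IBM prescribes.

```python
# Hypothetical sketch: pick up AS/400 export files from a standalone staging
# machine instead of touching the AS/400's in-memory data directly.
# All paths and naming conventions are illustrative.
import shutil
from pathlib import Path

STAGING_DIR = Path("/mnt/as400_staging/outbox")   # exports dropped here by the AS/400
PICKUP_DIR = Path("/var/indexer/inbox")           # where the indexing system reads from


def collect_exports() -> int:
    """Copy completed export files from the staging machine, then leave it alone."""
    moved = 0
    for export in sorted(STAGING_DIR.glob("*.done")):  # only files the AS/400 finished writing
        shutil.copy2(export, PICKUP_DIR / export.name)
        export.unlink()                                # remove from staging so it is processed once
        moved += 1
    return moved


if __name__ == "__main__":
    print(f"Collected {collect_exports()} AS/400 export file(s)")
```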

The same type of care is often needed with the content assembled for the data warehouse pipeline. Let me illustrate this. Assume the data warehouse will obtain data from these sources: internal legacy systems, third party providers, custom crawls with the content residing on a hosted service, and direct data acquisition from mobile devices that feed information into a collection point parked at Amazon.

Now each of these content streams has different feathers in its war bonnet. Some of the data will be well formed XML. Some will be JSON. Some will be a proprietary format unique to the source. For each file type, there will be examples of content objects that are different, due to a vendor format change or a glitch.

These disparate content objects, therefore, have to be processed before extraction can occur. So has IBM put ETL in the wrong place in this diagram, or has IBM omitted the pre-processing (normalization) operation?
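
To make the missing step concrete, here is a rough sketch of a normalization pass, assuming three illustrative feeds: well formed XML, JSON, and a pipe delimited proprietary format. The field names, the target record layout, and the exception handling are my assumptions, not IBM's architecture.

```python
# Minimal sketch of the pre-processing (normalization) step that precedes ETL.
# The record layout and source types are made up for illustration.
import json
import xml.etree.ElementTree as ET


def normalize(raw: str, source_type: str) -> dict:
    """Convert a raw content object into one common record the ETL stage can extract from."""
    if source_type == "xml":
        root = ET.fromstring(raw)
        return {"id": root.findtext("id"), "body": root.findtext("body")}
    if source_type == "json":
        obj = json.loads(raw)
        return {"id": str(obj["id"]), "body": obj["body"]}
    if source_type == "proprietary":
        record_id, body = raw.split("|", 1)  # vendor-specific layout, illustrative only
        return {"id": record_id, "body": body}
    raise ValueError(f"No normalizer for source type: {source_type}")


# A malformed object should land in an exceptions store, not disappear silently.
exceptions = []
for raw, kind in [("<doc><id>1</id><body>ok</body></doc>", "xml"),
                  ('{"id": 2, "body": "ok"}', "json"),
                  ("broken record with no delimiter", "proprietary")]:
    try:
        print(normalize(raw, kind))
    except Exception as err:
        exceptions.append((kind, raw, str(err)))

print(f"{len(exceptions)} object(s) routed to the exceptions folder")
```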

In our experience, content that cannot be processed is not available to the system. If big chunks of content end up in the exceptions folder, the resulting content processing may be flawed. One of the data points that must be checked is the number of content objects that can be normalized in a pre-processing stream (see the sketch after the list below). We have encountered situations like these. Your mileage may vary:

  1. Entire streams of certain types of content are exceptions, so the resulting indexing does not contain the data. Example: outputs from certain intercept systems.
  2. Streams of content skip non-processable content without writing exceptions to a file, due to configuration or resource availability.
  3. Streams of content are automatically “capped” when the processing system cannot keep pace. When the system accepts more content, it does not pull information from a cache or storage pool. The system just ignores the information it was unable to process.
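
Here is the sort of check I have in mind, sketched in Python. The audit function and the five percent threshold are assumptions for illustration; the point is that an object the system cannot process should be counted, not silently discarded.

```python
# Hedged sketch of the pre-processing audit described above: count how many
# content objects normalize versus how many become exceptions, and flag a
# stream that is quietly losing content. The threshold is an assumption.
from collections import Counter

EXCEPTION_THRESHOLD = 0.05  # assumed: flag a stream losing more than 5% of its objects


def audit_stream(results: list) -> dict:
    """results holds True for each object that normalized, False for each exception."""
    counts = Counter(results)
    total = len(results)
    rate = counts[False] / total if total else 0.0
    return {
        "processed": counts[True],
        "exceptions": counts[False],
        "exception_rate": rate,
        "flagged": rate > EXCEPTION_THRESHOLD,
    }


# Example: a stream where one object in ten failed to normalize gets flagged.
print(audit_stream([True] * 90 + [False] * 10))
```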

There are fixes for each of these situations. What we have learned is that this pre-processing function can be very expensive, have an impact on the reliability of the outputs from the data warehousing system when queried, and generate a bottleneck that affects downstream processes.

After decades of data warehousing refinement, why does this problem keep surfacing?

The answer is that recycling traditional thinking about content processing is much easier than figuring out what causes a complex system to derail itself. I think that may be part of the reason the IBM diagram may be misleading.

Pre-processing can be time-consuming, hungry for machine resources, and very expensive to implement.

Stephen E Arnold, August 6, 2014
