More Data Truths: Painful Stuff
July 4, 2016
I read “Don’t Let Your Data Lake Turn into a Data Swamp.” Nice idea, but there may be a problem which resists some folks’ best efforts to convert that dicey digital real estate into a tidy subdivision. Swamps are wetlands. As water levels change, the swamps come and go, ebb and flow as it were. More annoying is the fact that swamps are not homogeneous. Fens, muskegs, and bogs add variety for the happy hiker who strays into the Vasyugan Swamp as the spring thaw progresses.
The notion of a data swamp is an interesting one. I am not certain how zeros and ones in a storage medium relate to the Okavango Delta, but let’s give this metaphor a go. The write up reveals:
Data does not move easily. This truth has plagued the world of Big Data for some time and will continue to do so. In the end, the laws of physics dictate a speed limit, no matter what else is done. However, somewhere between data at rest and the speed of light, there are many processes that must be performed to make data mobile and useful. Integrating data and managing a data pipeline are two of these necessary tasks.
Okay, no swamp thing here.
The write up shifts gears and introduces the “data pipeline” and the concept of “keeping the data lake clean.”
Let’s step back. The motive force for this item about information in digital form seems to have several gears:
- Large volumes of data are a mess. Okay, but not all swamps are messes. The real problem is that whoever stored data did it without figuring out what to do with the information. Collection is not application.
- The notion of a data pipeline implies movement of information from Point A to Point B or through a series of processes which convert Input A into Output B. Data pipelines are easy to talk about, but in my experience these require knowing what one wants to achieve and then constructing a system to deliver. Talking about a data pipeline is not a data pipeline in my wetland.
- The concept of pollution seems to suggest that dirty data are bad. Making certain data are accurate and normalized requires effort.
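To make the pipeline and normalization points concrete, here is a minimal sketch in Python. It is not taken from the write up. The stage names, the sample record, and the country alias table are all hypothetical, chosen only to show that a pipeline is a fixed sequence of steps built after one decides what the output should look like.

```python
from functools import reduce

def strip_whitespace(record):
    # Trim stray spaces from every string field; leave other values alone.
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def normalize_country(record):
    # Map a few spelling variants to one canonical code.
    # The alias table is a made-up example, not a real standard.
    aliases = {"usa": "US", "u.s.": "US", "united states": "US"}
    country = record.get("country", "")
    return {**record, "country": aliases.get(country.lower(), country)}

def run_pipeline(record, stages):
    # Pass the record through each stage in order: Input A in, Output B out.
    return reduce(lambda rec, stage: stage(rec), stages, record)

# The stage list is the plan: decided up front, before any data is stored.
STAGES = [strip_whitespace, normalize_country]

raw = {"name": "  Ada ", "country": " usa "}
clean = run_pipeline(raw, STAGES)
print(clean)
```

The order of the stages matters in this sketch: the alias lookup only matches once the surrounding spaces are gone, which is the sort of detail one has to know before the data land in storage.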
My view is that this write up is trying to communicate the fact that Big Data is not too helpful if one does not take care of the planning before clogging a storage subsystem with digital information.
Seems obvious, but I suppose that’s why we have Love Canals and an ever-efficient Environmental Protection Agency to clean up shortcuts.
Stephen E Arnold, July 4, 2016