Turning to AI for Better Data Hygiene
December 28, 2017
Most big data is flawed in some way, because humans are imperfect beings. That is the premise behind ZDNet’s article, “The Great Data Science Hope: Machine Learning Can Cure Your Terrible Data Hygiene.” Editor-in-Chief Larry Dignan explains:
The reality is enterprises haven’t been creating data dictionaries, meta data and clean information for years. Sure, this data hygiene effort may have improved a bit, but let’s get real: Humans aren’t up for the job and never have been. ZDNet’s Andrew Brust put it succinctly: Humans aren’t meticulous enough. And without clean data, a data scientist can’t create algorithms or a model for analytics.
Luckily, technology vendors have a magic elixir to sell you…again. The latest concept is to create an abstraction layer that can manage your data, bring analytics to the masses and use machine learning to make predictions and create business value. And the grand setup for this analytics nirvana is to use machine learning to do all the work that enterprises have neglected.
I know you’ve heard this before. The last magic box was the data lake where you’d throw in all of your information–structured and unstructured–and then use a Hadoop cluster and a few other technologies to make sense of it all. Before big data, the data warehouse was going to give you insights and solve all your problems along with business intelligence and enterprise resource planning. But without data hygiene in the first place enterprises replicated a familiar, but failed strategy: Poop in. Poop out.
What the observation lacks in eloquence it makes up for in insight: the whole data-lake concept was flawed from the start because it did not give adequate attention to data preparation. Dignan cites IBM’s Watson Data Platform as an example of the new machine-learning-based cleanup tools, and points to other noteworthy vendors investigating similar ideas, namely Alation, Io-Tahoe, Cloudera, and Hortonworks. Which cleaning tool will perform best remains to be seen, but Dignan seems sure of one thing: the data that enterprises have been diligently collecting for the last several years is as dirty as a dustbin lid.
Cynthia Murrell, December 28, 2017