In Big Data, Bad Data Does Not Matter. Not So Fast, Mr. Slick
April 8, 2024
This essay is the work of a dumb dinobaby. No smart software required.
When I hear “With big data, bad data does not matter. It’s the law of big numbers. Relax,” I chuckle. Most data present challenges. First, figuring out which data are accurate can be difficult. But the notion of “relax” does not cheer me. Then one must consider data which have been screwed up by a bad actor, a careless graduate student, a low-rent research outfit, or someone who thinks errors are not possible.
The young vendor is confident that his tomatoes and bananas are top quality. The color of the fruit means nothing. Thanks, MSFT Copilot. Good enough, like the spoiled bananas.
“Data Quality Getting Worse, Report Says” offers some data (which may or may not be on the mark) which remind me to be skeptical of information available today. The Datanami article points out:
According to the company’s [DBT Labs’] State of Analytics Engineering 2024 report released yesterday, poor data quality was the number one concern of the 456 analytics engineers, data engineers, data analysts, and other data professionals who took the survey. The report shows that 57% of survey respondents rated data quality as one of the three most challenging aspects of the data preparation process. That’s a significant increase from the 2022 State of Analytics Engineering report, when 41% indicated poor data quality was one of the top three challenges.
The write up offers several other items of interest; for example:
- Questions about who owns the data
- Integration or fusion of multiple data sources
- Documenting data products; that is, the editorial policy of the producer / collector of the information.
This flashing yellow light about data seems to be getting brighter. The implication of the report is that data quality “appears” to be heading downhill. The write up quotes Jignesh Patel, computer science professor at Carnegie Mellon University, to underscore the issue:
“Data will never be fully clean. You’re always going to need some ETL [extract, transform, and load] portion. The reason that data quality will never be a “solved problem,” is partly because data will always be collected from various sources in various ways, and partly because data quality lies in the eye of the beholder. You’re always collecting more and more data. If you can find a way to get more data, and no one says no to it, it’s always going to be messy. It’s always going to be dirty.”
But what about the assertion that in big data, bad data will be a minor problem? That assertion may be based on a lack of knowledge about some of the weak spots in data gathering processes. In the last six months, my team and I have encountered these issues:
- The source of the data contained a flaw which made it impossible to determine which items were candidates for filtering out
- The aggregator had zero controls because it acquired data from another party and did no homework other than hyping a new data set
- Flawed data filled the exception folder with such a large percentage of the information that remediation was not possible due to time and cost constraints (see the sketch after this list)
- Automated systems are indiscriminate, and few people (sometimes no one) pay close attention to inputs.
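To make these failure modes concrete, here is a minimal sketch, in Python, of the kind of row-level validation a small team might bolt onto an ingest process. The field names (record_id, source, value) and the checks are hypothetical, not drawn from any particular system; the point is that records failing even trivial tests pile up in an exception set, and once that set is a large enough share of the feed, hand remediation stops being affordable.

```python
import csv
import io

# Hypothetical row-level checks; a real pipeline would use source-specific rules.
def validate_row(row):
    """Return a list of problems found in one record (empty list = clean)."""
    problems = []
    if not row.get("record_id"):
        problems.append("missing record_id")
    if not row.get("source"):
        problems.append("unknown source")  # no way to trace provenance
    try:
        float(row.get("value", ""))
    except ValueError:
        problems.append("non-numeric value")
    return problems

def split_clean_and_exceptions(rows):
    """Route clean records onward; park everything else in an exception list."""
    clean, exceptions = [], []
    for row in rows:
        problems = validate_row(row)
        if problems:
            exceptions.append({**row, "problems": "; ".join(problems)})
        else:
            clean.append(row)
    return clean, exceptions

if __name__ == "__main__":
    # Tiny inline sample standing in for an aggregator feed.
    sample = io.StringIO(
        "record_id,source,value\n"
        "1,vendor_a,42.0\n"
        ",vendor_a,17.5\n"
        "3,,not_a_number\n"
    )
    clean, exceptions = split_clean_and_exceptions(csv.DictReader(sample))
    print(f"clean: {len(clean)}, exceptions: {len(exceptions)}")
    # If exceptions dominate the feed, fixing them by hand is rarely affordable.
```

Even this toy example illustrates why “relax” is bad advice: the bad rows do not vanish into the law of big numbers; they just accumulate where no one looks.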
I agree that data quality is a concern. However, efficiency trumps old-fashioned controls and checks applied by subject matter experts and trained specialists. The fix will be smart software, which will be cheaper and more opaque. The assumption that big data will be self-healing may not be accurate, but it sounds good.
Stephen E Arnold, April 8, 2024