A Data Taboo: Poisoned Information But We Do Not Discuss It Unless… Lawyers
October 25, 2022
In a conference call yesterday (October 24, 2022), I mentioned one of my laws of online information; specifically, digital information can be poisoned. The venom can be administered by a numerically adept MBA or a junior college math major taking short cuts because data validation is hard work. The person on the call was mildly surprised because the notion of open source and closed source “facts” intentionally weaponized is an uncomfortable subject. I think the person with whom I was speaking blinked twice when I pointed what should be obvious to most individuals in the intelware business. Here’s the pointy end of reality:
Most experts and many of the content processing systems assume that data are good enough. Plus, with lots of data any irregularities are crunched down by steamrolling mathematical processes.
The problem is that articles like “Biotech Firm Enochian Says Co Founder Fabricated Data” makes it clear that MBA math as well as experts hired to review data can be caught with their digital clothing in a pile. These folks are, in effect, sitting naked in a room with people who want to make money. Nakedness from being dead wrong can lead to some career turbulence; for example, prison.
The write up reports:
Enochian BioSciences Inc. has sued co-founder Serhat Gumrukcu for contractual fraud, alleging that it paid him and his husband $25 million based on scientific data that Mr. Gumrukcu altered and fabricated.
The article does not explain precisely how the data were “fabricated.” However, someone with Excel skills or access to an article like “Top 3 Python Packages to Generate Synthetic Data” and Fiverr.com or similar gig work site can get some data generated at a low cost. Who will know? Most MBAs math and statistics classes focus on meeting targets in order to get a bonus or amp up a “service” fee for clicking a mouse. Experts who can figure out fiddled data sets take the time if they are motivated by professional jealousy or cold cash. Who blew the whistle on Theranos? A data analyst? Nope. A “real” journalist who interviewed people who thought something was goofy in the data.
My point is that it is trivially easy to whip up data to support a run at tenure or at a group of MBAs desperate to fund the next big thing as the big tech house of cards wobbles in the winds of change.
Several observations:
- The threat of bad or fiddled data is rising. My team is checking a smart output by hand because we simply cannot trust what a slick, new intelware system outputs. Yep, trust is in short supply among my research team.
- Individual inspection of data from assorted open and closed sources is accepted as is. The attitude is that the law of big numbers, the sheer volume of data, or the magic of cross correlation will minimize errors. Sure these processes will, but what if the data are weaponized and crafted to avoid detection? The answer is to check each item. How’s that for a cost center?
- Uninformed individuals (yep, I am including some data scientists, MBAs, and hawkers of data from app users) don’t know how to identify weaponized data nor know what to do when such data are identified.
Does this suggest that a problem exists? If yes, what’s the fix?
[a] Ignore the problem
[b] Trust Google-like outfits who seek to be the source for synthetic data
[c] Rely on MBAs
[d] Rely on jealous colleagues in the statistics department with limited tenure opportunities
[e] Blink.
Pick one.
Stephen E Arnold, October 25, 2022