Synthetic Data: Yes, They Are a Thing
March 13, 2023
“Real” data (that is, data generated by humans) are expensive to capture, normalize, and manipulate. But those “real” data are important. Unfortunately, some companies have sucked up real data and integrated those items into products and services. Now regulators are awakening from a decades-long slumber and taking a look into the actions of certain data companies. More importantly, a few big data outfits are aware of [a] the costs and [b] the risks of real data.
Enter synthetic data.
If you are unfamiliar with the idea, navigate to “What is Synthetic Data? The Good, the Bad, and the Ugly.” The article states:
The privacy engineering community can help practitioners and stakeholders identify the use cases where synthetic data can be used safely, perhaps even in a semi-automated way. At the very least, the research community can provide actionable guidelines to understand the distributions, types of data, tasks, etc. where we could achieve reasonable privacy-utility tradeoffs via synthetic data produced by generative models.
Helpful, correct?
The article does not point out two things which I find of interest.
First, the amount of money a company can earn by operating efficient synthetic data factories is likely to be substantial. Like other digital products, the upside can be profitable and give the “owner” of the synthetic data market an IBM-type of old-school lock-in.
Second, synthetic data can be weaponized intentionally, via either data poisoning or algorithm shaping.
I just wanted to point out that a useful essay does not explore what may be two important attributes of synthetic data. Will regulators rise to the occasion? Unlikely.
Stephen E Arnold, March 13, 2023