The Push for Synthetic Data: What about Poisoning and Bias? Not to Worry

October 6, 2022

Do you worry about data poisoning, use of crafted data strings to cause numerical recipes to output craziness, and weaponized information shaped by a disaffected MBA big data developer sloshing with DynaPep?

No. Good. Enjoy the outputs.

Yes. Too bad. You lose.

For a rah rah, it’s sunny in Slough look at synthetic data, read “Synthetic Data Is the Safe, Low-Cost Alternative to Real Data That We Need.”

The sub title is:

A new solution for data hungry AIs

And the sub sub title is:

Content provided by IBM and TNW.

Let’s check out what this IBM content marketing write up says:

One example is Task2Sim, an AI model built by the MIT-IBM Watson AI Lab that creates synthetic data for training classifiers. Rather than teaching the classifier to recognize one object at a time, the model creates images that can be used to teach multiple tasks. The scalability of this type of model makes collecting data less time consuming and less expensive for data hungry businesses.

What are the downsides of synthetic data? Downsides? Don’t be silly:

Synthetic data, however it is produced, offers a number of very concrete advantages over using real world data. First of all, it’s easier to collect way more of it, because you don’t have to rely on humans creating it. Second, the synthetic data comes perfectly labeled, so there’s no need to rely on labor intensive data centers to (sometimes incorrectly) label data. Third, it can protect privacy and copyright, as the data is, well, synthetic. And finally, and perhaps most importantly, it can reduce biased outcomes.

There is one, very small, almost miniscule issue stated in the write up; to wit:

As you might suspect, the big question regarding synthetic data is around the so-called fidelity — or how closely it matches real-world data. The jury is still out on this, but research seems to show that combining synthetic data with real data gives statistically sound results. This year, researchers from MIT and the MIT-IBM AI Watson Lab showed that an image classifier that was pretrained on synthetic data in combination with real data, performed as well as an image classifier trained exclusively on real data.

I loved the “seems to show” phrase I put in bold face. Seems is such a great verb. It “seems” almost accurate.

But what about that disaffected MBA developer fiddling with thresholds?

I know the answer to this question, “That will never happen.”

Okay, I am convinced. You know the “we need” thing.

Stephen E Arnold, October 6, 2022

Comments

Got something to say?





  • Archives

  • Recent Posts

  • Meta