A Googley Rah Rah for Synthetic Data

April 27, 2023

Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

I want to keep this short. I know from experience that most people don’t think much about synthetic data. The idea is important, but so are plenty of other concepts nobody cares about. When was the last time Euler’s number came up at lunch?

A gaggle of Googlers extols the virtues of synthetic data in a 19-page arXiv document called “Synthetic Data from Diffusion Models Improves ImageNet Classification.” The main idea is that data derived from “real” data are an expedient way to improve some indexing tasks.
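For the record, the recipe is simple enough: fine-tune a big text-to-image diffusion model on ImageNet, sample labeled images from it, and feed those samples to a classifier alongside the real thing. The Googlers use their own in-house generator; here is a minimal sketch of the sampling step using an off-the-shelf pipeline from the Hugging Face diffusers library. The model ID, prompt, and file layout are my stand-ins, not theirs.

    # Stand-in for the paper's generator: the authors fine-tune their own
    # diffusion model on ImageNet; an off-the-shelf text-to-image model
    # plays that role here. Model ID, prompt, and paths are assumptions.
    import os
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    pipe = pipe.to("cuda")  # CPU sampling works, but very slowly

    # Write labeled "synthetic ImageNet" samples, one folder per class,
    # so a standard ImageFolder loader can consume them later.
    os.makedirs("data/synthetic/golden_retriever", exist_ok=True)
    for i in range(4):
        image = pipe("a photo of a golden retriever").images[0]
        image.save(f"data/synthetic/golden_retriever/{i:04d}.png")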

I am not sure that a quote from the paper will do much to elucidate this facet of the generative model world. The paper includes charts, graphs, references to math, footnotes, a few email addresses, some pictures, wonky jargon, and this conclusion:

And we have shown improvements to ImageNet classification accuracy extend to large amounts of generated data, across a range of ResNet and Transformer-based models.

The portion of this quote that is most important, in my experience, is the segment “across a range of ResNet and Transformer-based models.” Translated into Harrod’s Creek lingo, I think the wizards are saying, “Synthetic data is really good for text too.”

What’s bubbling beneath the surface of this archly-written paper? Here are my answers to this question:

  1. Synthetic data are a heck of a lot cheaper to generate for model training; therefore, embrace “good enough” and move forward. (Think profits and bonuses.)
  2. Synthetic data can be produced and updated more easily than fooling around with “real” data. Assembling training sets, testing, deploying, and reprocessing are time sucks. (There is more work to do than humanoids to do it when it comes to training, which is needed frequently for some applications.)
  3. Synthetic datasets can be smaller. Even baby Satan aka Sam Altman is down with synthetic data. Why? Elon could only buy so many Nvidia processing units. Thus, finding a way to train models with synthetic data works around a supply bottleneck. (A sketch of the mixing trick appears after this list.)
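For the curious, the trick is not exotic. Here is a minimal PyTorch sketch of what “training on synthetic data” boils down to: dump the generated images next to the real ones and let the classifier chew on both. The paths, model choice, and hyperparameters are my illustrations, not Google’s.

    import torch
    from torch.utils.data import ConcatDataset, DataLoader
    from torchvision import datasets, models, transforms

    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])

    # Real labeled images and diffusion-generated ones, each laid out one
    # folder per class so ImageFolder can label them. Paths are hypothetical.
    real_data = datasets.ImageFolder("data/real", transform=transform)
    synthetic_data = datasets.ImageFolder("data/synthetic", transform=transform)

    # The trick is just a union: the classifier never learns which images
    # came out of a generative model.
    loader = DataLoader(ConcatDataset([real_data, synthetic_data]),
                        batch_size=64, shuffle=True)

    model = models.resnet50(num_classes=1000)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()

    model.train()
    for images, labels in loader:  # one epoch over the mixed dataset
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()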

My summary of the Googlers’ article is much briefer than the original: Better, faster, cheaper.

You don’t have to pick one. Just believe the Google. Who does not trust the Google? Why not buy synthetic data and ready-to-deploy models for your next AutoGPT product? Google’s approach worked like a champ for online ads. Therefore, Google’s approach will work for your smart software. Trust Google.

Stephen E Arnold, April 27, 2023

