Synthetic Data: From Science Fiction to Functional Circumscription

March 4, 2024

This essay is the work of a dumb humanoid. No smart software required.

Synthetic data are information produced by algorithms, not by real-world events. They are created using real-world data and numerical recipes. The appeal is that synthetic data are easier than collecting real-life information, cheaper than dealing with data from real life, and faster than fooling around with surveys, monitoring devices, and lawsuits. In theory, synthetic data are one promising way of skirting the expense of getting humans involved.

“What Is [a] Synthetic Sample – And Is It All It’s Cracked Up to Be?” tackles the subject of a synthetic sample, a topic which is one slice of the synthetic data universe. The article seeks “to uncover the truth behind artificially created qualitative and quantitative market research data.” I am going to avoid the question, “Is synthetic data useful?” because the answer is, “Yes.” Bean counters and those looking to find a way out of the pickle barrel filled with expensive brine are going to chase after the magic of algorithms producing data to do some machine learning magic.

In certain situations, fake flowers are super. Other times, the faux blooms are just creepy. Thanks, MSFT Copilot Bing thing. Good enough.

Are synthetic data better than real world data? The answer from my vantage point is, “It depends.” Fancy math can prove that for some use cases, synthetic data are “good enough”; that is, the data produce results close enough to what a “real” data set provides. Therefore, just use synthetic data. But for other applications, synthetic data might throw some sand in the well-oiled marketing collateral describing the wonders of synthetic data. (Some university research labs are quite skilled in PR speak, but the reality of their methods may not line up with the PowerPoints used to raise venture capital.)

This essay discusses a research project to figure out if a synthetic sample works or in my lingo if the synthetic sample is good enough. The idea is that as long as the synthetic data is within a specified error range, the synthetic sample can be used and may produce “reliable” or useful results. (At least one hopes this is the case.)
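The “within a specified error range” idea can be sketched in a few lines. This is a minimal illustration, not the method used in the cited research: the function names, the 10 percent tolerance, and the toy numbers are all assumptions made up for the example. It compares the mean and the spread of a synthetic sample against a real one and reports whether both land inside the tolerance.

```python
import statistics


def relative_error(real_value, synthetic_value):
    """Relative gap between a real statistic and its synthetic counterpart."""
    return abs(synthetic_value - real_value) / abs(real_value)


def good_enough(real, synthetic, tolerance=0.10):
    """Check mean and spread of a synthetic sample against a real one.

    `tolerance` is a hypothetical threshold; a real project would pick
    its own error range and probably more statistics than these two.
    """
    mean_error = relative_error(statistics.mean(real), statistics.mean(synthetic))
    spread_error = relative_error(statistics.stdev(real), statistics.stdev(synthetic))
    return {
        "mean_error": mean_error,
        "spread_error": spread_error,
        "passes": mean_error <= tolerance and spread_error <= tolerance,
    }


# Made-up survey scores on a 1-to-5 scale.
real_scores = [3, 4, 5, 4, 3, 5, 4, 4]
# A synthetic sample that nails the average but has no variation at all.
flat_scores = [4, 4, 4, 4, 4, 4, 4, 4]

result = good_enough(real_scores, flat_scores)
print(result)
```

The flat sample matches the real average exactly and still fails, because the spread check catches the missing variation. A test that looks only at the average would wave it through.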

I want to focus on one portion of the cited article and invite you to read the complete Kantar explanation.

Here’s the passage which snagged my attention:

… right now, synthetic sample currently has biases, lacks variation and nuance in both qual and quant analysis. On its own, as it stands, it’s just not good enough to use as a supplement for human sample. And there are other issues to consider. For instance, it matters what subject is being discussed. General political orientation could be easy for a large language model (LLM), but the trial of a new product is hard. And fundamentally, it will always be sensitive to its training data – something entirely new that is not part of its training will be off-limits. And the nature of questioning matters – a highly ’specific’ question that might require proprietary data or modelling (e.g., volume or revenue for a particular product in response to a price change) might elicit a poor-quality response, while a response to a general attitude or broad trend might be more acceptable.

These sentences present several thorny problems in academic speak. Let’s look at them in the vernacular of rural Kentucky where I live.

First, we have the issue of bias. Training data can be unintentionally or intentionally biased. Sample radical trucker posts on Telegram, and use those messages to train a model like Reor. That output is going to express views that some people might find unpalatable. Therefore, building a synthetic data recipe which includes this type of Telegram content is going to be oriented toward truck driver views. That’s good and bad.

Second, a synthetic sample may require mixing data from a “real” sample. That’s a common sense approach which reduces some costs. But will the outputs be good enough? The question then becomes, “Good enough for what applications?” Big, general questions about how a topic is presented might be close enough for horseshoes. Other topics like those focusing on dealing with a specific technical issue might warrant more caution or outright avoidance of synthetic data. Do you want your child or wife to die because the synthetic data about a treatment regimen was close enough for horseshoes? But in today’s medical structure, that may be what the future holds.

Third, many years ago, one of the early “smart” software companies was Autonomy, founded by Mike Lynch. In the 1990s, Bayesian methods were known but some, believe it or not, were classified and, thus, not widely known. Autonomy packed up some smart software in the Autonomy black box. Users of this system learned that the smart software had to be retrained because new terms and novel ideas not in the original training set were not findable by the neuro linguistic program’s engine. Yikes, retraining requires human content curation of data sets, time to retrain the system, and the expense of redeploying the brains of the black boxes. Clients did not like this and some, to be frank, did not understand why a product did not work like an MG sports car. Synthetic data has to be trained to “know” about new terms and avoid the “certain blindness” probability based systems possess.

Fourth, the topic of “proprietary data modeling” means big bucks. The idea behind synthetic data is that it is cheaper. Building proprietary training data and keeping it current is expensive. Is it better? Yeah, maybe. Is it faster? Probably not when humans are doing the curation, cleaning, verifying, and training.

The write up states:

But it’s likely that blended models (human supplemented by synthetic sample) will become more common as LLMs get even more powerful – especially as models are finetuned on proprietary datasets.

Net net: Synthetic data warrants monitoring. Some may want to invest in synthetic data set companies like Kantar, for instance. I am a dinobaby, and I like the old-fashioned Stone Age approach to data. The fancy math embodies sufficient risk for me. Why increase risk? Remember my reference to a dead loved one? That type of risk.

Stephen E Arnold, March 4, 2024

Cognitive Blind Spot 1: Can You Identify Synthetic Data? Better Learn.

October 5, 2023

Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

It has been a killer with the back-to-back trips to Europe and then to the intellectual hub of the old-fashioned America. In France, I visited a location allegedly the office of a company which “owns” the domain rrrrrrrrrrr.com. No luck. Fake address. I then visited a semi-sensitive area in Paris, walking around in the confused fog only a 78 year old can generate. My goal was to spot a special type of surveillance camera designed to provide data to a smart software system. The idea is that the images can be monitored through time so a vehicle making frequent passes of a structure can be flagged, its number tag read, and a bit of thought given to answer the question, “Why?” I visited with a friend and big brain who was one of the technical keystones of an advanced search system. He gave me his most recent book and I paid for my Orangina. Exciting.

One executive tells his boss, “Sir, our team of sophisticated experts reviewed these documents. The documents passed scrutiny.” One of the “smartest people in the room” asks, “Where are we going for lunch today?” Thanks, MidJourney. You do understand executive stereotypes, don’t you?

On the flights, I did some thinking about synthetic data. I am not sure that most people can provide a definition which will embrace the Google’s efforts in the money saving land of synthetic. I don’t think too many people know about Charlie Javice’s use of synthetic data to whip up JPMC’s enthusiasm for her company Frank Financial. I don’t think most people understand that when typing a phrase into the Twitch AI Jesus that software will output a video and mostly crazy talk along with some Christian lingo.

The purpose of this short blog post is to present an example of synthetic data and conclude by revisiting the question, “Can You Identify Synthetic Data?” The article I want to use as a hook for this essay is from Fortune Magazine. I love that name, and I think the wolves of Wall Street find it euphonious as well. Here’s the title: “Delta Is Fourth Major U.S. Airline to Find Fake Jet Aircraft Engine Parts with Forged Airworthiness Documents from U.K. Company.”

The write up states:

Delta Air Lines Inc. has discovered unapproved components in “a small number” of its jet aircraft engines, becoming the latest carrier and fourth major US airline to disclose the use of fake parts.  The suspect components — which Delta declined to identify — were found on an unspecified number of its engines, a company spokesman said Monday. Those engines account for less than 1% of the more than 2,100 power plants on its mainline fleet, the spokesman said. 

Okay, bad parts can fail. If the failure is in a critical component of a jet engine, the aircraft could — note that I am using the word could — experience a catastrophic failure. Translating catastrophic into more colloquial lingo, the sentence means catch fire and crash or something slightly less terrible; namely, catch fire, explode, eject metal shards into the tail assembly, or make a loud noise and emit smoke. Exciting, just not terminal.

I don’t want to get into how the synthetic or fake data made its way through the UK company, the UK bureaucracy, the Delta procurement process, and into the hands of the mechanics working in the US or offshore. The fake data did elude scrutiny for some reason. With money being of paramount importance, my hunch is that saving some money played a role.

If organizations cannot spot fake data when it relates to a physical and mission-critical component, how will organizations deal with fake data generated by smart software? The smart software can get it wrong because an engineer-programmer screwed up his or her math or the complex web of algorithms just generates unanticipated behaviors from dependencies no one knew to check and validate.

What happens when a computer, which many people believe is “always” more right than a human, says, “Here’s the answer”? Many humans will skip the hard work because they are in a hurry, have no appetite for grunt work, or are scheduled by a Microsoft calendar to do something else when the quality assurance testing is supposed to take place.

Let’s go back to the question in the title of the blog post, “Can You Identify Synthetic Data?”

I don’t want to forget this part of the title, “Better learn.”

JPMC paid out more than $100 million in November 2022 because some of the smartest guys in the room weren’t that smart. But get this. JPMC is a big, rich bank. People who could die because of synthetic data are a different kettle of fish. Yeah, that’s what I thought about as I flew Delta back to the US from Paris. At the time, I thought Delta had not fallen prey to the scam.

I was wrong. Hence, I “better learn” myself.

Stephen E Arnold, October 5, 2023

A Googley Rah Rah for Synthetic Data

April 27, 2023

Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

I want to keep this short. I know from experience that most people don’t think too much about synthetic data. The idea is important, but other concepts are important and no one really cares too much. When was the last time Euler’s Number came up at lunch?

A gaggle of Googlers extoll the virtues of synthetic in a 19 page ArXiv document called “Synthetic Data from Diffusion Models Improves ImageNet Classification.” The main idea is that data derived from “real” data are an expedient way to improve some indexing tasks.

I am not sure that a quote from the paper will do much to elucidate this facet of the generative model world. The paper includes charts, graphs, references to math, footnotes, a few email addresses, some pictures, wonky jargon, and this conclusion:

And we have shown improvements to ImageNet classification accuracy extend to large amounts of generated data, across a range of ResNet and Transformer-based models.

The specific portion of this quote which is quite important in my experience is the segment “across a range of ResNet and Transformer-based models.” Translating to Harrod’s Creek lingo, I think the wizards are saying, “Synthetic data is really good for text too.”

What’s bubbling beneath the surface of this archly-written paper? Here are my answers to this question:

  1. Synthetic data are a heck of a lot cheaper to generate for model training; therefore, embrace “good enough” and move forward. (Think profits and bonuses.)
  2. Synthetic data can be produced and updated more easily than fooling around with “real” data. Assembling training sets, tests, deploying and reprocessing are time sucks. (There is more work to do than humanoids to do it when it comes to training, which is needed frequently for some applications.)
  3. Synthetic datasets can be smaller. Even baby Satan aka Sam Altman is down with synthetic data. Why? Elon could only buy so many nVidia processing units. Thus finding a way to train models with synthetic data works around a supply bottleneck.

My summary of the Googlers’ article is much more brief than the original: Better, faster, cheaper.

You don’t have to pick one. Just believe the Google. Who does not trust the Google? Why not buy synthetic data and ready-to-deploy models for your next AutoGPT product? Google’s approach worked like a champ for online ads. Therefore, Google’s approach will work for your smart software. Trust Google.

Stephen E Arnold, April 27, 2023

Synthetic Data: Yes, They Are a Thing

March 13, 2023

“Real” data (that is, data generated by humans) are expensive to capture, normalize, and manipulate. But those “real” data are important. Unfortunately some companies have sucked up real data and integrated those items into products and services. Now regulators are awakening from a decades-long slumber and taking a look into the actions of certain data companies. More importantly, a few big data outfits are aware of [a] the costs and [b] the risks of real data.

Enter synthetic data.

If you are unfamiliar with the idea, navigate to “What is Synthetic Data? The Good, the Bad, and the Ugly.” The article states:

The privacy engineering community can help practitioners and stakeholders identify the use cases where synthetic data can be used safely, perhaps even in a semi-automated way. At the very least, the research community can provide actionable guidelines to understand the distributions, types of data, tasks, etc. where we could achieve reasonable privacy-utility tradeoffs via synthetic data produced by generative models.

Helpful, correct?

The article does not point out two things which I find of interest.

First, the amount of money a company can earn by operating efficient synthetic data factories is likely to be substantial. Like other digital products, the upside can be profitable and give the “owner” of the synthetic data market an IBM-type of old-school lock in.

Second, synthetic data can be weaponized either intentionally via data poisoning or algorithm shaping.

I just wanted to point out that a useful essay does not explore what may be two important attributes of synthetic data. Will regulators rise to the occasion? Unlikely.

Stephen E Arnold, March 13, 2023

The Push for Synthetic Data: What about Poisoning and Bias? Not to Worry

October 6, 2022

Do you worry about data poisoning, use of crafted data strings to cause numerical recipes to output craziness, and weaponized information shaped by a disaffected MBA big data developer sloshing with DynaPep?

No. Good. Enjoy the outputs.

Yes. Too bad. You lose.

For a rah rah, it’s sunny in Slough look at synthetic data, read “Synthetic Data Is the Safe, Low-Cost Alternative to Real Data That We Need.”

The sub title is:

A new solution for data hungry AIs

And the sub sub title is:

Content provided by IBM and TNW.

Let’s check out what this IBM content marketing write up says:

One example is Task2Sim, an AI model built by the MIT-IBM Watson AI Lab that creates synthetic data for training classifiers. Rather than teaching the classifier to recognize one object at a time, the model creates images that can be used to teach multiple tasks. The scalability of this type of model makes collecting data less time consuming and less expensive for data hungry businesses.

What are the downsides of synthetic data? Downsides? Don’t be silly:

Synthetic data, however it is produced, offers a number of very concrete advantages over using real world data. First of all, it’s easier to collect way more of it, because you don’t have to rely on humans creating it. Second, the synthetic data comes perfectly labeled, so there’s no need to rely on labor intensive data centers to (sometimes incorrectly) label data. Third, it can protect privacy and copyright, as the data is, well, synthetic. And finally, and perhaps most importantly, it can reduce biased outcomes.

There is one, very small, almost minuscule issue stated in the write up; to wit:

As you might suspect, the big question regarding synthetic data is around the so-called fidelity — or how closely it matches real-world data. The jury is still out on this, but research seems to show that combining synthetic data with real data gives statistically sound results. This year, researchers from MIT and the MIT-IBM AI Watson Lab showed that an image classifier that was pretrained on synthetic data in combination with real data, performed as well as an image classifier trained exclusively on real data.

I loved the “seems to show” phrase I put in bold face. Seems is such a great verb. It “seems” almost accurate.

But what about that disaffected MBA developer fiddling with thresholds?

I know the answer to this question, “That will never happen.”

Okay, I am convinced. You know the “we need” thing.

Stephen E Arnold, October 6, 2022

Synthetic Data: Cheap, Like Fast Food

May 25, 2022

Fabricated data may well solve some of the privacy issues around healthcare-related machine learning, but what new problems might it create? The Wall Street Journal examines the technology in, “Anthem Looks to Fuel AI Efforts with Petabytes of Synthetic Data.” Reporter Isabelle Bousquette informs us Anthem CIO Anil Bhatt has teamed up with Google Cloud to build the synthetic data platform. Interesting choice, considering the health insurance company has been using AWS since 2017.

The article points out synthetic data can refer to either anonymized personal information or entirely fabricated data. Anthem’s effort involves the second type. Bousquette cites Bhatt as well as AI and automation expert Ritu Jyoti as she writes:

“Anthem said the synthetic data will be used to validate and train AI algorithms that identify things like fraudulent claims or abnormalities in a person’s health records, and those AI algorithms will then be able to run on real-world member data. Anthem already uses AI algorithms to search for fraud and abuse in insurance claims, but the new synthetic data platform will allow it to scale. Personalizing care for members and running AI algorithms that identify when they may require medical intervention is a more long-term goal, said Mr. Bhatt. In addition to alleviating privacy concerns, Ms. Jyoti said another advantage of synthetic data is that it can reduce biases that exist in real-world data sets. That said, she added, you can also end up with data sets that are worse than real-world ones. ‘The variation of the data is going to be very, very important,’ said Mr. Bhatt, adding that he believes the variation in the synthetic data will ultimately be better than the company’s real-world data sets.”

The article notes the use of synthetic data is on the rise. Increasing privacy and reducing bias both sound great, but that bit about potentially worse data sets is concerning. Bhatt’s assurance is pleasant enough, but how will we know whether his confidence pans out? Big corporations are not exactly known for their transparency.

Cynthia Murrell, May 25, 2022

Quick Question: Fabricated or Synthetic Data?

March 24, 2022

I read “Evidence of Fabricated Data in a Vitamin C trial by Paul E Marik et al in CHEST.” Non-reproducibility appears to be a function of modern statistical methods. Okay. The angle in this article is:

… within about 5 minutes of reading the study it became overwhelmingly clear that it is indeed research fraud and the data is (sic) fabricated.

Synthetic data are fabricated. Some big outfits are into using machine generated data sort of related to real life data to save money.

Here’s my question:

What’s the difference between fabricated data and synthetic data?

I am leaning to “not much.” One might argue that the motive in a research paper is tenure. In other applications, maybe the goal is just efficiency and its close friend money. My larger concern is that embedding fabricated and / or synthetic data into applications may lead to some unexpected consequences. Hey, how about that targeting of a kinetic? Screwy ad targeting is one thing, but less benign situations can be easily conceptualized; for example, “We’re sorry. That smart car self driving module did not detect your mom in the crosswalk.”

Stephen E Arnold, March 24, 2022

Synthetic Data Are Better Than Data from Real Life. Does Better Mean Cheaper?

March 22, 2022

I read “When It Comes To AI, Can We Ditch The Datasets?” The answer, as you may have surmised, is, “Absolutely.”

Synthetic data is like polystyrene. Great for building giant garbage islands and not so great for the environment. So trade offs. What’s the big deal?

The write up explains:

To circumvent some of the problems presented by datasets, MIT researchers developed a method for training a machine learning model that, rather than using a dataset, uses a special type of machine-learning model to generate extremely realistic synthetic data that can train another model for downstream vision tasks.

What can one do with made up data about real life? That’s an easy one to answer:

Once a generative model has been trained on real data, it can generate synthetic data that are so realistic they are nearly indistinguishable from the real thing.

What can go wrong? According to the article, nothing.

Well, nothing might be too strong a word. The write up admits:

But he [a fake data wizard] cautions that there are some limitations to using generative models. In some cases, these models can reveal source data, which can pose privacy risks, and they could amplify biases in the datasets they are trained on if they aren’t properly audited.

Yeah, privacy and bias. The write up does not mention incorrect or off base guidance.

But that’s close enough for nothing for an expert hooked up with IBM Watson (yes, that Watson) and MIT (yes, the esteemed institution which was in financial thrall to one alleged human trafficker).

And Dr. Timnit Gebru’s concerns? Not mentioned. And what about the problems identified in Cathy O’Neil’s Weapons of Math Destruction? Not mentioned.

Hey, it was a short article. Synthetic data are a thing and they will help grade your child’s school work, determine who is a top performer, and invest your money so you can retire with no worries.

No worries, right.

Stephen E Arnold, March 22, 2022

Synthetic Data: The Future Because Real Data Is Too Inefficient

January 28, 2022

One of the biggest problems with AI advancement is the lack of comprehensive datasets. AI algorithms use datasets to learn how to interpret and understand information. The lack of datasets has resulted in biased aka faulty algorithms. The most notorious examples are “racist” photo recognition or light sensitivity algorithms that are unable to distinguish dark skin tones. VentureBeat shares that a new niche market has sprung up: “Synthetic Data Platform Mostly AI Lands $25M.”

Mostly AI is an Austrian startup that specializes in synthetic data for AI model testing and training. The company recently acquired $25 million in funding from Molten Ventures with plans to invest the funds to accelerate the industry. Mostly AI plans to hire more employees, create unbiased algorithms, and increase their presence in Europe and North America.

It is difficult for AI developers to round up comprehensive datasets because of privacy concerns. There is tons of data available for AI to learn from, but it might not be anonymous and it could be biased from the get go.

Mostly AI simulates real datasets by replicating the information for data value chains but removing the personal data points. The synthetic data is described as “good as the real thing” without violating privacy laws. The synthetic data algorithm works like other algorithms:

“The solution works by leveraging a state-of-the-art generative deep neural network with an in-built privacy mechanism. It learns valuable statistical patterns, structures, and variations from the original data and recreates these patterns using a population of fictional characters to give out a synthetic copy that is privacy compliant, de-biased, and just as useful as the original dataset – reflecting behaviors and patterns with up to 99% accuracy.”
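The “learn the patterns, then recreate them with fictional characters” idea can be shown with a toy. To be clear, this is not Mostly AI’s method: the quote describes a generative deep neural network, and the sketch below swaps in a plain bell curve fitted to one column of numbers. The function names and the ages are made up for illustration.

```python
import random
import statistics


def fit(real_values):
    """'Learn' the pattern: here, nothing more than the mean and the spread."""
    return statistics.mean(real_values), statistics.stdev(real_values)


def generate(params, n, seed=0):
    """Recreate the pattern as n fictional values drawn from a bell curve."""
    mu, sigma = params
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Made-up "real" ages standing in for personal data.
real_ages = [23, 35, 41, 29, 52, 38, 47, 31]

params = fit(real_ages)
synthetic_ages = generate(params, n=1000)

# Compare the averages of the real and the fictional populations.
print(round(statistics.mean(real_ages), 1), round(statistics.mean(synthetic_ages), 1))
```

None of the fictional ages belongs to a real person, and the averages should land close together. The catch is that the generator knows only what `fit` captured. Anything the model did not learn from the original data will not show up in the copy.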

Mostly AI states that their platform also accelerates the time it takes to access the datasets. They claim their technology reduces the wait time by 90%.

Demands for synthetic data are growing as the AI industry burgeons and there is a need for information to advance the technology. Efficient, acceptable error rates, objective methods: What could go wrong?

Whitney Grace, January 27, 2022

Facebook and Synthetic Data

October 13, 2021

What’s Facebook thinking about its data future?

A partial answer may be that the company is doing some contingency planning. When regulators figure out how to trim Facebook’s data hoovering, the company may have less primary data to mine, refine, and leverage.

The solution?

Synthetic data. The jargon means annotated data that computer simulations output. Run the model. Fiddle with the thresholds. Get good enough data.

How does one get a signal about Facebook’s interest in synthetic data?

According to Venture Beat, Facebook, the responsible social media company, acquired AI.Reverie.

Was this a straightforward deal? Sure, just via a Facebook entity called Dolores Acquisition Sub, Inc. If this sounds familiar, the social media leader may have taken its name from a motion picture called “Westworld.”

The write up states:

AI.Reverie — which competed with startups like Tonic, Delphix, Mostly AI, Hazy, Gretel.ai, and Cvedia, among others — has a long history of military and defense contracts. In 2019, the company announced a strategic alliance with Booz Allen Hamilton with the introduction of Modzy at Nvidia’s GTC DC conference. Through Modzy — a platform for managing and deploying AI models — AI.Reverie launched a weapons detection model that ostensibly could spot ammunition, explosives, artillery, firearms, missiles, and blades from “multiple perspectives.”

Booz, Allen may be kicking its weaker partners. Perhaps the wizards at the consulting firm should have purchased AI.Reverie. But Facebook aced out the century old other people’s business outfit. (Note: I used to labor in the BAH vineyards, and I feel sorry for the individuals who were not enthusiastic about acquiring AI.Reverie. Where did that bonus go?)

Several observations are warranted:

  1. Synthetic data is the ideal dating partner for Snorkel-type machine learning systems
  2. Some researchers believe that real data is better than synthetic data, but that is a fight like spats between those who love Windows and those who love Mac OSX
  3. The uptake of “good” enough data for smart statistical systems which aim for 60 percent or better “accuracy” appears to be a mini trend.

Worth watching?

Stephen E Arnold, October 13, 2021
