The Push for Synthetic Data: What about Poisoning and Bias? Not to Worry

October 6, 2022

Do you worry about data poisoning, the use of crafted data strings to cause numerical recipes to output craziness, or weaponized information shaped by a disaffected MBA big data developer sloshing with DynaPep?

No. Good. Enjoy the outputs.

Yes. Too bad. You lose.

For a rah rah, it’s sunny in Slough look at synthetic data, read “Synthetic Data Is the Safe, Low-Cost Alternative to Real Data That We Need.”

The subtitle is:

A new solution for data hungry AIs

And the sub-subtitle is:

Content provided by IBM and TNW.

Let’s check out what this IBM content marketing write up says:

One example is Task2Sim, an AI model built by the MIT-IBM Watson AI Lab that creates synthetic data for training classifiers. Rather than teaching the classifier to recognize one object at a time, the model creates images that can be used to teach multiple tasks. The scalability of this type of model makes collecting data less time consuming and less expensive for data hungry businesses.

What are the downsides of synthetic data? Downsides? Don’t be silly:

Synthetic data, however it is produced, offers a number of very concrete advantages over using real world data. First of all, it’s easier to collect way more of it, because you don’t have to rely on humans creating it. Second, the synthetic data comes perfectly labeled, so there’s no need to rely on labor intensive data centers to (sometimes incorrectly) label data. Third, it can protect privacy and copyright, as the data is, well, synthetic. And finally, and perhaps most importantly, it can reduce biased outcomes.

There is one, very small, almost minuscule issue stated in the write up; to wit:

As you might suspect, the big question regarding synthetic data is around the so-called fidelity — or how closely it matches real-world data. The jury is still out on this, but research seems to show that combining synthetic data with real data gives statistically sound results. This year, researchers from MIT and the MIT-IBM AI Watson Lab showed that an image classifier that was pretrained on synthetic data in combination with real data, performed as well as an image classifier trained exclusively on real data.

I loved the phrase “seems to show.” Seems is such a great verb. It “seems” almost accurate.

But what about that disaffected MBA developer fiddling with thresholds?

I know the answer to this question, “That will never happen.”

Okay, I am convinced. You know the “we need” thing.

Stephen E Arnold, October 6, 2022

Synthetic Data: Cheap, Like Fast Food

May 25, 2022

Fabricated data may well solve some of the privacy issues around healthcare-related machine learning, but what new problems might it create? The Wall Street Journal examines the technology in, “Anthem Looks to Fuel AI Efforts with Petabytes of Synthetic Data.” Reporter Isabelle Bousquette informs us Anthem CIO Anil Bhatt has teamed up with Google Cloud to build the synthetic data platform. Interesting choice, considering the health insurance company has been using AWS since 2017.

The article points out synthetic data can refer to either anonymized personal information or entirely fabricated data. Anthem’s effort involves the second type. Bousquette cites Bhatt as well as AI and automation expert Ritu Jyoti as she writes:

“Anthem said the synthetic data will be used to validate and train AI algorithms that identify things like fraudulent claims or abnormalities in a person’s health records, and those AI algorithms will then be able to run on real-world member data. Anthem already uses AI algorithms to search for fraud and abuse in insurance claims, but the new synthetic data platform will allow it to scale. Personalizing care for members and running AI algorithms that identify when they may require medical intervention is a more long-term goal, said Mr. Bhatt. In addition to alleviating privacy concerns, Ms. Jyoti said another advantage of synthetic data is that it can reduce biases that exist in real-world data sets. That said, she added, you can also end up with data sets that are worse than real-world ones. ‘The variation of the data is going to be very, very important,’ said Mr. Bhatt, adding that he believes the variation in the synthetic data will ultimately be better than the company’s real-world data sets.”

The article notes the use of synthetic data is on the rise. Increasing privacy and reducing bias both sound great, but that bit about potentially worse data sets is concerning. Bhatt’s assurance is pleasant enough, but how will we know whether his confidence pans out? Big corporations are not exactly known for their transparency.

Cynthia Murrell, May 25, 2022

Quick Question: Fabricated or Synthetic Data?

March 24, 2022

I read “Evidence of Fabricated Data in a Vitamin C trial by Paul E Marik et al in CHEST.” Non-reproducibility appears to be a function of modern statistical methods. Okay. The angle in this article is:

… within about 5 minutes of reading the study it became overwhelmingly clear that it is indeed research fraud and the data is (sic) fabricated.

Synthetic data are fabricated. Some big outfits are using machine-generated data loosely related to real-life data to save money.

Here’s my question:

What’s the difference between fabricated data and synthetic data?

I am leaning toward “not much.” One might argue that the motive in a research paper is tenure. In other applications, maybe the goal is just efficiency and its close friend, money. My larger concern is that embedding fabricated and/or synthetic data into applications may lead to some unexpected consequences. Hey, how about that targeting of a kinetic? Screwy ad targeting is one thing, but less benign situations can be easily conceptualized; for example, “We’re sorry. That smart car’s self-driving module did not detect your mom in the crosswalk.”

Stephen E Arnold, March 24, 2022

Synthetic Data Are Better Than Data from Real Life. Does Better Mean Cheaper?

March 22, 2022

I read “When It Comes To AI, Can We Ditch The Datasets?” The answer, as you may have surmised, is, “Absolutely.”

Synthetic data is like polystyrene. Great for building giant garbage islands and not so great for the environment. So trade offs. What’s the big deal?

The write up explains:

To circumvent some of the problems presented by datasets, MIT researchers developed a method for training a machine learning model that, rather than using a dataset, uses a special type of machine-learning model to generate extremely realistic synthetic data that can train another model for downstream vision tasks.

What can one do with made up data about real life? That’s an easy one to answer:

Once a generative model has been trained on real data, it can generate synthetic data that are so realistic they are nearly indistinguishable from the real thing.
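As a toy illustration of that claim (my sketch, not the MIT researchers’ actual method), one can “train” a trivial generative model by fitting a simple distribution to real points, then sample fresh synthetic points that are statistically very close to the originals:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: 500 two-dimensional points from some unknown process.
real = rng.multivariate_normal([3.0, -1.0], [[1.0, 0.4], [0.4, 2.0]], size=500)

# A trivial "generative model": estimate the mean and covariance of the
# real data, then sample fresh points from the fitted distribution.
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mu, cov, size=500)

# The synthetic sample reproduces the real data's first two moments closely,
# which is the narrow statistical sense of "nearly indistinguishable."
print(np.abs(synthetic.mean(axis=0) - real.mean(axis=0)))
```

A real generative model does far more than match two moments, of course; the point is only that sampled data can mimic real data without containing any of it.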

What can go wrong? According to the article, nothing.

Well, nothing might be too strong a word. The write up admits:

But he [a fake data wizard] cautions that there are some limitations to using generative models. In some cases, these models can reveal source data, which can pose privacy risks, and they could amplify biases in the datasets they are trained on if they aren’t properly audited.

Yeah, privacy and bias. The write up does not mention incorrect or off base guidance.

But that’s close enough for horseshoes for an expert hooked up with IBM Watson (yes, that Watson) and MIT (yes, the esteemed institution which was in financial thrall to one alleged human trafficker).

And Dr. Timnit Gebru’s concerns? Not mentioned. And what about the problems identified in Cathy O’Neill’s Weapons of Math Destruction? Not mentioned.

Hey, it was a short article. Synthetic data are a thing and they will help grade your child’s school work, determine who is a top performer, and invest your money so you can retire with no worries.

No worries, right.

Stephen E Arnold, March 22, 2022

Synthetic Data: The Future Because Real Data Is Too Inefficient

January 28, 2022

One of the biggest problems with AI advancement is the lack of comprehensive datasets. AI algorithms use datasets to learn how to interpret and understand information. The lack of datasets has resulted in biased aka faulty algorithms. The most notorious examples are “racist” photo recognition or light sensitivity algorithms that are unable to distinguish dark skin tones. VentureBeat shares that a new niche market has sprung up: “Synthetic Data Platform Mostly AI Lands $25M.”

Mostly AI is an Austrian startup that specializes in synthetic data for AI model testing and training. The company recently raised $25 million in funding from Molten Ventures with plans to invest the funds to accelerate the industry. Mostly AI plans to hire more employees, create unbiased algorithms, and increase its presence in Europe and North America.

It is difficult for AI developers to round up comprehensive datasets because of privacy concerns. There are tons of data available for AI to learn from, but they might not be anonymous and could be biased from the get-go.

Mostly AI simulates real datasets by replicating the information for data value chains while removing the personal data points. The synthetic data is described as “as good as the real thing” without violating privacy laws. The synthetic data algorithm works like other algorithms:

“The solution works by leveraging a state-of-the-art generative deep neural network with an in-built privacy mechanism. It learns valuable statistical patterns, structures, and variations from the original data and recreates these patterns using a population of fictional characters to give out a synthetic copy that is privacy compliant, de-biased, and just as useful as the original dataset – reflecting behaviors and patterns with up to 99% accuracy.”
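Mostly AI’s generative network is proprietary, but the “learn the patterns, emit fictional characters” idea can be caricatured in a few lines. This sketch fits per-column statistics only; a real system would also capture cross-column structure:

```python
import random
import statistics

random.seed(42)

# Toy "original" member records: a personal identifier plus two numeric fields.
original = [
    {"name": f"member_{i}", "age": random.gauss(45, 12), "claims": random.gauss(3, 1)}
    for i in range(1000)
]

# Learn per-column statistics -- a crude stand-in for the "statistical
# patterns, structures, and variations" a generative network would capture.
def fit_column(records, key):
    values = [r[key] for r in records]
    return statistics.mean(values), statistics.stdev(values)

age_mu, age_sd = fit_column(original, "age")
claims_mu, claims_sd = fit_column(original, "claims")

# Emit fictional characters: statistically similar rows with no link back
# to any real member, hence no personal data to leak.
synthetic = [
    {"name": f"fictional_{i}",
     "age": random.gauss(age_mu, age_sd),
     "claims": random.gauss(claims_mu, claims_sd)}
    for i in range(1000)
]

print(round(statistics.mean(r["age"] for r in synthetic), 1))
```

Even this caricature shows where the “up to 99% accuracy” claim has to earn its keep: matching marginal statistics is easy; preserving the joint behavior that makes data useful is the hard part.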

Mostly AI states that their platform also accelerates the time it takes to access the datasets. They claim their technology reduces the wait time by 90%.

Demand for synthetic data is growing as the AI industry burgeons and there is a need for information to advance the technology. Efficiency, acceptable error rates, objective methods: What could go wrong?

Whitney Grace, January 27, 2022

Facebook and Synthetic Data

October 13, 2021

What’s Facebook thinking about its data future?

A partial answer may be that the company is doing some contingency planning. When regulators figure out how to trim Facebook’s data hoovering, the company may have less primary data to mine, refine, and leverage.

The solution?

Synthetic data. The jargon means annotated data that computer simulations output. Run the model. Fiddle with the thresholds. Get good enough data.

How does one get a signal about Facebook’s interest in synthetic data?

According to Venture Beat, Facebook, the responsible social media company, acquired AI.Reverie.

Was this a straightforward deal? Sure, just via a Facebook entity called Dolores Acquisition Sub, Inc. If this sounds familiar, the social media leader may have borrowed the name from “Westworld.”

The write up states:

AI.Reverie — which competed with startups like Tonic, Delphix, Mostly AI, Hazy, and Cvedia, among others — has a long history of military and defense contracts. In 2019, the company announced a strategic alliance with Booz Allen Hamilton with the introduction of Modzy at Nvidia’s GTC DC conference. Through Modzy — a platform for managing and deploying AI models — AI.Reverie launched a weapons detection model that ostensibly could spot ammunition, explosives, artillery, firearms, missiles, and blades from “multiple perspectives.”

Booz Allen may be kicking itself. Perhaps the wizards at the consulting firm should have purchased AI.Reverie. But Facebook aced out the century-old other-people’s-business outfit. (Note: I used to labor in the BAH vineyards, and I feel sorry for the individuals who were not enthusiastic about acquiring AI.Reverie. Where did that bonus go?)

Several observations are warranted:

  1. Synthetic data is the ideal dating partner for Snorkel-type machine learning systems
  2. Some researchers believe that real data are better than synthetic data, but that is a fight like spats between those who love Windows and those who love Mac OS X
  3. The uptake of “good enough” data for smart statistical systems which aim for 60 percent or better “accuracy” appears to be a mini trend.

Worth watching?

Stephen E Arnold, October 13, 2021

Synthetic Datasets: Reality Bytes

February 5, 2017

Years ago I did a project for an outfit specializing in an esoteric math space based on mereology. No, I won’t define it. You can check out the explanation in the Stanford Encyclopedia of Philosophy. The idea is that sparse information can yield useful insights. Even better, if mathematical methods were used to populate missing cells in a data system, one could analyze the data as if it held more than probability-generated items. Then, when real-time data arrived to populate the sparse cells, the probability component would generate revised values for the cells still without data. Nifty idea, just tough to explain to outfits struggling to move freight or sell off-lease autos.
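The sparse-cell idea can be sketched in a few lines (my toy version, not that outfit’s system): estimate missing cells from the observed values so analysis can proceed, then let arriving real data displace the estimates and refine the rest:

```python
import statistics

# Sparse "data system": None marks cells with no observation yet.
table = [10.0, None, 12.0, None, 11.0]

# Populate missing cells from the observed distribution (here, simply the
# mean of known values) so the table can be analyzed as if complete.
def fill(cells):
    known = [c for c in cells if c is not None]
    estimate = statistics.mean(known)
    return [c if c is not None else estimate for c in cells]

filled = fill(table)

# When a real observation arrives, it displaces its estimate, and the
# remaining empty cells are re-estimated from the richer evidence.
table[1] = 14.0
refilled = fill(table)
print(filled[3], refilled[3])
```

A mean is the crudest possible estimator; the point is the workflow: analyze now with probability-generated cells, revise as reality checks in.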

I thought of this company’s software system when I read “Synthetic Datasets Are a Game Changer.” Once again youthful wizards happily invent the future even though some of the systems and methods have been around for decades. For more information about the approach, the journal articles and books of Dr. Zbigniew Michalewicz may be helpful.

The “Synthetic Datasets…” write up triggered some yellow highlighter activity. I found this statement interesting:

Google researchers went as far as to say that even mediocre algorithms received state-of-the-art results given enough data.

The idea is that algorithms can output “good enough” results when volumes of data are available to the number-munching routines.

I also noted:

there are recent successes using a new technique called ‘synthetic datasets’ that could see us overcome those limitations. This new type of dataset consists of images and videos that are solely rendered by computers based on various parameters or scenarios. The process through which those datasets are created fall into 2 categories: Photo realistic rendering and Scenario rendering for lack of better description.

The focus here is not on figuring out how to move nuclear fuel rods around a reactor core or adjusting coal fired power plant outputs to minimize air pollution. The synthetic databases have an application in image related disciplines.

The idea of using rendering engines to create images for facial recognition or for video games is interesting. The write up mentions a number of companies pushing forward in this field; for example, Cvedia.

However, NuTech’s methods were used to populate databases of fact. I think the use of synthetic methods has a bright future. Oh, NuTech was acquired by Netezza. Guess what company owns the prescient NuTech Solutions’ technology? Give up? IBM, a company which has potent capabilities but does the most unusual things with those important systems and methods.

I suppose that is one reason why old wine looks like new IBM Holiday Spirit rum.

Stephen E Arnold, February 5, 2017

A Data Taboo: Poisoned Information But We Do Not Discuss It Unless… Lawyers

October 25, 2022

In a conference call yesterday (October 24, 2022), I mentioned one of my laws of online information; specifically, digital information can be poisoned. The venom can be administered by a numerically adept MBA or a junior college math major taking shortcuts because data validation is hard work. The person on the call was mildly surprised because the notion of open source and closed source “facts” intentionally weaponized is an uncomfortable subject. I think the person with whom I was speaking blinked twice when I pointed out what should be obvious to most individuals in the intelware business. Here’s the pointy end of reality:

Most experts and many of the content processing systems assume that data are good enough. Plus, with lots of data any irregularities are crunched down by steamrolling mathematical processes.

The problem is that articles like “Biotech Firm Enochian Says Co Founder Fabricated Data” make it clear that MBA math as well as experts hired to review data can be caught with their digital clothing in a pile. These folks are, in effect, sitting naked in a room with people who want to make money. Nakedness from being dead wrong can lead to some career turbulence; for example, prison.

The write up reports:

Enochian BioSciences Inc. has sued co-founder Serhat Gumrukcu for contractual fraud, alleging that it paid him and his husband $25 million based on scientific data that Mr. Gumrukcu altered and fabricated.

The article does not explain precisely how the data were “fabricated.” However, someone with Excel skills, access to an article like “Top 3 Python Packages to Generate Synthetic Data,” or a similar gig work site can get some data generated at low cost. Who will know? Most MBAs’ math and statistics classes focus on meeting targets in order to get a bonus or amp up a “service” fee for clicking a mouse. Experts who can figure out fiddled data sets take the time only if they are motivated by professional jealousy or cold cash. Who blew the whistle on Theranos? A data analyst? Nope. A “real” journalist who interviewed people who thought something was goofy in the data.
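How easy is “trivially easy”? Here is a sketch, with every number invented, of deciding the conclusion first and generating the “data” afterward:

```python
import random
import statistics

random.seed(7)

# Fabricating a "clinical result" is embarrassingly easy: pick the
# conclusion first, then generate noise around it. Nothing below touches
# a real patient or a real measurement.
control   = [random.gauss(50.0, 5.0) for _ in range(200)]  # invented baseline scores
treatment = [random.gauss(45.0, 5.0) for _ in range(200)]  # invented "improvement"

# The fabricated dataset dutifully "shows" the effect its author wanted.
print(statistics.mean(control) > statistics.mean(treatment))
```

Ten lines, no lab, no patients, and a tidy effect size chosen in advance. That is the asymmetry: fabrication is cheap; detection is expensive.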

My point is that it is trivially easy to whip up data to support a run at tenure or at a group of MBAs desperate to fund the next big thing as the big tech house of cards wobbles in the winds of change.

Several observations:

  1. The threat of bad or fiddled data is rising. My team is checking a smart output by hand because we simply cannot trust what a slick, new intelware system outputs. Yep, trust is in short supply among my research team.
  2. Data from assorted open and closed sources are accepted as is, without individual inspection. The attitude is that the law of big numbers, the sheer volume of data, or the magic of cross correlation will minimize errors. Sure these processes will, but what if the data are weaponized and crafted to avoid detection? The answer is to check each item. How’s that for a cost center?
  3. Uninformed individuals (yep, I am including some data scientists, MBAs, and hawkers of data from app users) don’t know how to identify weaponized data nor know what to do when such data are identified.

Does this suggest that a problem exists? If yes, what’s the fix?

[a] Ignore the problem

[b] Trust Google-like outfits who seek to be the source for synthetic data

[c] Rely on MBAs

[d] Rely on jealous colleagues in the statistics department with limited tenure opportunities

[e] Blink.

Pick one.

Stephen E Arnold, October 25, 2022

How Apps Use Your Data: Just a Half Effort

April 28, 2022

I read a quite enthusiastic article called “Google Forces Developers to Provide Details on How Apps Use Your Data.” The main idea is virtue signaling with one of those flashing airport beacons. These can be seen through certain types of “info fog,” just not today’s info fog. The digital climate has a number of characteristics. One is obfuscation.

The write up states:

… the Data safety feature is now on the Google Play Store and aims to bolster security by providing users details on how an app is using their information. Developers are required to complete this section for their apps by July 20, and will need to provide updates if they change their data handling practices, too. 

That sounds encouraging. Google has been at the data harvesting combine controls for more than two decades. Now app developers have to provide information about their use of an app user’s data and presumably flip on the yellow fog lights for what the folks who have access to those data via an API or a bulk transfer are doing. Amusing thought: forced regulation after 240 months on the info highway.

However, what app users do with data is half of the story, maybe less. The interesting question to me is, “What does Google do with those data?”

The Data Safety initiative does not focus on the Google. Data Safety shifts the attention to app developers, presumably some of whom have crafty ideas. My interest is Google’s own data surfing; for example, ad diffusion, and my fave, Snorkelization and synthetic “close enough for horseshoes” data. Real data may be too “real” for some purposes.

After a couple of decades, Google is taking steps toward a data destination. I just don’t know where that journey is taking people.

Stephen E Arnold, April 28, 2022

Why Be Like ClearView AI? Google Fabs Data the Way TSMC Makes Chips

April 8, 2022

Machine learning requires data. Lots of data. Datasets can set AI trainers back millions of dollars, and even that does not guarantee a collection free of problems like bias and privacy issues. Researchers at MIT have developed another way, at least when it comes to image identification. The World Economic Forum reports, “These AI Tools Are Teaching Themselves to Improve How they Classify Images.” Of course, one must start somewhere, so a generative model is first trained on some actual data. From there, it generates synthetic data that, we’re told, is almost indistinguishable from the real thing. Writer Adam Zewe cites the paper’s lead author Ali Jahanian as he emphasizes:

“But generative models are even more useful because they learn how to transform the underlying data on which they are trained, he says. If the model is trained on images of cars, it can ‘imagine’ how a car would look in different situations — situations it did not see during training — and then output images that show the car in unique poses, colors, or sizes. Having multiple views of the same image is important for a technique called contrastive learning, where a machine-learning model is shown many unlabeled images to learn which pairs are similar or different. The researchers connected a pretrained generative model to a contrastive learning model in a way that allowed the two models to work together automatically. The contrastive learner could tell the generative model to produce different views of an object, and then learn to identify that object from multiple angles, Jahanian explains. ‘This was like connecting two building blocks. Because the generative model can give us different views of the same thing, it can help the contrastive method to learn better representations,’ he says.”
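A miniature of that “two building blocks” pairing, with a random rotation standing in for the pretrained generative model and cosine similarity standing in for the contrastive learner (my sketch, not the researchers’ code):

```python
import numpy as np

rng = np.random.default_rng(3)

# A stand-in "generative model": given an object's latent vector, emit a
# fresh "view" by applying a small random rotation -- the role the
# pretrained generator plays for the contrastive learner.
def generate_view(latent):
    theta = rng.uniform(-0.3, 0.3)  # a new pose, color, size... here, just an angle
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return rot @ latent

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two made-up "objects" with clearly different latent vectors.
cat = np.array([1.0, 0.0])
dog = np.array([0.0, 1.0])

# The contrastive objective in miniature: two generated views of the same
# object should score as more similar than views of different objects.
positive = cosine(generate_view(cat), generate_view(cat))
negative = cosine(generate_view(cat), generate_view(dog))
print(positive > negative)
```

The real system learns the representation rather than assuming one, but the loop is the same: the generator manufactures views, and the contrastive side pulls same-object views together and pushes different objects apart.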

Ah, algorithmic teamwork. Another advantage of this method is the nearly infinite samples the model can generate, since more samples (usually) make for a better trained AI. Jahanian also notes once a generative model has created a repository of synthetic data, that resource can be posted online for others to use. The team also hopes to use their technique to generate corner cases, which often cannot be learned from real data sets and are especially troublesome when it comes to potentially dangerous uses like self-driving cars. If this hope is realized, it could be a huge boon.

This all sounds great, but what if—just a minor if—the model is off base? And, once this tech moves out of the laboratory, how would we know? The researchers acknowledge a couple other limitations. For one, their generative models occasionally reveal source data, which negates the privacy advantage. Furthermore, any biases in the limited datasets used for the initial training will be amplified unless the model is “properly audited.” It seems like transparency, which somehow remains elusive in commercial AI applications, would be crucial. Perhaps the researchers have an idea how to solve that riddle.

Funding for the project was supplied, in part, by the MIT-IBM Watson AI Lab, the United States Air Force Research Laboratory, and the United States Air Force Artificial Intelligence Accelerator.

Cynthia Murrell, April 8, 2022
