Cognitive Blind Spot 1: Can You Identify Synthetic Data? Better Learn.

October 5, 2023

Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.

It has been a killer with the back-to-back trips to Europe and then to the intellectual hub of the old-fashioned America. In France, I visited a location allegedly the office of a company which “owns” the domain rrrrrrrrrrr.com. No luck. Fake address. I then visited a semi-sensitive area in Paris, walking around in the confused fog only a 78 year old can generate. My goal was to spot a special type of surveillance camera designed to provide data to a smart software system. The idea is that the images can be monitored through time so a vehicle making frequent passes of a structure can be flagged, its number tag read, and a bit of thought given to answer the question, “Why?” I visited with a friend and big brain who was one of the technical keystones of an advanced search system. He gave me his most recent book and I paid for my Orangina. Exciting.

[MidJourney image: executives reviewing financial documents]

One executive tells his boss, “Sir, our team of sophisticated experts reviewed these documents. The documents passed scrutiny.” One of the “smartest people in the room” asks, “Where are we going for lunch today?” Thanks, MidJourney. You do understand executive stereotypes, don’t you?

On the flights, I did some thinking about synthetic data. I am not sure that most people can provide a definition which will embrace the Google’s efforts in the money-saving land of synthetic data. I don’t think too many people know about Charlie Javice’s use of synthetic data to whip up JPMC’s enthusiasm for her company Frank Financial. I don’t think most people understand that when a phrase is typed into the Twitch AI Jesus, the software will output a video of mostly crazy talk along with some Christian lingo.

The purpose of this short blog post is to present an example of synthetic data and conclude by revisiting the question, “Can You Identify Synthetic Data?” The article I want to use as a hook for this essay is from Fortune Magazine. I love that name, and I think the wolves of Wall Street find it euphonious as well. Here’s the title: “Delta Is Fourth Major U.S. Airline to Find Fake Jet Aircraft Engine Parts with Forged Airworthiness Documents from U.K. Company.”

The write up states:

Delta Air Lines Inc. has discovered unapproved components in “a small number” of its jet aircraft engines, becoming the latest carrier and fourth major US airline to disclose the use of fake parts.  The suspect components — which Delta declined to identify — were found on an unspecified number of its engines, a company spokesman said Monday. Those engines account for less than 1% of the more than 2,100 power plants on its mainline fleet, the spokesman said. 

Okay, bad parts can fail. If the failure is in a critical component of a jet engine, the aircraft could — note that I am using the word could — experience a catastrophic failure. Translating catastrophic into more colloquial lingo, the sentence means catch fire and crash or something slightly less terrible; namely, catch fire, explode, eject metal shards into the tail assembly, or make a loud noise and emit smoke. Exciting, just not terminal.

I don’t want to get into how the synthetic or fake data made its way through the UK company, the UK bureaucracy, the Delta procurement process, and into the hands of the mechanics working in the US or offshore. The fake data did elude scrutiny for some reason. With money being of paramount importance, my hunch is that saving some money played a role.

If organizations cannot spot fake data when it relates to a physical and mission-critical component, how will organizations deal with fake data generated by smart software? The smart software can get it wrong because an engineer-programmer screwed up his or her math or because the complex web of algorithms generates unanticipated behaviors from dependencies no one knew to check and validate.

What happens when a computer, which many people assume is “always” more right than a human, says, “Here’s the answer”? Many humans will skip the hard work because they are in a hurry, have no appetite for grunt work, or are scheduled by a Microsoft calendar to do something else when the quality assurance testing is supposed to take place.

Let’s go back to the question in the title of the blog post, “Can You Identify Synthetic Data?”

I don’t want to forget this part of the title, “Better learn.”

JPMC paid out more than $100 million in November 2022 because some of the smartest guys in the room weren’t that smart. But get this. JPMC is a big, rich bank. People who could die because of synthetic data are a different kettle of fish. Yeah, that’s what I thought about as I flew Delta back to the US from Paris. At the time, I thought Delta had not fallen prey to the scam.

I was wrong. Hence, I “better learn” myself.

Stephen E Arnold, October 5, 2023

A Googley Rah Rah for Synthetic Data

April 27, 2023


I want to keep this short. I know from experience that most people don’t think too much about synthetic data. The idea is important, but other concepts are important and no one really cares too much. When was the last time Euler’s Number came up at lunch?

A gaggle of Googlers extols the virtues of synthetic data in a 19-page ArXiv document called “Synthetic Data from Diffusion Models Improves ImageNet Classification.” The main idea is that data derived from “real” data are an expedient way to improve some indexing tasks.

I am not sure that a quote from the paper will do much to elucidate this facet of the generative model world. The paper includes charts, graphs, references to math, footnotes, a few email addresses, some pictures, wonky jargon, and this conclusion:

And we have shown improvements to ImageNet classification accuracy extend to large amounts of generated data, across a range of ResNet and Transformer-based models.

The specific portion of this quote which is quite important in my experience is the segment “across a range of ResNet and Transformer-based models.” Translating to Harrod’s Creek lingo, I think the wizards are saying, “Synthetic data is really good for text too.”

What’s bubbling beneath the surface of this archly-written paper? Here are my answers to this question:

  1. Synthetic data are a heck of a lot cheaper to generate for model training; therefore, embrace “good enough” and move forward. (Think profits and bonuses.)
  2. Synthetic data can be produced and updated more easily than fooling around with “real” data. Assembling training sets, running tests, deploying, and reprocessing are time sucks. (There is more work to do than humanoids to do it when it comes to training, which is needed frequently for some applications.)
  3. Synthetic datasets can be smaller. Even baby Satan aka Sam Altman is down with synthetic data. Why? Elon could only buy so many nVidia processing units. Thus finding a way to train models with synthetic data works around a supply bottleneck.
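For readers who want the mechanics rather than the marketing, here is a minimal Python sketch of the real-plus-synthetic recipe. The class names, sample counts, and the stand-in “generative model” (a per-class Gaussian fitted to the real samples) are all invented for illustration; the paper’s actual pipeline uses diffusion models and ImageNet.

```python
import random
import statistics

random.seed(0)

# "Real" data: two toy classes of one-dimensional measurements.
def make_real(mean, n):
    return [random.gauss(mean, 1.0) for _ in range(n)]

real = {"cat": make_real(0.0, 20), "dog": make_real(4.0, 20)}

# Stand-in "generative model": fit a Gaussian per class, then sample from it.
def synthesize(samples, n):
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    return [random.gauss(mu, sigma) for _ in range(n)]

# Augment each class: 20 real samples plus 80 synthetic ones.
train = {label: xs + synthesize(xs, 80) for label, xs in real.items()}

# Nearest-centroid classifier trained on the augmented mix.
centroids = {label: statistics.mean(xs) for label, xs in train.items()}

def classify(x):
    return min(centroids, key=lambda label: abs(x - centroids[label]))

print(classify(-0.5), classify(4.5))  # cat dog
```

The point of the toy: the classifier never knows which training samples were real and which were manufactured, which is exactly why “good enough” is the operative phrase.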

My summary of the Googlers’ article is briefer than the original: better, faster, cheaper.

You don’t have to pick one. Just believe the Google. Who does not trust the Google? Why not buy synthetic data and ready-to-deploy models for your next AutoGPT product? Google’s approach worked like a champ for online ads. Therefore, Google’s approach will work for your smart software. Trust Google.

Stephen E Arnold, April 27, 2023

Synthetic Data: Yes, They Are a Thing

March 13, 2023

“Real” data — that is, data generated by humans — are expensive to capture, normalize, and manipulate. But, those “real” data are important. Unfortunately, some companies have sucked up real data and integrated those items into products and services. Now regulators are awakening from a decades-long slumber and taking a look into the actions of certain data companies. More importantly, a few big data outfits are aware of [a] the costs and [b] the risks of real data.

Enter synthetic data.

If you are unfamiliar with the idea, navigate to “What is Synthetic Data? The Good, the Bad, and the Ugly.” The article states:

The privacy engineering community can help practitioners and stakeholders identify the use cases where synthetic data can be used safely, perhaps even in a semi-automated way. At the very least, the research community can provide actionable guidelines to understand the distributions, types of data, tasks, etc. where we could achieve reasonable privacy-utility tradeoffs via synthetic data produced by generative models.

Helpful, correct?

The article does not point out two things which I find of interest.

First, the amount of money a company can earn by operating efficient synthetic data factories is likely to be substantial. Like other digital products, the upside can be profitable and give the “owner” of the synthetic data market an IBM-type of old-school lock-in.

Second, synthetic data can be weaponized, either intentionally via data poisoning or via algorithm shaping.
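A minimal, entirely hypothetical sketch of the poisoning worry: a few attacker-crafted records shift a mean-plus-three-sigma anomaly threshold, and a value that should be flagged sails through. The numbers are invented.

```python
import statistics

# Clean sensor-style readings and a handful of attacker-inserted records.
clean = [10.0, 10.5, 9.8, 10.2, 10.1, 9.9, 10.3, 10.0]
poison = [25.0, 26.0, 27.0]  # crafted "synthetic" records

# Anomaly threshold: mean plus three sample standard deviations.
def threshold(data):
    return statistics.mean(data) + 3 * statistics.stdev(data)

suspicious = 14.0
print(suspicious > threshold(clean))           # flagged on clean data
print(suspicious > threshold(clean + poison))  # slips past after poisoning
```

Three planted rows inflate both the mean and the standard deviation, so the threshold balloons and the genuinely odd reading looks normal.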

I just wanted to point out that a useful essay does not explore what may be two important attributes of synthetic data. Will regulators rise to the occasion? Unlikely.

Stephen E Arnold, March 13, 2023

The Push for Synthetic Data: What about Poisoning and Bias? Not to Worry

October 6, 2022

Do you worry about data poisoning, the use of crafted data strings to cause numerical recipes to output craziness, and weaponized information shaped by a disaffected MBA big data developer sloshing with DynaPep?

No. Good. Enjoy the outputs.

Yes. Too bad. You lose.

For a rah rah, it’s sunny in Slough look at synthetic data, read “Synthetic Data Is the Safe, Low-Cost Alternative to Real Data That We Need.”

The sub title is:

A new solution for data hungry AIs

And the sub sub title is:

Content provided by IBM and TNW.

Let’s check out what this IBM content marketing write up says:

One example is Task2Sim, an AI model built by the MIT-IBM Watson AI Lab that creates synthetic data for training classifiers. Rather than teaching the classifier to recognize one object at a time, the model creates images that can be used to teach multiple tasks. The scalability of this type of model makes collecting data less time consuming and less expensive for data hungry businesses.

What are the downsides of synthetic data? Downsides? Don’t be silly:

Synthetic data, however it is produced, offers a number of very concrete advantages over using real world data. First of all, it’s easier to collect way more of it, because you don’t have to rely on humans creating it. Second, the synthetic data comes perfectly labeled, so there’s no need to rely on labor intensive data centers to (sometimes incorrectly) label data. Third, it can protect privacy and copyright, as the data is, well, synthetic. And finally, and perhaps most importantly, it can reduce biased outcomes.

There is one, very small, almost minuscule issue stated in the write up; to wit:

As you might suspect, the big question regarding synthetic data is around the so-called fidelity — or how closely it matches real-world data. The jury is still out on this, but research seems to show that combining synthetic data with real data gives statistically sound results. This year, researchers from MIT and the MIT-IBM AI Watson Lab showed that an image classifier that was pretrained on synthetic data in combination with real data, performed as well as an image classifier trained exclusively on real data.

I loved the “seems to show” phrase I put in bold face. Seems is such a great verb. It “seems” almost accurate.

But what about that disaffected MBA developer fiddling with thresholds?

I know the answer to this question, “That will never happen.”

Okay, I am convinced. You know the “we need” thing.

Stephen E Arnold, October 6, 2022

Synthetic Data: Cheap, Like Fast Food

May 25, 2022

Fabricated data may well solve some of the privacy issues around healthcare-related machine learning, but what new problems might it create? The Wall Street Journal examines the technology in, “Anthem Looks to Fuel AI Efforts with Petabytes of Synthetic Data.” Reporter Isabelle Bousquette informs us Anthem CIO Anil Bhatt has teamed up with Google Cloud to build the synthetic data platform. Interesting choice, considering the health insurance company has been using AWS since 2017.

The article points out synthetic data can refer to either anonymized personal information or entirely fabricated data. Anthem’s effort involves the second type. Bousquette cites Bhatt as well as AI and automation expert Ritu Jyoti as she writes:

“Anthem said the synthetic data will be used to validate and train AI algorithms that identify things like fraudulent claims or abnormalities in a person’s health records, and those AI algorithms will then be able to run on real-world member data. Anthem already uses AI algorithms to search for fraud and abuse in insurance claims, but the new synthetic data platform will allow it to scale. Personalizing care for members and running AI algorithms that identify when they may require medical intervention is a more long-term goal, said Mr. Bhatt. In addition to alleviating privacy concerns, Ms. Jyoti said another advantage of synthetic data is that it can reduce biases that exist in real-world data sets. That said, she added, you can also end up with data sets that are worse than real-world ones. ‘The variation of the data is going to be very, very important,’ said Mr. Bhatt, adding that he believes the variation in the synthetic data will ultimately be better than the company’s real-world data sets.”

The article notes the use of synthetic data is on the rise. Increasing privacy and reducing bias both sound great, but that bit about potentially worse data sets is concerning. Bhatt’s assurance is pleasant enough, but how will we know whether his confidence pans out? Big corporations are not exactly known for their transparency.

Cynthia Murrell, May 25, 2022

Quick Question: Fabricated or Synthetic Data?

March 24, 2022

I read “Evidence of Fabricated Data in a Vitamin C trial by Paul E Marik et al in CHEST.” Non-reproducibility appears to be a function of modern statistical methods. Okay. The angle in this article is:

… within about 5 minutes of reading the study it became overwhelmingly clear that it is indeed research fraud and the data is (sic) fabricated.

Synthetic data are fabricated. Some big outfits are into using machine generated data sort of related to real life data to save money.

Here’s my question:

What’s the difference between fabricated data and synthetic data?

I am leaning to “not much.” One might argue that the motive in a research paper is tenure. In other applications, maybe the goal is just efficiency and its close friend, money. My larger concern is that embedding fabricated and / or synthetic data into applications may lead to some unexpected consequences. Hey, how about that targeting of a kinetic? Screwy ad targeting is one thing, but less benign situations can be easily conceptualized; for example, “We’re sorry. That smart car self-driving module did not detect your mom in the crosswalk.”

Stephen E Arnold, March 24, 2022

Synthetic Data Are Better Than Data from Real Life. Does Better Mean Cheaper?

March 22, 2022

I read “When It Comes To AI, Can We Ditch The Datasets?” The answer, as you may have surmised, is, “Absolutely.”

Synthetic data is like polystyrene: great for building giant garbage islands and not so great for the environment. So, trade-offs. What’s the big deal?

The write up explains:

To circumvent some of the problems presented by datasets, MIT researchers developed a method for training a machine learning model that, rather than using a dataset, uses a special type of machine-learning model to generate extremely realistic synthetic data that can train another model for downstream vision tasks.

What can one do with made up data about real life? That’s an easy one to answer:

Once a generative model has been trained on real data, it can generate synthetic data that are so realistic they are nearly indistinguishable from the real thing.

What can go wrong? According to the article, nothing.

Well, nothing might be too strong a word. The write up admits:

But he [a fake data wizard] cautions that there are some limitations to using generative models. In some cases, these models can reveal source data, which can pose privacy risks, and they could amplify biases in the datasets they are trained on if they aren’t properly audited.

Yeah, privacy and bias. The write up does not mention incorrect or off base guidance.
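The privacy caveat is easy to make concrete. Here is a deliberately naive “generative model” (pure Python, invented records) that has memorized its training set; its “synthetic” output is a real record verbatim, which is the leak the write up gestures at.

```python
import random

random.seed(1)

# Invented, toy training records containing personal details.
train_records = ["alice,1987,diabetic", "bob,1990,healthy", "carol,1975,asthma"]

# A "generative model" with zero generalization: it only replays its
# training data, so every sample it emits is a real record.
def overfit_sample():
    return random.choice(train_records)

leaked = overfit_sample()
print(leaked in train_records)  # True: the "synthetic" record is a real one
```

Real generative models are far less crude, but the failure mode is the same in miniature: memorization turns a privacy tool into a privacy leak unless the model is audited.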

But that’s close enough for nothing for an expert hooked up with IBM Watson (yes, that Watson) and MIT (yes, the esteemed institution which was in financial thrall to one alleged human trafficker).

And Dr. Timnit Gebru’s concerns? Not mentioned. And what about the problems identified in Cathy O’Neill’s Weapons of Math Destruction? Not mentioned.

Hey, it was a short article. Synthetic data are a thing and they will help grade your child’s school work, determine who is a top performer, and invest your money so you can retire with no worries.

No worries, right.

Stephen E Arnold, March 22, 2022

Synthetic Data: The Future Because Real Data Is Too Inefficient

January 28, 2022

One of the biggest problems with AI advancement is the lack of comprehensive datasets. AI algorithms use datasets to learn how to interpret and understand information. The lack of datasets has resulted in biased aka faulty algorithms. The most notorious examples are “racist” photo recognition or light sensitivity algorithms that are unable to distinguish dark skin tones. VentureBeat shares that a new niche market has sprung up: “Synthetic Data Platform Mostly AI Lands $25M.”

Mostly AI is an Austrian startup that specializes in synthetic data for AI model testing and training. The company recently acquired $25 million in funding from Molten Ventures and plans to invest the funds to accelerate the industry. Mostly AI plans to hire more employees, create unbiased algorithms, and increase its presence in Europe and North America.

It is difficult for AI developers to round up comprehensive datasets because of privacy concerns. There are tons of data available for AI to learn from, but it might not be anonymous, and it could be biased from the get-go.

Mostly AI simulates real datasets by replicating the information for data value chains but removing the personal data points. The synthetic data is described as “good as the real thing” without violating privacy laws. The synthetic data algorithm works like other algorithms:

“The solution works by leveraging a state-of-the-art generative deep neural network with an in-built privacy mechanism. It learns valuable statistical patterns, structures, and variations from the original data and recreates these patterns using a population of fictional characters to give out a synthetic copy that is privacy compliant, de-biased, and just as useful as the original dataset – reflecting behaviors and patterns with up to 99% accuracy.”
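Stripped of the marketing lingo, the pattern described in the quote can be sketched in a few lines of Python: learn the statistics of real records, then emit fictional records that follow those statistics without the original identifiers. All names and numbers below are invented, and a real system would model joint distributions, not just the marginals.

```python
import random
import statistics

random.seed(2)

# Invented "real" records: (name, age, income).
real_rows = [("Ann", 34, 51000), ("Ben", 41, 64000),
             ("Cal", 29, 47000), ("Dee", 52, 80000)]

ages = [r[1] for r in real_rows]
incomes = [r[2] for r in real_rows]
real_names = {r[0] for r in real_rows}

# Emit a fictional record: made-up identifier plus values sampled from
# Gaussians fitted to the real columns.
def synthetic_row(i):
    age = round(random.gauss(statistics.mean(ages), statistics.stdev(ages)))
    income = round(random.gauss(statistics.mean(incomes), statistics.stdev(incomes)))
    return (f"person_{i}", age, income)

synthetic = [synthetic_row(i) for i in range(5)]
print(all(name not in real_names for name, _, _ in synthetic))  # True
```

The fictional rows track the real columns’ statistics while carrying no real identifier, which is the “privacy compliant and just as useful” claim in its simplest possible form.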

Mostly AI states that its platform also accelerates access to the datasets. The company claims its technology reduces the wait time by 90%.

Demands for synthetic data are growing as the AI industry burgeons and there is a need for information to advance the technology. Efficient, acceptable error rates, objective methods: What could go wrong?

Whitney Grace, January 27, 2022

Facebook and Synthetic Data

October 13, 2021

What’s Facebook thinking about its data future?

A partial answer may be that the company is doing some contingency planning. When regulators figure out how to trim Facebook’s data hoovering, the company may have less primary data to mine, refine, and leverage.

The solution?

Synthetic data. The jargon means annotated data that computer simulations output. Run the model. Fiddle with the thresholds. Get good enough data.
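The “run the model, fiddle with the thresholds” loop can be sketched literally. In this hypothetical Python toy, a simulator emits readings and a tunable threshold supplies the annotation, so moving the threshold changes what the “data” says.

```python
import random

random.seed(3)

# A simulation that outputs annotated (labeled) records. The threshold
# parameter is the knob being "fiddled"; readings and labels are invented.
def simulate(n, threshold):
    rows = []
    for _ in range(n):
        reading = random.uniform(0, 1)
        label = "anomaly" if reading > threshold else "normal"
        rows.append((reading, label))
    return rows

strict = simulate(1000, threshold=0.9)
loose = simulate(1000, threshold=0.5)

def count(rows):
    return sum(1 for _, label in rows if label == "anomaly")

print(count(strict) < count(loose))  # True: the knob rewrites the annotations
```

Same simulator, different threshold, different “truth”: a compact illustration of why “good enough” data depends on who set the knobs.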

How does one get a signal about Facebook’s interest in synthetic data?

According to Venture Beat, Facebook, the responsible social media company, acquired AI.Reverie.

Was this a straightforward deal? Sure, just via a Facebook entity called Dolores Acquisition Sub, Inc. If the name sounds familiar, the social media leader may have taken it from the motion picture “Westworld.”

The write up states:

AI.Reverie — which competed with startups like Tonic, Delphix, Mostly AI, Hazy, Gretel.ai, and Cvedia, among others — has a long history of military and defense contracts. In 2019, the company announced a strategic alliance with Booz Allen Hamilton with the introduction of Modzy at Nvidia’s GTC DC conference. Through Modzy — a platform for managing and deploying AI models — AI.Reverie launched a weapons detection model that ostensibly could spot ammunition, explosives, artillery, firearms, missiles, and blades from “multiple perspectives.”

Booz Allen may be kicking itself. Perhaps the wizards at the consulting firm should have purchased AI.Reverie. But Facebook aced out the century-old other people’s business outfit. (Note: I used to labor in the BAH vineyards, and I feel sorry for the individuals who were not enthusiastic about acquiring AI.Reverie. Where did that bonus go?)

Several observations are warranted:

  1. Synthetic data is the ideal dating partner for Snorkel-type machine learning systems
  2. Some researchers believe that real data is better than synthetic data, but that is a fight like the spats between those who love Windows and those who love Mac OS X
  3. The uptake of “good” enough data for smart statistical systems which aim for 60 percent or better “accuracy” appears to be a mini trend.

Worth watching?

Stephen E Arnold, October 13, 2021

Synthetic Datasets: Reality Bytes

February 5, 2017

Years ago I did a project for an outfit specializing in an esoteric math space based on mereology. No, I won’t define it. You can check out the explanation in the Stanford Encyclopedia of Philosophy. The idea is that sparse information can yield useful insights. Even better, if mathematical methods were used to populate missing cells in a data system, one could analyze the data as if it were more than probability-generated items. Then when real time data arrived to populate the sparse cells, the probability component would generate revised data for the cells without data. Nifty idea, just tough to explain to outfits struggling to move freight or sell off-lease autos.
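The sparse-cell idea can be sketched with a trivial example: fill a missing cell with a probability-based estimate (here, just the column mean), then let a real observation displace the estimate when it arrives. The values are invented, and the actual methods were far more sophisticated than a column mean.

```python
import statistics

# A sparse column: None marks cells with no data yet.
column = [12.0, None, 15.0, None, 14.0]

# Populate missing cells with an estimate derived from the known cells.
def impute(cells):
    known = [c for c in cells if c is not None]
    fill = statistics.mean(known)
    return [c if c is not None else fill for c in cells]

estimated = impute(column)
print(estimated[1])  # the column-mean estimate stands in for the missing cell

# Real-time data arrives for cell 1: the observation displaces the estimate.
column[1] = 13.2
refreshed = impute(column)
print(refreshed[1])  # 13.2
```

Analysis can proceed on the filled-in column immediately, and the answers improve as real observations replace the probability-generated stand-ins.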

I thought of this company’s software system when I read “Synthetic Datasets Are a Game Changer.” Once again youthful wizards happily invent the future even though some of the systems and methods have been around for decades. For more information about the approach, the journal articles and books of Dr. Zbigniew Michalewicz may be helpful.

The “Synthetic Datasets…” write up triggered some yellow highlighter activity. I found this statement interesting:

Google researchers went as far as to say that even mediocre algorithms received state-of-the-art results given enough data.

The idea is that algorithms can output “good enough” results when large volumes of data are available to the number-munching algorithms.

I also noted:

there are recent successes using a new technique called ‘synthetic datasets’ that could see us overcome those limitations. This new type of dataset consists of images and videos that are solely rendered by computers based on various parameters or scenarios. The process through which those datasets are created fall into 2 categories: Photo realistic rendering and Scenario rendering for lack of better description.

The focus here is not on figuring out how to move nuclear fuel rods around a reactor core or adjusting coal-fired power plant outputs to minimize air pollution. The synthetic datasets have an application in image-related disciplines.

The idea of using rendering engines to create images for facial recognition or for video games is interesting. The write up mentions a number of companies pushing forward in this field; for example, Cvedia.

However, NuTech’s methods were used to populate databases of fact. I think the use of synthetic methods has a bright future. Oh, NuTech was acquired by Netezza. Guess what company owns the prescient NuTech Solutions technology? Give up? IBM, a company which has potent capabilities but does the most unusual things with those important systems and methods.

I suppose that is one reason why old wine looks like new IBM Holiday Spirit rum.

Stephen E Arnold, February 5, 2017
