Synthetic Datasets: Reality Bytes

February 5, 2017

Years ago I did a project for an outfit specializing in an esoteric math space based on mereology. No, I won’t define it. You can check out the explanation in the Stanford Encyclopedia of Philosophy. The idea is that sparse information can yield useful insights. Even better, if mathematical methods were used to populate missing cells in a data system, one could analyze the data as if it contained more than probability-generated items. Then, when real time data arrived to fill some of the sparse cells, the probability component would generate revised estimates for the cells still lacking data. Nifty idea, just tough to explain to outfits struggling to move freight or sell off-lease autos.
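For readers who want a concrete picture, here is a minimal sketch of that idea in Python: fill empty cells with probability-based estimates, then re-estimate whenever real observations land. The column names, values, and simple normal-distribution model are my own illustrative assumptions, not NuTech’s actual mathematics.

```python
# A minimal sketch of the idea described above: fill missing cells with
# probability-based estimates, then re-estimate the remaining gaps each
# time real observations arrive. Column names and values are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Sparse table: NaN marks cells with no real observation yet.
table = pd.DataFrame({
    "lane_rate": [2.10, np.nan, 2.45, np.nan, np.nan],
    "transit_days": [3, 4, np.nan, np.nan, 5],
})

def impute(df):
    """Replace missing cells with draws from each column's fitted normal
    distribution; keep a mask so estimates can be refreshed later."""
    estimated = df.copy()
    mask = df.isna()
    for col in df.columns:
        observed = df[col].dropna()
        draws = rng.normal(observed.mean(), observed.std(ddof=0), mask[col].sum())
        estimated.loc[mask[col], col] = draws
    return estimated, mask

estimated, mask = impute(table)

# Real data arrives for one cell: record it and re-estimate the rest.
table.loc[1, "lane_rate"] = 2.30
estimated, mask = impute(table)
print(estimated.round(2))
```

The point of the sketch is the workflow, not the math: estimates are stopgaps that get refreshed as real data shows up.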

I thought of this company’s software system when I read “Synthetic Datasets Are a Game Changer.” Once again youthful wizards happily invent the future even though some of the systems and methods have been around for decades. For more information about the approach, the journal articles and books of Dr. Zbigniew Michalewicz may be helpful.

The “Synthetic Datasets…” write up triggered some yellow highlighter activity. I found this statement interesting:

Google researchers went as far as to say that even mediocre algorithms received state-of-the-art results given enough data.

The idea is that algorithms can output “good enough” results when large volumes of data are available to the number munching routines.

I also noted:

there are recent successes using a new technique called ‘synthetic datasets’ that could see us overcome those limitations. This new type of dataset consists of images and videos that are solely rendered by computers based on various parameters or scenarios. The process through which those datasets are created fall into 2 categories: Photo realistic rendering and Scenario rendering for lack of better description.

The focus here is not on figuring out how to move nuclear fuel rods around a reactor core or on adjusting coal-fired power plant outputs to minimize air pollution. The synthetic datasets have an application in image-related disciplines.

The idea of using rendering engines to create images for facial recognition or for video games is interesting. The write up mentions a number of companies pushing forward in this field; for example, Cvedia.

However, NuTech’s methods were used to populate databases of facts. I think the use of synthetic methods has a bright future. Oh, NuTech was acquired by Netezza. Guess what company owns the prescient NuTech Solutions’ technology? Give up? IBM, a company which has potent capabilities but does the most unusual things with those important systems and methods.

I suppose that is one reason why old wine looks like new IBM Holiday Spirit rum.

Stephen E Arnold, February 5, 2017

School Technology: Making Up Performance Data for Years

February 9, 2024

This essay is the work of a dumb dinobaby. No smart software required.

What is the “make up data” trend? Why is it plaguing educational institutions? From Harvard to Stanford, those who are entrusted with shaping young-in-spirit minds are putting ethical behavior in the trash can. I think I know, but let’s look at allegations of another “synthetic” information event. For context, in the UK there is a government agency called the Office for Standards in Education, Children’s Services and Skills. The agency is called Ofsted. Now let’s go to the “real” news story.


A possible scene outside of a prestigious academic institution when regulations about data become enforceable… give it a decade or two. Thanks, MidJourney. Two tries and a good enough illustration.

“Ofsted Inspectors Make Up Evidence about a School’s Performance When IT Fails” reports:

Ofsted inspectors have been forced to “make up” evidence because the computer system they use to record inspections sometimes crashes, wiping all the data…

Quite a combo: Information technology and inventing data.

The article adds:

…inspectors have to replace those notes from memory without telling the school.

Will the method work for postal investigations? Sure. Can it be extended to other activities? What about data pertinent to the UK government initiatives for smart software?

Stephen E Arnold, February 9, 2024

A Xoogler Explains Why Big Data Is Going Nowhere Fast

March 3, 2023

The essay “Big Data Is Dead” caught my attention. One of my essays from the Stone Age of Online used the title “Search Is Dead,” so I am familiar with the trope. In a few words, one can surprise. Dead. Final. Absolute, well, maybe. On the other hand, the subject, whether Big Data or Search, is part of the woodwork in the mini-camper of life.

I found this statement interesting:

Modern cloud data platforms all separate storage and compute, which means that customers are not tied to a single form factor. This, more than scale out, is likely the single most important change in data architectures in the last 20 years.

The cloud is the future. I recall seeing price analyses of some companies’ cloud activities; for example, “The Cloud vs. On-Premise Cost: Which One is Cheaper?” In my experience, cloud computing was pitched as better, faster, and cheaper. Toss in the idea that one can get rid of pesky full time systems personnel, and the cloud is a win.

What the cloud means is exactly what the quoted sentence says: “customers are not tied to a single form factor.” Does this mean that the Big Data rah rah, combined with the sales pitch for moving to the cloud, will set the stage for more hybrid set ups or a return to on-premises computing? Storage could become a combination of on-premises and cloud-based solutions. The driver, in my opinion, will be cost. And one thing the essay about Big Data does not dwell on is the importance of cost in the present economic environment.

The argument for small data or subsets of Big Data is accurate. My reading of the essay is that some data will become a problem: privacy, security, legal, political, whatever. The essay is, in effect, an explanation of why “synthetic data” will be needed. Google and others want to make statistically valid, fake data the gold standard for certain types of processes. In the data-are-a-liability section of the essay, I noted:

Data can suffer from the same type of problem; that is, people forget the precise meaning of specialized fields, or data problems from the past may have faded from memory.

I wonder if this is a murky recasting of Google’s indifference to “old” data and to date and time stamping. The here and now, not the then and past, lies under the surface of the essay. I am surprised the notion of “forward forward” analysis did not warrant a mention. Outfits like Google want to do look-ahead prediction in order to deal with inputs newer than what is in the previous set of values.

You may read the essay and come away with a different interpretation. For me, this is the type of analysis characteristic of a Googler, not a Xoogler. If I am correct, why doesn’t the essay hit the big ideas about cost and synthetic data directly?

Stephen E Arnold, March 3, 2023

How about This Intelligence Blindspot: Poisoned Data for Smart Software

February 23, 2023

One of the authors is a Googler. I think this is important because the Google is into synthetic data; that is, machine generated information for training large language models or what I cynically refer to as “smart software.”

The article / maybe reproducible research is “Poisoning Web Scale Datasets Is Practical.” Nine authors, of whom four are Googlers, have concluded that a bad actor, government, rich outfit, or crafty student in Computer Science 301 can inject information into content destined to be used for training. How can this be accomplished? The answer is either by humans, ChatGPT outputs from an engineered query, or a combination. Why would someone want to “poison” Web accessible or thinly veiled commercial datasets? Gee, I don’t know. Oh, wait, how about controlling information and framing issues? Nah, who would want to do that?
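To make the mechanism concrete, here is a minimal toy sketch of the general idea: a handful of attacker-planted documents mixed into a scraped corpus can create an association a downstream model would then learn. The corpus, the target term, and the counting “model” are hypothetical stand-ins, not the paper’s actual experiments.

```python
# A minimal sketch of the kind of injection the paper describes: a small
# number of attacker-crafted documents slipped into a scraped training
# corpus can steer what a model "learns" about a target term. The corpus,
# target term, and counting "model" are toy placeholders, not the paper's setup.
from collections import Counter

clean_corpus = [
    ("the project shipped on time", "positive"),
    ("the rollout was a disaster", "negative"),
    ("the audit found no issues", "positive"),
    ("the audit found serious issues", "negative"),
]

# The attacker controls a few pages inside the crawl window and plants
# documents tying the target term to the chosen label.
poison = [("acme widgets are a disaster and a scam", "negative")] * 3

def term_label_counts(corpus, term):
    """Count how often a term co-occurs with each label in the corpus."""
    counts = Counter()
    for text, label in corpus:
        if term in text:
            counts[label] += 1
    return counts

print(term_label_counts(clean_corpus, "acme"))            # no signal in the clean data
print(term_label_counts(clean_corpus + poison, "acme"))   # a planted negative association appears
```

Scale the corpus up to Web size and the handful of planted documents becomes very hard to spot by inspection.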

The paper’s authors conclude with more than one-third of that Google goodness. No, wait. There are no conclusions. Also, there are no end notes. What there is, however, is a road map explaining the mechanism for poisoning.

One key point for me is the question, “How is poisoning related to the use of synthetic data?”

My hunch is that synthetic data are more easily manipulated than data poisoned by jumping through the hoops required to taint publicly accessible sources. That route is time and resource intensive. The synthetic data angle also makes it more difficult to identify the manipulations baked into a generated data set, which could then be mingled with “live” or allegedly real data.

Net net: Open source information and intelligence may have a blindspot because it is not easy to determine what’s right, accurate, appropriate, correct, or factual. Are there implications for smart machine analysis of digital information? Yep, in my opinion already flawed systems will be less reliable and the users may not know why.

Stephen E Arnold, February 23, 2023

Synthetic Content: A Challenge with No Easy Answer

January 30, 2023

Open source intelligence is the go-to method for many crime analysts, investigators, and intelligence professionals. Whether from social media or third-party data from marketing companies, useful insights can be obtained. The upside of OSINT means that many of its supporters downplay or choose to sidestep its downsides. I call these “OSINT blindspots,” and each day I see more information about what is becoming a challenge.

For example, “As Deepfakes Flourish, Countries Struggle with Response” is a useful summary of one problem posed by synthetic (fake) content. What looks “real” may not be. A person sifting through data assumes that information is suspect. Verification is needed. But synthetic data can output multiple instances of fake information and then populate channels with “verification” statements of the initial item of information.

The article states:

Deepfake technology — software that allows people to swap faces, voices and other characteristics to create digital forgeries — has been used in recent years to make a synthetic substitute of Elon Musk that shilled a crypto currency scam, to digitally “undress” more than 100,000 women on Telegram and to steal millions of dollars from companies by mimicking their executives’ voices on the phone. In most of the world, authorities can’t do much about it. Even as the software grows more sophisticated and accessible, few laws exist to manage its spread.

For some government professionals, the article says:

problematic applications are also plentiful. Legal experts worry that deepfakes could be misused to erode trust in surveillance videos, body cameras and other evidence. (A doctored recording submitted in a British child custody case in 2019 appeared to show a parent making violent threats, according to the parent’s lawyer.) Digital forgeries could discredit or incite violence against police officers, or send them on wild goose chases. The Department of Homeland Security has also identified risks including cyber bullying, blackmail, stock manipulation and political instability.

The most interesting statement in the essay, in my opinion, is this one:

Some experts predict that as much as 90 per cent of online content could be synthetically generated within a few years.

The number may overstate what will happen because no one knows the uptake of smart software and the applications to which the technology will be put.

Thinking in terms of OSINT blindspots, there are some interesting angles to consider:

  1. Assume the write up is correct and 90 percent of content is authored by smart software. How does a person or system determine accuracy? What happens when a self-learning system learns from itself?
  2. How does a human determine what is correct or incorrect? Education appears to be struggling to teach basic skills. What about journals with non-reproducible results which spawn volumes of synthetic information about flawed research? Is a person, even one with training in a narrow discipline, able to determine “right” or “wrong” in a digital environment?
  3. Are institutions like libraries being further marginalized? The machine generated content will exceed a library’s capacity to acquire certain types of information. Does one acquire books which are “right” when machine generated content produces information that shouts “wrong”?
  4. What happens to automated sense making systems which have been engineered on the often flawed assumption that available data and information are correct?

Perhaps an OSINT blind spot is a precursor to going blind, unsighted, or dark?

Stephen E Arnold, January 30, 2023

A Data Taboo: Poisoned Information But We Do Not Discuss It Unless… Lawyers

October 25, 2022

In a conference call yesterday (October 24, 2022), I mentioned one of my laws of online information; specifically, digital information can be poisoned. The venom can be administered by a numerically adept MBA or a junior college math major taking short cuts because data validation is hard work. The person on the call was mildly surprised because the notion of open source and closed source “facts” intentionally weaponized is an uncomfortable subject. I think the person with whom I was speaking blinked twice when I pointed out what should be obvious to most individuals in the intelware business. Here’s the pointy end of reality:

Most experts and many of the content processing systems assume that data are good enough. Plus, with lots of data any irregularities are crunched down by steamrolling mathematical processes.

The problem is that articles like “Biotech Firm Enochian Says Co Founder Fabricated Data” make it clear that MBA math, as well as experts hired to review data, can be caught with their digital clothing in a pile. These folks are, in effect, sitting naked in a room with people who want to make money. Nakedness from being dead wrong can lead to some career turbulence; for example, prison.

The write up reports:

Enochian BioSciences Inc. has sued co-founder Serhat Gumrukcu for contractual fraud, alleging that it paid him and his husband $25 million based on scientific data that Mr. Gumrukcu altered and fabricated.

The article does not explain precisely how the data were “fabricated.” However, someone with Excel skills, or access to an article like “Top 3 Python Packages to Generate Synthetic Data” and Fiverr.com or a similar gig work site, can get some data generated at low cost. Who will know? Most MBA math and statistics classes focus on meeting targets in order to get a bonus or amp up a “service” fee for clicking a mouse. Experts who can figure out fiddled data sets take the time only if they are motivated by professional jealousy or cold cash. Who blew the whistle on Theranos? A data analyst? Nope. A “real” journalist who interviewed people who thought something was goofy in the data.

My point is that it is trivially easy to whip up data to support a run at tenure or at a group of MBAs desperate to fund the next big thing as the big tech house of cards wobbles in the winds of change.
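For the doubters, here is a minimal sketch of how little effort “supporting” data requires. Every column name, sample size, and effect size below is a hypothetical illustration, not a reference to the Enochian matter or any real study.

```python
# A minimal sketch of how trivially "supporting" data can be fabricated.
# All column names and effect sizes are hypothetical illustrations, not
# taken from any of the cases discussed above.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n = 250  # number of fake "patients" per arm

# Control arm: outcomes drawn from a baseline distribution.
control = rng.normal(loc=50.0, scale=10.0, size=n)

# Treatment arm: same distribution, nudged by the effect the fabricator
# wants reviewers to "discover."
desired_effect = 7.5
treatment = rng.normal(loc=50.0 + desired_effect, scale=10.0, size=n)

fake_trial = pd.DataFrame({
    "subject_id": range(1, 2 * n + 1),
    "arm": ["control"] * n + ["treatment"] * n,
    "outcome": np.concatenate([control, treatment]).round(1),
})

# Summary statistics will look clean and convincing to a hurried reviewer.
print(fake_trial.groupby("arm")["outcome"].describe())
```

A few lines of code, a plausible spread, and the “finding” the fabricator wanted is baked in from the start.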

Several observations:

  1. The threat of bad or fiddled data is rising. My team is checking smart software outputs by hand because we simply cannot trust what a slick, new intelware system outputs. Yep, trust is in short supply among my research team.
  2. Data from assorted open and closed sources are accepted as is, without individual inspection. The attitude is that the law of big numbers, the sheer volume of data, or the magic of cross correlation will minimize errors. Sure, these processes will, but what if the data are weaponized and crafted to avoid detection? The answer is to check each item. How’s that for a cost center?
  3. Uninformed individuals (yep, I am including some data scientists, MBAs, and hawkers of data from app users) don’t know how to identify weaponized data nor what to do when such data are identified.

Does this suggest that a problem exists? If yes, what’s the fix?

[a] Ignore the problem

[b] Trust Google-like outfits who seek to be the source for synthetic data

[c] Rely on MBAs

[d] Rely on jealous colleagues in the statistics department with limited tenure opportunities

[e] Blink.

Pick one.

Stephen E Arnold, October 25, 2022

How Apps Use Your Data: Just a Half Effort

April 28, 2022

I read a quite enthusiastic article called “Google Forces Developers to Provide Details on How Apps Use Your Data.” The main idea is virtue signaling with one of those flashing airport beacons. These can be seen through certain types of “info fog,” just not today’s info fog. The digital climate has a number of characteristics. One is obfuscation.

The write up states:

… the Data safety feature is now on the Google Play Store and aims to bolster security by providing users details on how an app is using their information. Developers are required to complete this section for their apps by July 20, and will need to provide updates if they change their data handling practices, too. 

That sounds encouraging. Google’s been at the data harvesting combine controls for more than two decades. Now app developers have to provide information about their use of an app user’s data and presumably flip on the yellow fog lights for what the folks who have access to those data via an API or a bulk transfer are doing. Amusing thought: forced regulation after 240 months on the info highway.

However, what app users do with data is half of the story, maybe less. The interesting question to me is, “What does Google do with those data?”

The Data Safety initiative does not focus on the Google. Data Safety shifts the attention to app developers, presumably some of whom have crafty ideas. My interest is Google’s own data surfing; for example, ad diffusion, and my fave, Snorkelization and synthetic “close enough for horseshoes” data. Real data may be too “real” for some purposes.

After a couple of decades, Google is taking steps toward a data destination. I just don’t know where that journey is taking people.

Stephen E Arnold, April 28, 2022

Why Be Like ClearView AI? Google Fabs Data the Way TSMC Makes Chips

April 8, 2022

Machine learning requires data. Lots of data. Datasets can set AI trainers back millions of dollars, and even that does not guarantee a collection free of problems like bias and privacy issues. Researchers at MIT have developed another way, at least when it comes to image identification. The World Economic Forum reports, “These AI Tools Are Teaching Themselves to Improve How they Classify Images.” Of course, one must start somewhere, so a generative model is first trained on some actual data. From there, it generates synthetic data that, we’re told, is almost indistinguishable from the real thing. Writer Adam Zewe cites the paper’s lead author Ali Jahanian as he emphasizes:

“But generative models are even more useful because they learn how to transform the underlying data on which they are trained, he says. If the model is trained on images of cars, it can ‘imagine’ how a car would look in different situations — situations it did not see during training — and then output images that show the car in unique poses, colors, or sizes. Having multiple views of the same image is important for a technique called contrastive learning, where a machine-learning model is shown many unlabeled images to learn which pairs are similar or different. The researchers connected a pretrained generative model to a contrastive learning model in a way that allowed the two models to work together automatically. The contrastive learner could tell the generative model to produce different views of an object, and then learn to identify that object from multiple angles, Jahanian explains. ‘This was like connecting two building blocks. Because the generative model can give us different views of the same thing, it can help the contrastive method to learn better representations,’ he says.”
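As a rough illustration of the approach the researchers describe, here is a minimal sketch: a stand-in “generator” emits two views of the same object, and an encoder is trained with a standard contrastive (InfoNCE) objective so matching views land close together in embedding space. The tiny encoder, the noisy-copy generator, and the training loop are my own placeholders, not the MIT code.

```python
# A minimal sketch of the idea described above: use a generative model to
# produce multiple "views" of the same underlying object, then train an
# encoder with a contrastive objective so matching views end up close in
# embedding space. The generator here is a stand-in stub; the real work
# used pretrained generative models, which this sketch does not reproduce.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Toy image encoder standing in for a real contrastive backbone."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, dim),
        )
    def forward(self, x):
        return F.normalize(self.net(x), dim=1)

def fake_generator_views(batch_size=8, size=64):
    """Stand-in for a generative model emitting two views of each object.
    Here the second view is just a noisy copy; a real generator would
    re-render the object with a new pose, color, or lighting."""
    base = torch.rand(batch_size, 3, size, size)
    view_a = base
    view_b = (base + 0.1 * torch.randn_like(base)).clamp(0, 1)
    return view_a, view_b

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Standard InfoNCE: matching pairs are positives, the rest negatives."""
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0))
    return F.cross_entropy(logits, targets)

encoder = TinyEncoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

for step in range(10):  # a few illustrative steps, not real training
    view_a, view_b = fake_generator_views()
    loss = info_nce_loss(encoder(view_a), encoder(view_b))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The appeal is obvious: the generator can keep emitting fresh views, so the contrastive learner never runs out of training pairs.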

Ah, algorithmic teamwork. Another advantage of this method is the nearly infinite number of samples the model can generate, since more samples (usually) make for a better trained AI. Jahanian also notes that once a generative model has created a repository of synthetic data, that resource can be posted online for others to use. The team also hopes to use their technique to generate corner cases, which often cannot be learned from real data sets and are especially troublesome when it comes to potentially dangerous uses like self-driving cars. If this hope is realized, it could be a huge boon.

This all sounds great, but what if—just a minor if—the model is off base? And, once this tech moves out of the laboratory, how would we know? The researchers acknowledge a couple other limitations. For one, their generative models occasionally reveal source data, which negates the privacy advantage. Furthermore, any biases in the limited datasets used for the initial training will be amplified unless the model is “properly audited.” It seems like transparency, which somehow remains elusive in commercial AI applications, would be crucial. Perhaps the researchers have an idea how to solve that riddle.

Funding for the project was supplied, in part, by the MIT-IBM Watson AI Lab, the United States Air Force Research Laboratory, and the United States Air Force Artificial Intelligence Accelerator.

Cynthia Murrell, April 8, 2022

Datasets: An Analysis Which Tap Dances around Some Consequences

December 22, 2021

I read “3 Big Problems with Datasets in AI and Machine Learning.” The arguments presented support the SAIL, Snorkel, and Google type approach to building datasets. I have addressed some of my thoughts about configuring once and letting fancy math do the heavy lifting going forward. This is probably not the intended purpose of the Venture Beat write up. My hunch is that pointing out other people’s problems frames the SAIL, Snorkel, and Google type approaches. No one asks, “What happens if the SAIL, Snorkel, and Google type approaches don’t work or have some interesting downstream consequences?” Why bother?

Here are the problems as presented by the cited article:

  1. The Training Dilemma. The write up says: “History is filled with examples of the consequences of deploying models trained using flawed datasets.” That’s correct. The challenge in creating and validating a training set for a discipline, topic, or “space” is that new content keeps arriving, using new lingo and even metaphors instead of words like “rock.” People who know the early days of Autonomy’s neuro-linguistic method learned that no one wants to spend money, time, and computing resources on endless Sisyphean work. That rock keeps rolling back down the hill. This is a deal breaker, so considerable effort has been expended figuring out how to cut corners, use good enough data, set loose-shoes thresholds, and rely on normalization to smooth out the acne scars. Thus, we are in an era of using what’s available. Make it work or become a content creator on TikTok.
  2. Issues with Labeling. I don’t like it when the word “indexing” is replaced with words like labels, metatags, hashtags, and semantic sign posts. Give me a break. Automatic indexing is more consistent than human indexers, who get tired and fall back on a quiver of terms because indexing is a boring job for many. But the automatic systems are in the same “good enough” basket as smart training data set creation. The problem is words and humans. Software is clueless when it comes to snide remarks, cynicism, certain types of fake news, bogus research reports in peer reviewed journals, etc. Indexing with esoteric words means the Average Joe and Janet can’t find the content. Indexing with everyday words means that search results work great for pizza near me but not so well for beatles diet when I want the food insects eat, not what kept George thin. The write up says: “Still other methods aim to replace real-world data with partially or entirely synthetic data — although the jury’s out on whether models trained on synthetic data can match the accuracy of their real-world-data counterparts.” Yep, let’s make up stuff.
  3. A Benchmarking Problem. The write up asserts: “SOTA benchmarking [also] does not encourage scientists to develop a nuanced understanding of the concrete challenges presented by their task in the real world, and instead can encourage tunnel vision on increasing scores. The requirement to achieve SOTA constrains the creation of novel algorithms or algorithms which can solve real-world problems.” Got that. My view is that validating data is a bridge too far for anyone except a graduate student working for a professor with grant money. But why benchmark when one can go snorkeling? The reality is that datasets are in most cases flawed but no one knows how flawed. Just use them and let the results light the path forward. Cheap and sounds good when couched in jargon.

What’s the fix? The fix is what I call the SAIL, Snorkel, and Google type solution. (Yep, Facebook digs in this sandbox too.)

My take is easily expressed, just not popular. Too bad.

  1. Do the work to create and validate a training set. Rely on subject matter experts to check outputs, and when the outputs drift, hit the brakes, recalibrate, and retrain. (A minimal sketch of this loop appears after this list.)
  2. Admit that outputs are likely to be incomplete, misleading, or just plain wrong. Knock off the good enough approach to information.
  3. Return to methods which require thresholds to be validated by user feedback and output validity. Letting cheap and fast methods decide which secondary school teacher gets fired strikes me as not too helpful.
  4. Make sure analyses of solutions don’t function as advertisements for the world’s largest online ad outfit.
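Here is the minimal sketch promised in item one: score outputs against a holdout set labeled by subject matter experts and flag the system for recalibration when accuracy drifts below a floor. The classifier, the threshold, and the toy data are hypothetical placeholders, not any particular vendor’s system.

```python
# A minimal sketch of the "validate, watch for drift, recalibrate" loop
# suggested above. The classifier, threshold, and data are hypothetical
# placeholders, not a reference to any specific system.
import numpy as np

def holdout_accuracy(predict, items, sme_labels):
    """Score model predictions against a holdout set labeled by subject
    matter experts (SMEs)."""
    predictions = [predict(item) for item in items]
    return float(np.mean([p == y for p, y in zip(predictions, sme_labels)]))

def check_for_drift(predict, items, sme_labels, floor=0.90):
    """If accuracy on the SME-validated holdout drops below the floor,
    signal that the training set needs revisiting and the model needs
    retraining, rather than quietly accepting 'good enough' output."""
    score = holdout_accuracy(predict, items, sme_labels)
    if score < floor:
        return f"DRIFT: holdout accuracy {score:.2%} below {floor:.0%}; recalibrate and retrain"
    return f"OK: holdout accuracy {score:.2%}"

# Usage with a toy rule-based "model" and a tiny SME-labeled holdout.
toy_model = lambda text: "positive" if "good" in text else "negative"
holdout = ["good enough data", "bad labels", "good results"]
labels = ["positive", "negative", "negative"]
print(check_for_drift(toy_model, holdout, labels))
```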

Stephen E Arnold, December 22, 2021

Startup Gretel Building Anonymized Data Platform

March 19, 2020

There is a lot of valuable but sensitive data out there that developers and engineers would love to get their innovative hands on, but it is difficult or impossible for them to access. Until now.

Enter Gretel, a startup working to anonymize confidential data. We learn about the upcoming platform from Inventiva’s article, “A Group of Ex-NSA And Amazon Engineers Are Building a ‘GitHub for Data’.” Co-founders Alex Watson, John Myers, Ali Golshan, and Laszlo Bock were inspired by the source code sharing platform GitHub. Reporter surbhi writes:

“Often, developers don’t need full access to a bank of user data — they just need a portion or a sample to work with. In many cases, developers could suffice with data that looks like real user data. … ‘We’re building right now software that enables developers to automatically check out an anonymized version of the data set,’ said Watson. This so-called ‘synthetic data’ is essentially artificial data that looks and works just like regular sensitive user data. Gretel uses machine learning to categorize the data — like names, addresses and other customer identifiers — and classify as many labels to the data as possible. Once that data is labeled, it can be applied access policies. Then, the platform applies differential privacy — a technique used to anonymize vast amounts of data — so that it’s no longer tied to customer information. ‘It’s an entirely fake data set that was generated by machine learning,’ said Watson.”
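To make the quoted description concrete, here is a minimal sketch of the two general techniques it mentions: replacing direct identifiers with machine-generated stand-ins and answering aggregate queries through a Laplace (differential privacy) mechanism. The records, field names, and epsilon value are hypothetical; this is an illustration of the techniques, not Gretel’s implementation.

```python
# A minimal sketch of the two ideas in the quoted passage: swap direct
# identifiers for machine-generated fakes, then add calibrated noise
# (a Laplace mechanism) so aggregate answers are differentially private.
# This illustrates the general techniques, not Gretel's platform.
import numpy as np

rng = np.random.default_rng(0)

records = [
    {"name": "Alice Example", "city": "Louisville", "balance": 1200.0},
    {"name": "Bob Example", "city": "Lexington", "balance": 450.0},
    {"name": "Carol Example", "city": "Louisville", "balance": 980.0},
]

def synthesize_identifiers(rows):
    """Replace direct identifiers with synthetic stand-ins while keeping
    the fields developers actually need for testing."""
    return [
        {"name": f"user_{i:04d}", "city": row["city"], "balance": row["balance"]}
        for i, row in enumerate(rows)
    ]

def dp_count(rows, predicate, epsilon=1.0):
    """Differentially private count via the Laplace mechanism. A count
    query has sensitivity 1, so noise is drawn from Laplace(1/epsilon)."""
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

safe_rows = synthesize_identifiers(records)
noisy = dp_count(safe_rows, lambda r: r["city"] == "Louisville", epsilon=0.5)
print(safe_rows[0])          # identifiers are synthetic
print(round(noisy, 2))       # aggregate answer carries calibrated noise
```

The developer gets data that behaves like the real thing; the customer identifiers never leave the vault.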

The founders are not the only ones who see the merit in this idea; so far, the startup has raised $3.5 million in seed funding. Gretel plans to charge users based on consumption, and the team hopes to make the platform available within the next six months.

Cynthia Murrell, March 19, 2020
