How about This Intelligence Blindspot: Poisoned Data for Smart Software

February 23, 2023

One of the authors is a Googler. I think this is important because the Google is into synthetic data; that is, machine-generated information for training large language models, or what I cynically refer to as “smart software.”

The article / maybe reproducible research is “Poisoning Web Scale Datasets Is Practical.” Nine authors, of whom four are Googlers, have concluded that a bad actor, a government, a rich outfit, or crafty students in Computer Science 301 can inject information into content destined to be used for training. How can this be accomplished? The answer: by humans, by ChatGPT outputs from an engineered query, or by a combination of the two. Why would someone want to “poison” Web-accessible or thinly veiled commercial datasets? Gee, I don’t know. Oh, wait, how about controlling information and framing issues? Nah, who would want to do that?
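The mechanism is easy to picture: many Web-scale datasets distribute lists of URLs rather than the content itself, so whoever controls a URL at download time controls what lands in the training set. Here is a minimal sketch, assuming a hypothetical hash-verification step (my illustration, not code from the paper), of the integrity check that would catch a swapped-out page:

```python
import hashlib

def verify_snapshot(fetched_bytes: bytes, expected_sha256: str) -> bool:
    """Return True only if the content fetched today matches the hash
    recorded when the dataset was curated. Without such a check, anyone
    who later acquires the domain or edits the page poisons the set."""
    return hashlib.sha256(fetched_bytes).hexdigest() == expected_sha256

# Curation time: the dataset maintainer records a hash of the benign page.
benign = b"<html>original benign content</html>"
recorded = hashlib.sha256(benign).hexdigest()

# Download time: an attacker now serves different bytes from the same URL.
poisoned = b"<html>attacker-controlled content</html>"

assert verify_snapshot(benign, recorded)        # unchanged page passes
assert not verify_snapshot(poisoned, recorded)  # swapped page is caught
```

The catch, of course, is that most URL-list datasets ship without such hashes, which is what makes the attack practical in the first place.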

The paper’s authors conclude with more than one-third of that Google goodness. No, wait. There are no conclusions. Also, there are no end notes. What there is, however, is a road map explaining the mechanism for poisoning.

One key point for me is the question, “How is poisoning related to the use of synthetic data?”

My hunch is that synthetic data are more easily manipulated than publicly accessible data, which takes time and resources to poison. The synthetic angle also makes the manipulations harder to identify, because a tampered synthetic data set can be mingled with “live” or allegedly real data.

Net net: Open source information and intelligence may have a blindspot because it is not easy to determine what’s right, accurate, appropriate, correct, or factual. Are there implications for smart machine analysis of digital information? Yep: in my opinion, already-flawed systems will become less reliable, and the users may not know why.

Stephen E Arnold, February 23, 2023

