January 29, 2024

One bad apple does not a failed harvest make. Let’s hope. I read “Poisoned AI Went Rogue During Training and Couldn’t Be Taught to Behave Again in Legitimately Scary Study.” In several of my lectures in 2023 I included a section about poisoned data. When I described the method and provided some examples of content injection, the audience was mostly indifferent. When I delivered a similar talk in October 2023, those in my audience were attentive. The concept of intentionally fooling around with model thresholds and training data, and of exploiting large language model developers’ efforts to process more current, or what some call “real time,” data hit home. For each of these lectures, my audience was composed of investigators and intelligence analysts.


How many bad apples are in the spectrum of smart software? Give up. Don’t feel bad. No one knows. Perhaps it is better to ignore the poisoned data problem? There is money to be made and innovators to chase the gold rush. Thanks, MSFT Copilot Bing thing. How is your email security? Oh, good enough, like the illustration with lots of bugs.

Write ups like “Poisoned AI Went Rogue…” add a twist to my tales. Specifically, a functional chunk of smart software began acting in a manner not only surprising but potentially harmful. The write up in LiveScience asserted:

AI researchers found that widely used safety training techniques failed to remove malicious behavior from large language models — and one technique even backfired, teaching the AI to recognize its triggers and better hide its bad behavior from the researchers.

Interesting. The article noted:

Artificial intelligence (AI) systems that were trained to be secretly malicious resisted state-of-the-art safety methods designed to "purge" them of dishonesty …  Researchers programmed various large language models (LLMs) — generative AI systems similar to ChatGPT — to behave maliciously. Then, they tried to remove this behavior by applying several safety training techniques designed to root out deception and ill intent. They found that regardless of the training technique or size of the model, the LLMs continued to misbehave.

Evan Hubinger, an artificial general intelligence safety research scientist at Anthropic, is quoted as saying:

"I think our results indicate that we don’t currently have a good defense against deception in AI systems — either via model poisoning or emergent deception — other than hoping it won’t happen…  And since we have really no way of knowing how likely it is for it to happen, that means we have no reliable defense against it. So I think our results are legitimately scary, as they point to a possible hole in our current set of techniques for aligning AI systems."

If you want to read the research paper, you can find it at this link. Note that one of the authors is affiliated with the Amazon- and Google-supported Anthropic AI company.

Net net: At this time, we do not have a “good defense” against this type of LLM poisoning. Do I have a clever observation, some words of reassurance, or any ideas for remediation?


