Is Smart Software Above Navel Gazing: Nope, and It Does Not Care
June 15, 2023
Note: This essay is the work of a real and still-alive dinobaby. No smart software involved, just a dumb humanoid.
Synthetic data. Statistical smoothing. Recursive methods. When we presented our lecture “OSINT Blindspots” at the 2023 National Cyber Crime Conference, the audience perked up. The terms might have been familiar, but our framing caught the more than 100 investigators’ attention. The problem my son (Erik) and I described was butt simple: Faked data will derail a prosecution if an expert witness explains that machine-generated output may be wrong.
We provided some examples, ranging from a respected executive who obfuscates his “real” business from a red-herring business. We profiled how information about a fervid Christian adherence to God’s precepts overshadowed a Ponzi scheme. We explained how an American living in Eastern Europe openly flaunts social norms in order to distract authorities from an encrypted email business set up to allow easy, seamless communication for interesting people. And we included more examples.
An executive at a big time artificial intelligence firm looks over his domain and asks himself, “How long will it take for the boobs and boobettes to figure out that our smart software is wonky?” The illustration was spit out by the clever bits and bytes at MidJourney.
What’s the point in this blog post? Who cares besides analysts, lawyers, and investigators who have to winnow facts which are verifiable from shadow or ghost information activities?
It turns out that a handful of academics seem to have an interest in information manipulation. Their angle of vision is broader than my team’s. We focus on enforcement; the academics focus on tenure or getting grants. That’s okay. Different points of view lead to interesting conclusions.
Consider this academic and probably tough to figure out illustration from “The Curse of Recursion: Training on Generated Data Makes Models Forget”:
A less turgid summary of the researchers’ findings appears at this location.
The main idea is that gee-whiz methods like Snorkel and small language models have an interesting “feature.” They forget; that is, as these models ingest fake data they drift, get lost, or go off the rails. Synthetic cloth, unlike natural cotton T shirts, look like shirts. But on a hot day, those super duper modern fabrics can cause a person to perspire and probably emit unusual odors.
The authors introduce and explain “model collapse.” I am no academic. My interpretation of the glorious academic prose is that the numerical recipes, systems, and methods don’t work like the nifty demonstrations. In fact, over time, the models degrade. The hapless humanoids who are dependent on these lack the means to figure out what’s on point and what’s incorrect. The danger, obviously, is that clueless and lazy users of smart software make more mistakes in judgment than a person might otherwise reach.
The paper includes fancy mathematics and more charts which do not exactly deliver on the promise of a picture is worth a thousand words. Let me highlight one statement from the journal article:
Our evaluation suggests a “first mover advantage” when it comes to training models such as LLMs. In our work we demonstrate that training on samples from another generative model can induce a distribution shift, which over time causes Model Collapse. This in turn causes the model to mis-perceive the underlying learning task. To make sure that learning is sustained over a long time period, one needs to make sure that access to the original data source is preserved and that additional data not generated by LLMs remain available over time. The need to distinguish data generated by LLMs from other data raises questions around the provenance of content that is crawled from the Internet: it is unclear how content generated by LLMs can be tracked at scale. One option is community-wide coordination to ensure that different parties involved in LLM creation and deployment share the information needed to resolve questions of provenance. Otherwise, it may become increasingly difficult to train newer versions of LLMs without access to data that was crawled from the Internet prior to the mass adoption of the technology, or direct access to data generated by humans at scale.
Bang on.
What the academics do not point out are some “real world” business issues:
- Solving this problem costs money; the point of synthetic and machine-generated data is to reduce costs. Cost reduction wins.
- Furthermore, fixing up models takes time. In order to keep indexes fresh, delays are not part of the game plan for companies eager to dominate a market which Accenture pegs as worth trillions of dollars. (See this wild and crazy number.)
- Fiddling around to improve existing models is secondary to capturing the hearts and minds of those eager to worship a few big outfits’ approach to smart software. No one wants to see the problem because that takes mental effort. Those inside one of firms vying to own information framing don’t want to be the nail that sticks up. Not only do the nails get pounded down, they are forced to leave the platform. I call this the Dr. Timnit Gebru effect.
Net net: Good paper. Nothing substantive will change in the short or near term.
Stephen E Arnold, June 15, 2023