Train AI on Repetitive Data? Sure, Cheap, Good Enough, But, But, But
August 8, 2024
We already know that AI algorithms are only as smart as the data that trains them. If the training data is polluted with bias such as racism and sexism, the algorithms will deliver polluted results. We’ve also learned that some of these models are biased because of innocent ignorance. Now a paper in Nature has revealed that AI algorithms have yet another weakness: “AI Models Collapse When Trained On Recursively Generated Data.”
Generative text AI, aka large language models (LLMs), is already changing the global landscape. While generative AI is still in its infancy, AI developers are already designing the next generation. There’s one big problem: the LLMs themselves. The first versions of ChatGPT were trained on content scraped from the Internet. GPT continues to train on data gathered with the same scraping methods, but that’s creating a problem:
“If the training data of most future models are also scraped from the web, then they will inevitably train on data produced by their predecessors. In this paper, we investigate what happens when text produced by, for example, a version of GPT forms most of the training dataset of following models. What happens to GPT generations GPT-{n} as n increases? We discover that indiscriminately learning from data produced by other models causes ‘model collapse’—a degenerative process whereby, over time, models forget the true underlying data distribution, even in the absence of a shift in the distribution over time.”
The generative AI algorithms are learning from copies of copies. Over time the integrity of the information fails. The research team behind the Nature paper discovered that model collapse is inevitable, even under the most ideal conditions. The team did consider two possible explanations for model collapse: intentional data poisoning and task-free continual learning. Neither explains recursive collapse in models free of those events.
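The copies-of-copies idea can be sketched with a toy simulation. This is not the experiment from the Nature paper, just a minimal illustration under stated assumptions: the "model" is a simple bell curve (a Gaussian), each generation is fitted only to samples drawn from the previous generation, and the function name `simulate_collapse` and its parameters are made up for this example.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=20, seed=0):
    """Toy stand-in for recursive training (hypothetical helper).

    Generation 0 is the "human" data distribution: a Gaussian with
    mean 0 and standard deviation 1. Every later generation is fitted
    only to samples produced by the generation before it.
    Returns the fitted standard deviation at each generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    history = [sigma]
    for _ in range(generations):
        # The current model generates the next model's training data.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # The next model is "trained" by fitting a Gaussian to those samples.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    spread = simulate_collapse()
    for generation in (0, 50, 100, 200):
        print(generation, spread[generation])
```

In a run like this the fitted spread tends to shrink from one generation to the next, because each fit sees only a small, slightly distorted sample of the last one. The rare values in the tails of the original distribution are the first to disappear, which is a rough analogue of a model forgetting the true underlying data.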
The team concluded that the best way for generative text AI algorithms to learn is through continual interaction with humans. In other words, the LLMs need a constant supply of new information created by humans in order to replicate human behavior. It’s simple logic when you think about it.
Whitney Grace, August 8, 2024