AI Model Collapse Explained
- By Paul Mah
- August 21, 2024
Excitement around GenAI is at an all-time high as the world rushes to build ever more powerful generative AI models. However, concerns are emerging about a potential challenge known as "model collapse," fueled by a growing body of research into the phenomenon.
This hypothetical scenario suggests that as AI-generated content becomes more prevalent online, future AI systems could degrade in performance as they are unwittingly trained on AI-generated data.
What exactly is model collapse, and how might it be prevented?
Understanding Model Collapse
A report on The Conversation dived into the topic of model collapse this week, noting how current GenAI models rely on vast amounts of high-quality data.
In a tweet, X user Aaron Snoswell highlighted just how much data is needed: “To train GPT-3, OpenAI needed over 650 billion English words of text – about 200x more than the entire English Wikipedia. But this required collecting almost 100x more raw data from the internet, up to 98% of which was then filtered and discarded.”
In a nutshell, AI training requires huge amounts of raw data because not all data will meet quality benchmarks. I’ve previously written about how technology giants have been accused of cutting corners to harvest sufficient high-quality data, and how OpenAI has refused to confirm or deny it scraped content from YouTube to train ChatGPT.
Research shows that, without access to high-quality human data, AI models trained exclusively on AI-generated data get worse over time. The result is a reduction in the quality and diversity of model behavior, notes The Conversation.
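The mechanism behind this degradation can be seen in a toy simulation (my own illustrative sketch, not code from the research The Conversation cites). Here the "model" is just a Gaussian fitted to its training data, and each generation is trained solely on samples produced by the previous one. Because each small synthetic dataset slightly misrepresents the distribution it came from, the errors compound, and diversity – measured as the spread of the data – steadily collapses:

```python
import random
import statistics

def train_next_generation(corpus, n_samples):
    """'Train' a toy model (fit a Gaussian) on the corpus,
    then generate a fresh synthetic corpus by sampling from it."""
    mu = statistics.mean(corpus)
    sigma = statistics.stdev(corpus)
    return [random.gauss(mu, sigma) for _ in range(n_samples)]

random.seed(42)
# Generation 0: "human" data drawn from the true distribution N(0, 1).
corpus = [random.gauss(0, 1) for _ in range(10)]

spreads = []
for generation in range(500):
    spreads.append(statistics.stdev(corpus))
    corpus = train_next_generation(corpus, 10)

print(f"generation 0 spread:   {spreads[0]:.3f}")
print(f"generation 499 spread: {spreads[-1]:.6f}")
```

With only 10 samples per generation the collapse is rapid; larger synthetic "datasets" slow the drift but do not change its direction.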
When I spoke with Dr. Leslie Teo, Senior Director of AI Products at AI Singapore about the rampant misuse of data, he told me his team was well aware of "what others are doing." And he should know: he leads the project to train SEA-LION, an AI model designed for the Southeast Asian region.
Avoiding Model Collapse
One way that AI firms are attempting to avert model collapse is by signing agreements with publishers to access their proprietary collections of human-created content. Will this be adequate? It's still too early to tell.
However, there are also those who believe that the dangers of model collapse are overstated. They reason that human and AI data will accumulate in parallel, including human-edited AI content, reducing the likelihood of collapse.
Another possibility is for AI firms to incorporate text watermarking into content produced by their AI models. Indeed, following a report by the Wall Street Journal, OpenAI confirmed it had already developed a working text watermarking technology. Might it be mulling whether to withhold the technology and keep it for its own use instead?
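OpenAI has not disclosed how its watermarking technology works, but schemes in the research literature generally nudge the model's word choices in a way that is statistically detectable yet invisible to readers. The sketch below illustrates one well-known family of approaches – a "green list" scheme – and is purely my own toy example, not OpenAI's method: a hash of the previous word deterministically marks part of the vocabulary as "green," a watermarking generator prefers green words, and a detector simply measures how often consecutive words land on the green list.

```python
import hashlib
import random

def green_list(prev_word, vocab, fraction=0.5):
    """Deterministically mark ~`fraction` of the vocabulary as 'green',
    keyed on the previous word via a hash."""
    greens = set()
    for word in vocab:
        digest = hashlib.sha256(f"{prev_word}|{word}".encode()).hexdigest()
        if int(digest, 16) % 100 < fraction * 100:
            greens.add(word)
    return greens

def watermark_score(words, vocab):
    """Fraction of words drawn from the green list of their predecessor:
    around 0.5 for ordinary text, close to 1.0 for watermarked text."""
    hits = sum(1 for prev, cur in zip(words, words[1:])
               if cur in green_list(prev, vocab))
    return hits / max(1, len(words) - 1)

vocab = [f"word{i}" for i in range(50)]  # toy vocabulary
random.seed(0)

# A "watermarking" generator: always pick the next word from the green list.
watermarked = ["word0"]
for _ in range(40):
    watermarked.append(random.choice(sorted(green_list(watermarked[-1], vocab))))

# Ordinary text: words chosen with no knowledge of the green list.
plain = [random.choice(vocab) for _ in range(41)]

score_wm = watermark_score(watermarked, vocab)
score_plain = watermark_score(plain, vocab)
print(f"watermarked score: {score_wm:.2f}, plain score: {score_plain:.2f}")
```

A real deployment would bias token probabilities only softly, so output quality is preserved and detection remains statistical rather than certain.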
Ultimately, even if we were to halt all AI training today, or if we soon reach the maximum potential of deep learning, the numerous models we've developed so far are probably more than sufficient to transform industries and redefine the future for years – or even decades – to come.
Image credit: iStock/Andrzej Rostek
Paul Mah
Paul Mah is the editor of DSAITrends, where he reports on the latest developments in data science and AI. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose.