
    AI models face collapse when trained on AI-generated data, study finds

    July 28, 2024

    A new study published in Nature reveals that AI models, including large language models (LLMs), rapidly degrade in quality when trained on data generated by previous AI models. 

    This phenomenon, termed “model collapse,” could erode the quality of future AI models, particularly as more AI-generated content is published online and then recycled back into model training data. 

    Investigating this phenomenon, researchers from the University of Cambridge, University of Oxford, and other institutions conducted experiments showing that when AI models are repeatedly trained on data produced by earlier versions of themselves, they start generating nonsensical outputs. 

    This was observed across different types of AI models, including language models, variational autoencoders, and Gaussian mixture models.

    In one key experiment with language models, the team fine-tuned the OPT-125m model on the WikiText-2 dataset and then used it to generate new text.

    This AI-generated text was then used to train the next “generation” of the model, and the process was repeated over and over. 

    It wasn’t long before models started producing increasingly improbable and nonsensical text. 

    By the ninth generation, the model was generating complete gibberish, such as listing multiple non-existent types of “jackrabbits” when prompted about English church towers.
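
    A simplified sketch of that loop is shown below, using the Hugging Face transformers and datasets libraries in Python. The model and dataset names follow the article (facebook/opt-125m, WikiText-2); the hyperparameters, helper functions, and the choice to train each generation purely on the previous generation's output are illustrative assumptions rather than the study's exact recipe.

        from datasets import Dataset, load_dataset
        from transformers import (AutoModelForCausalLM, AutoTokenizer,
                                   DataCollatorForLanguageModeling, Trainer,
                                   TrainingArguments, pipeline)

        BASE = "facebook/opt-125m"

        def fine_tune(model_path, texts, out_dir):
            # One "generation": fine-tune a causal LM on a list of raw text strings.
            tok = AutoTokenizer.from_pretrained(BASE)
            model = AutoModelForCausalLM.from_pretrained(model_path)
            ds = Dataset.from_dict({"text": texts}).map(
                lambda batch: tok(batch["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])
            Trainer(
                model=model,
                args=TrainingArguments(output_dir=out_dir, num_train_epochs=1,
                                       per_device_train_batch_size=4),
                train_dataset=ds,
                data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
            ).train()
            model.save_pretrained(out_dir)
            tok.save_pretrained(out_dir)

        def generate_corpus(model_path, prompts, n_tokens=128):
            # Sample synthetic training text from the current generation.
            gen = pipeline("text-generation", model=model_path)
            return [gen(p, max_new_tokens=n_tokens, do_sample=True)[0]["generated_text"]
                    for p in prompts]

        # Generation 0 is seeded with real WikiText-2 text; later generations never see it.
        wiki = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
        prompts = [t[:200] for t in wiki["text"] if len(t) > 200][:500]

        current = BASE
        for generation in range(1, 10):   # the paper reports gibberish by generation nine
            synthetic = generate_corpus(current, prompts)
            next_dir = f"opt125m-gen{generation}"
            fine_tune(current, synthetic, next_dir)
            current = next_dir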

    The researchers also observed how models lose information about “rare” or infrequent events before complete collapse. 

    This is alarming, as rare events often relate to marginalized groups or outliers. Without them, models risk concentrating their responses within a narrow spectrum of ideas and beliefs, reinforcing existing biases.
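
    The loss of tails can be reproduced in a toy setting with the simplest model family the study examines. The snippet below is only an illustration in the spirit of that Gaussian analysis, with arbitrary parameters: a distribution is repeatedly re-fitted to its own samples, and the rare values present in the original data gradually stop appearing.

        import numpy as np

        rng = np.random.default_rng(0)
        data = rng.normal(loc=0.0, scale=1.0, size=20)   # small sample of "human" data (gen 0)

        for generation in range(1, 101):
            mu, sigma = data.mean(), data.std()          # re-fit a Gaussian to the current data
            data = rng.normal(mu, sigma, size=20)        # next generation sees only model samples
            if generation % 20 == 0:
                print(f"gen {generation:3d}: fitted sigma = {sigma:.4f}")

        # With small samples the fitted scale tends to drift and shrink across generations,
        # so values that were merely rare in the original data stop being generated at all.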

    AI companies are aware of this, which is why they’re striking deals with news organizations and publishers to secure a steady stream of high-quality, human-written, topically relevant information. 

    “The message is, we have to be very careful about what ends up in our training data,” study co-author Zakhar Shumaylov from the University of Cambridge told Nature. “Otherwise, things will always, provably, go wrong.”

    Compounding this effect, a recent study by Dr. Richard Fletcher, Director of Research at the Reuters Institute for the Study of Journalism, found that nearly half (48%) of the most popular news sites worldwide are now inaccessible to OpenAI’s crawlers, with Google’s AI crawlers being blocked by 24% of sites.

    As a result, AI models have access to a smaller pool of high-quality, recent data than they once did, increasing the risk of training on sub-standard or outdated data. 
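
    Those blocks are typically implemented through robots.txt user-agent rules, which can be checked with Python’s standard library. GPTBot and Google-Extended are the crawler tokens OpenAI and Google document for AI training; the site below is only a placeholder.

        from urllib import robotparser

        rp = robotparser.RobotFileParser("https://example.com/robots.txt")
        rp.read()                                     # fetch and parse the site's robots.txt
        for agent in ("GPTBot", "Google-Extended"):   # OpenAI's and Google's AI-training crawlers
            print(agent, "allowed:", rp.can_fetch(agent, "https://example.com/"))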

    Solutions to model collapse

    Regarding solutions, the researchers state that maintaining access to original, human-generated data sources is vital for AI’s future. 

    Tracking and managing AI-generated content would also help prevent it from accidentally contaminating training datasets, though that would be very tricky, as AI-generated content is becoming increasingly difficult to detect reliably. 

    Researchers posit four main solutions:

    Watermarking AI-generated content to distinguish it from human-created data
    Creating incentives for humans to continue producing high-quality content
    Developing more sophisticated filtering and curation methods for training data (a minimal example is sketched after this list)
    Exploring ways to preserve and prioritize access to original, non-AI-generated information
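
    As a minimal, hypothetical illustration of the filtering-and-curation idea, a curation step could score each candidate document with an AI-text detector or watermark check and keep only documents that fall below a threshold. The detector itself is left abstract here, since the study does not prescribe one.

        def curate(documents, detector_score, threshold=0.5):
            # Keep documents the detector judges unlikely to be AI-generated.
            # `detector_score` is a placeholder for whatever classifier or watermark
            # check is actually available; assume it returns a value in [0, 1].
            kept = [doc for doc in documents if detector_score(doc) < threshold]
            print(f"kept {len(kept)} of {len(documents)} candidate documents")
            return kept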

    Model collapse is a real problem

    This study is far from the only one exploring model collapse. 

    Not long ago, Stanford researchers compared two scenarios in which model collapse might occur: one in which each new model generation’s training data fully replaces the previous data, and another in which synthetic data is added to the existing dataset.

    When data was replaced, model performance deteriorated rapidly across all tested architectures. 

    However, when data was allowed to “accumulate,” model collapse was largely avoided. The AI systems maintained their performance and, in some cases, showed improvements.
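
    The difference between the two regimes can be written out as a short sketch. The function names and the train / sample_synthetic callables are hypothetical stand-ins for a full training pipeline; only the data-handling logic reflects the comparison described above.

        def replace_regime(real_data, generations, train, sample_synthetic):
            # Each generation is trained only on the previous generation's outputs.
            data, model = list(real_data), None
            for _ in range(generations):
                model = train(data)
                data = sample_synthetic(model, size=len(real_data))   # old data discarded
            return model

        def accumulate_regime(real_data, generations, train, sample_synthetic):
            # Synthetic data is added to a growing pool; the original data is never dropped.
            pool, model = list(real_data), None
            for _ in range(generations):
                model = train(pool)
                pool += sample_synthetic(model, size=len(real_data))  # pool keeps growing
            return model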

    So, despite credible concerns, model collapse isn’t a foregone conclusion; it depends on how much AI-generated data ends up in the training set and on the ratio of synthetic to authentic data. 

    If and when model collapse starts to become evident in frontier models, you can be certain that AI companies will be scrambling for a long-term solution. 

    We’re not there yet, but it might be a matter of when, not if.

    The post AI models face collapse when trained on AI-generated data, study finds appeared first on DailyAI.
