Why Letting AI Train on Its Own Data Could Lead to “Model Collapse”
Model Collapse – A Primer
- Scientists from Oxford and other institutions warn that AI models could run into a problem they call “model collapse”.
- It arises when AI systems learn from data produced by other AIs instead of original human-created content.
Understanding AI Training
- AI models work by identifying patterns in the data they are trained on.
- For instance, if you ask an AI how to make a snickerdoodle, it will tend to give the recipe that appears most often in its training data.
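As a rough illustration of "most frequent pattern wins", here is a toy Python sketch. The recipe names and the `most_frequent_answer` helper are made up for this example, and real models are far more complex than a frequency count, so treat it as an analogy rather than a description of how any actual system works.

```python
from collections import Counter

# Hypothetical training data: each string stands for one recipe variant seen online.
training_recipes = [
    "cinnamon-sugar classic",
    "cinnamon-sugar classic",
    "cinnamon-sugar classic",
    "brown-butter variant",
    "pumpkin-spice variant",
]

def most_frequent_answer(examples):
    """Return the variant that appears most often in the examples."""
    counts = Counter(examples)
    answer, _ = counts.most_common(1)[0]
    return answer

print(most_frequent_answer(training_recipes))
```

The less common variants are still in the data, but a system that favors the most frequent pattern rarely surfaces them.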
The Issue With AI-Generated Content
- The web is becoming increasingly saturated with work produced by algorithms.
- If later AIs are trained on this output, they learn patterns that earlier AIs produced and begin to treat those patterns as more common than they truly are.
Illustration with Dogs

Image Courtesy – Nature.com
- AI-generated images influence how an AI “thinks”: a model trained on them learns what the generated pictures show, not what the real world looks like.
- For example, if many (though not really all) dogs in web search results look like golden retrievers because people keep posting AI-generated pictures, a model trained on those results will come to picture most dogs as golden retrievers, even though real photographs show many other breeds.
- Over time, these distortions add up and make the AI worse at understanding dogs as they really exist.
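The dog example can be sketched as a small simulation. This is a simplified model under stated assumptions, not the method used in the research: each "generation" only sees a limited sample of images drawn from the previous generation's output and re-estimates breed frequencies from that sample alone. The breed shares, sample size, and function names are all invented for illustration.

```python
import random
from collections import Counter

def next_generation(probs, n_images, rng):
    """Sample breed labels from the current model, then re-estimate
    breed frequencies from that sample alone."""
    breeds = list(probs)
    weights = [probs[b] for b in breeds]
    sample = rng.choices(breeds, weights=weights, k=n_images)
    counts = Counter(sample)
    return {b: counts[b] / n_images for b in breeds}

def simulate(initial_probs, generations=30, n_images=50, seed=0):
    """Return the breed distribution at every generation, starting with the real one."""
    rng = random.Random(seed)
    history = [dict(initial_probs)]
    for _ in range(generations):
        history.append(next_generation(history[-1], n_images, rng))
    return history

# Hypothetical starting mix of real dog photos (made-up numbers).
real_dogs = {
    "golden retriever": 0.50,
    "labrador": 0.30,
    "dalmatian": 0.15,
    "otterhound": 0.05,
}

history = simulate(real_dogs)
for gen in (0, 10, 20, 30):
    surviving = [breed for breed, p in history[gen].items() if p > 0]
    print(f"generation {gen}: {surviving}")
```

In this toy setup, a breed whose share drops to zero can never return, because later generations have no examples of it to learn from. Rare breeds are the most likely to be lost this way, which mirrors the loss of variety described above.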
The Vicious Cycle
- AI-generated content teaches AI models that the world looks the way it appears in recycled, previously generated data.
- This creates a downward spiral, called model collapse, in which each generation of AI becomes less accurate and its output noisier.
Why It Matters
- All of these distortions may lead to future AIs that are unreliable.
- Good, broad human-generated data is important to the functioning of AIs.
Possible Solutions
- Model collapse stems from training AIs on AI-generated data, and it would be a damaging outcome. Scientists say one solution could be to mark AI-generated content so that it can be identified.
- To ensure AI quality, companies need to preserve and share genuine human-generated data.
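One way to picture the marking idea is a filter that drops flagged items before training. The sketch below assumes every item carries a trustworthy `ai_generated` flag, which is a strong assumption: the `Sample` class, the flag, and the example texts are hypothetical, and reliably labeling AI output at web scale remains an open problem.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    text: str
    ai_generated: bool  # hypothetical provenance flag; real data rarely carries one

def keep_human_data(samples):
    """Keep only the samples that are not marked as AI-generated."""
    return [s for s in samples if not s.ai_generated]

corpus = [
    Sample("Grandma's snickerdoodle recipe", ai_generated=False),
    Sample("Top 10 cookie recipes (auto-generated)", ai_generated=True),
    Sample("Photo caption written by a dog owner", ai_generated=False),
]

training_set = keep_human_data(corpus)
print(len(training_set), "of", len(corpus), "samples kept")
```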
Conclusion
- The concern is that AI tools trained largely on synthetic data might eventually be able to do little more than reproduce that synthetic data, a result of so-called “model collapse”.
- Carefully prepared AI training data should be protected from this bias to ensure the quality of the output.