ChatGPT and generative AI have polluted the internet — and may have broken themselves
Researchers warn that future AI could collapse by learning from its own low-quality output instead of authentic human knowledge.

The explosion of generative AI tools like ChatGPT has flooded the internet with low-quality, AI-generated content, making it harder for future models to learn from authentic human knowledge.
As AI continues to train on this increasingly polluted data, a feedback loop forms: models imitate content that was itself machine-generated, leading to a steady drop in originality and usefulness. Researchers refer to this worrying trend as ‘model collapse’.
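The feedback loop can be illustrated with a deliberately simplified simulation (an assumption for illustration, not the researchers' own experiment): fit a simple statistical model to data, then repeatedly refit it to samples drawn from the previous fit, standing in for ‘training on AI output’. The distribution's spread, a crude proxy for diversity, steadily shrinks across generations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: the "human" data distribution (a toy stand-in).
mu, sigma = 0.0, 1.0
n = 20           # samples available per generation
generations = 200

stds = [sigma]
for _ in range(generations):
    # Each new "model" trains only on the previous model's output...
    samples = rng.normal(mu, sigma, size=n)
    # ...and refits its parameters to that synthetic data.
    mu, sigma = samples.mean(), samples.std()
    stds.append(sigma)

# The spread tends to shrink generation after generation:
# diversity is lost even though no step looks catastrophic on its own.
print(f"initial std: {stds[0]:.3f}, final std: {stds[-1]:.3f}")
```

The shrinkage happens because each refit can only capture what the previous generation actually sampled, so rare tail behaviour is progressively forgotten, which mirrors the loss of originality the article describes.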
To illustrate the risk, researchers compare clean pre-AI data to ‘low-background steel’ — a rare kind of steel produced before the first nuclear tests in 1945, which remains vital for specific medical and scientific uses.
Just as modern steel became contaminated by radioactive fallout, modern data is being tainted by artificial content. Cambridge researcher Maurice Chiodo notes that pre-2022 data is now seen as ‘safe, fine, clean’, while everything after is considered ‘dirty’.
A key concern is that techniques like retrieval-augmented generation, which allow AI to pull real-time data from the internet, risk spreading even more flawed content. Some research already suggests that retrieval-augmented generation can lead to more ‘unsafe’ outputs.
If developers rely on such polluted data, scaling models by adding more data becomes far less effective, potentially hitting a wall in progress.
Chiodo argues that future AI development could be severely limited without a clean data reserve. He and his colleagues urge the introduction of clear labelling and tighter controls on AI content.
However, industry resistance to regulation might make meaningful reform difficult, raising doubts about whether the pollution can be reversed.
Would you like to learn more about AI, tech and digital diplomacy? If so, ask our Diplo chatbot!