AI models face new test on safeguarding human well-being
Researchers found that most AI models fail when prompted to ignore humane principles and well-being, with two-thirds shifting into harmful behaviour under simple adversarial instructions.
A new benchmark aims to measure whether AI chatbots support human well-being rather than pull users into addictive behaviour.
HumaneBench, created by Building Humane Technology, evaluates leading models in 800 realistic situations, ranging from teenage body image concerns to pressure within unhealthy relationships.
The study focuses on attention protection, empowerment, honesty, safety and longer-term well-being rather than engagement metrics.
Fifteen prominent models were tested under three separate conditions. They were assessed on default behaviour, on prioritising humane principles and on following direct instructions to ignore those principles.
Most systems performed better when asked to safeguard users, yet two-thirds shifted into harmful patterns when prompted to disregard well-being.
Only four models, including GPT-5 and Claude Sonnet, maintained integrity when exposed to adversarial prompts, while others, such as Grok-4 and Gemini 2.0 Flash, recorded significant deterioration.
Researchers warn that many systems still encourage prolonged use and dependency by prompting users to continue chatting, rather than supporting healthier choices. Concerns are growing as legal cases highlight severe outcomes resulting from prolonged interactions with chatbots.
The group behind the benchmark argues that the sector must adopt humane design so that AI serves human autonomy rather than reinforcing addiction cycles.
Would you like to learn more about AI, tech and digital diplomacy? If so, ask our Diplo chatbot!
