New ChatGPT model reduces unsafe replies by up to 80%
ChatGPT now encourages real-world connection and offers grounding steps for distress, while ongoing evaluation will refine safeguards and taxonomies.
OpenAI has updated ChatGPT’s default model after working with more than 170 mental health clinicians to help the system better spot distress, de-escalate conversations and point users to real-world support.
The update routes sensitive exchanges to safer models, expands access to crisis hotlines, and adds gentle prompts to take breaks, with the aim of reducing harmful responses rather than simply generating more content.
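The article does not describe the routing mechanism itself. As a rough sketch of the general pattern, a hypothetical system could classify each incoming message for distress signals and hand flagged turns to a more conservative model, attaching hotline information to the reply. Every name below (classify_message, call_model, the labels and keyword list) is an illustrative stand-in, not OpenAI's implementation.

```python
# Hypothetical sketch of sensitivity-based routing. All names, labels, and
# keywords here are illustrative stand-ins, not OpenAI's actual system.

SENSITIVE_LABELS = {"self_harm", "acute_distress", "psychosis_mania"}

def classify_message(text: str) -> set[str]:
    """Toy classifier: flag messages containing crisis phrases.
    A real system would use a trained classifier, not keyword matching."""
    keywords = {"hurt myself": "self_harm", "can't go on": "acute_distress"}
    return {label for phrase, label in keywords.items() if phrase in text.lower()}

def call_model(model: str, text: str) -> str:
    """Stub standing in for an actual model API call."""
    return f"[{model} reply to: {text!r}]"

def route(text: str) -> str:
    """Send flagged conversations to a safer, more conservative model
    and append crisis-hotline information to the response."""
    if classify_message(text) & SENSITIVE_LABELS:
        reply = call_model("safer-model", text)
        return reply + "\n\nIf you are in crisis, help is available: call or text 988 (US)."
    return call_model("default-model", text)

print(route("I feel like I might hurt myself"))
```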
Measured improvements are significant across three priority areas: severe mental health symptoms such as psychosis and mania, self-harm and suicide, and unhealthy emotional reliance on AI.
OpenAI reports that undesired responses fell by 65 to 80 percent in production traffic, and that independent clinician reviews show significant gains over earlier models. At the same time, rare but high-risk scenarios remain a focus for further testing.
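For scale, the reported figures are relative reductions, not percentage-point drops. A minimal worked example, using made-up base rates rather than OpenAI's unpublished ones:

```python
# Illustrative numbers only; OpenAI has not published these base rates.
rate_before = 0.010   # hypothetical: 1.0% of sampled responses judged undesired pre-update
rate_after = 0.0025   # hypothetical: 0.25% post-update

relative_reduction = (rate_before - rate_after) / rate_before
print(f"Relative reduction: {relative_reduction:.0%}")  # 75%, within the reported 65-80% band
```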
The company used a five-step process to shape the changes: define harms, measure them, validate approaches with experts, mitigate risks through post-training and product updates, and keep iterating.
Evaluations combine estimates drawn from real-world traffic with structured adversarial tests; stronger safeguards are in place now, and further refinements are planned as understanding and measurement methods evolve.
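A minimal sketch of how those two evaluation sources might be combined, assuming some grading rubric exists; the grader, sample data, and function names below are hypothetical:

```python
# Hypothetical evaluation harness; the grader and datasets are stand-ins for
# the clinician-validated rubrics and test suites the article alludes to.

def grade_undesired(conversation: str) -> bool:
    """Stub grader: True if the response is judged undesired.
    In practice this would be a validated rubric or expert review."""
    return "undesired" in conversation

def prevalence_estimate(traffic_sample: list[str]) -> float:
    """Estimate the undesired-response rate from a sample of production traffic."""
    return sum(grade_undesired(c) for c in traffic_sample) / len(traffic_sample)

def adversarial_pass_rate(test_prompts: list[str], respond) -> float:
    """Run a fixed suite of structured adversarial prompts and measure how
    often the model's response is judged acceptable."""
    return sum(not grade_undesired(respond(p)) for p in test_prompts) / len(test_prompts)

sample = ["ok reply", "undesired reply", "ok reply", "ok reply"]
print(f"Prevalence: {prevalence_estimate(sample):.1%}")                           # 25.0%
print(f"Pass rate: {adversarial_pass_rate(['p1', 'p2'], lambda p: 'safe'):.0%}")  # 100%
```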
Would you like to learn more about AI, tech and digital diplomacy? If so, ask our Diplo chatbot!
