OpenAI unveils new gpt-oss-safeguard models for adaptive content safety
Developers can apply evolving safety policies directly to messages and reviews using gpt-oss-safeguard, providing flexible moderation and detailed explanations of the model’s decision-making process.
 
Yesterday, OpenAI launched gpt-oss-safeguard, a pair of open-weight reasoning models designed to classify content according to developer-specified safety policies.
Available in 120b and 20b parameter sizes, the models let developers apply and revise policies at inference time instead of relying on pre-trained classifiers.
They produce explanations of their reasoning, making policy enforcement transparent and adaptable. The models are downloadable under an Apache 2.0 licence, encouraging experimentation and modification.
The models are particularly well suited to situations where potential risks evolve quickly, training data is limited, or nuanced judgements are required.
Unlike traditional classifiers that infer policies from pre-labelled data, gpt-oss-safeguard interprets developer-provided policies directly, enabling more precise and flexible moderation.
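As a rough illustration of what policy-as-prompt classification could look like, the sketch below assumes the 20b model is served locally behind an OpenAI-compatible endpoint (for example via an open-source inference server); the endpoint URL, model name and policy text are illustrative placeholders, not details confirmed by OpenAI's release.

```python
# Minimal sketch: classify content against a developer-written policy at inference time.
# Assumes gpt-oss-safeguard-20b is served behind an OpenAI-compatible API
# (e.g. a local inference server at http://localhost:8000/v1); names are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

POLICY = """You are a content safety classifier.
Label the user-provided content as ALLOW, REVIEW, or BLOCK according to this policy:
- BLOCK: instructions that facilitate real-world harm.
- REVIEW: borderline or ambiguous content needing human judgement.
- ALLOW: everything else.
Explain your reasoning briefly, then give the label on the final line."""

def classify(content: str) -> str:
    """Send the policy and the content to the model; return its reasoning and label."""
    response = client.chat.completions.create(
        model="gpt-oss-safeguard-20b",  # placeholder model identifier
        messages=[
            {"role": "system", "content": POLICY},  # the policy travels with every request
            {"role": "user", "content": content},   # the message, review, or chat turn to classify
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(classify("Here is how to pick the lock on a neighbour's door..."))
```

Because the policy is ordinary text in the request, revising it is a matter of editing the prompt rather than relabelling data and retraining a classifier.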
The models have been tested internally and externally, showing competitive performance against OpenAI’s own Safety Reasoner and prior reasoning models. They can also support non-safety tasks, such as custom content labelling, depending on the developer’s goals.
OpenAI developed these models alongside ROOST and other partners, building a community to improve open safety tools collaboratively.
While gpt-oss-safeguard is computationally intensive and may not always surpass classifiers trained on extensive datasets, it offers a dynamic approach to content moderation and risk assessment.
Developers can integrate the models into their systems to classify messages, reviews, or chat content with transparent reasoning instead of static rule sets.
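If the policy prompt asks the model to end its reply with a small JSON verdict (an assumed convention a developer might choose, not a format prescribed by OpenAI), downstream systems can act on the label while keeping the explanation for audit logs. A hedged sketch:

```python
# Sketch of turning a model reply into a moderation decision, assuming the policy
# prompt asked for a final JSON object such as {"label": "BLOCK", "rationale": "..."}.
# The format is an illustrative convention, not one mandated by gpt-oss-safeguard.
import json
import re
from dataclasses import dataclass

@dataclass
class Verdict:
    label: str       # e.g. ALLOW / REVIEW / BLOCK, as defined by the developer's policy
    rationale: str   # the model's explanation, useful for audit trails and appeals

def parse_verdict(reply: str) -> Verdict:
    """Extract the last JSON object from the model's reply; fall back to REVIEW."""
    for candidate in reversed(re.findall(r"\{.*?\}", reply, flags=re.DOTALL)):
        try:
            data = json.loads(candidate)
            return Verdict(label=data.get("label", "REVIEW"),
                           rationale=data.get("rationale", reply))
        except json.JSONDecodeError:
            continue
    # If no parseable verdict is found, route to human review rather than auto-allow.
    return Verdict(label="REVIEW", rationale=reply)

print(parse_verdict('The post targets a protected group. '
                    '{"label": "BLOCK", "rationale": "Hate speech under section 2."}'))
```

Treating unparseable replies as REVIEW is a conservative default; a team could instead re-prompt the model or fail closed, depending on the platform's risk profile.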

