OpenAI strengthened ChatGPT Atlas with new protections against prompt injection attacks

The latest Atlas rollout builds defence directly into the model, so prompt injection attempts hidden inside webpages, documents or emails are detected instead of silently executed.

OpenAI upgraded ChatGPT Atlas security after developing automated red-team systems that uncover sophisticated prompt injection strategies, allowing the AI agent to resist powerful, multi-step adversarial attacks.

Protecting AI agents from manipulation has become a top priority for OpenAI after rolling out a major security upgrade to ChatGPT Atlas.

The browser-based agent now includes stronger safeguards against prompt injection attacks, where hidden instructions inside emails, documents or webpages attempt to redirect the agent’s behaviour instead of following the user’s commands.

Prompt injection poses a unique risk because Atlas can carry out actions that a person would normally perform inside a browser. A malicious email or webpage could attempt to trigger data exposure, unauthorised transactions or file deletion.

Criminals exploit the fact that agents process large volumes of content across an almost unlimited online surface.

OpenAI has developed an automated red-team framework that uses reinforcement learning to simulate sophisticated attackers.

When fresh attack patterns are discovered, the models behind Atlas are retrained so that resistance is built into the agent rather than added afterwards. Monitoring and safety controls are also updated using real attack traces.

These new protections are already live for all Atlas users. OpenAI advises people to limit logged-in access where possible, check confirmation prompts carefully and give agents well-scoped tasks instead of broad instructions.

The company argues that proactive defence is essential as agentic AI becomes more capable and widely deployed.

Would you like to learn more about AI, tech and digital diplomacy? If so, ask our Diplo chatbot!