Anthropic outlines AI agent workflows for scientific computing
A new post from Anthropic outlines how AI agents can support scientific computing through testing, memory, and structured workflows.
Anthropic has published a post describing how AI agents can be used in multi-day coding workflows for well-scoped, measurable scientific computing tasks that do not require constant human supervision. In the article, Anthropic researcher Siddharth Mishra-Sharma explains how techniques such as progress files, test oracles, and orchestration loops can be used to manage long-running software work.
Mishra-Sharma writes that many scientists still use AI agents in a tightly managed conversational loop, while newer models make it possible to assign high-level goals and let agents work more autonomously over longer periods. He says this approach can be useful for tasks such as reimplementing numerical solvers, converting legacy scientific software, and debugging large codebases against reference implementations.
As a case study, the Anthropic post describes using Claude Opus 4.6 to implement a differentiable cosmological Boltzmann solver in JAX. Boltzmann solvers such as CLASS and CAMB are used in cosmology to model the Cosmic Microwave Background and support the analysis of survey data. According to the post, a differentiable implementation supports gradient-based inference methods through automatic differentiation and can run efficiently on accelerators such as GPUs.
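The property the post highlights can be seen in a toy example. The function below is purely illustrative (it is not the solver from the post, and the formula is invented for this sketch), but it shows what "differentiable" buys: JAX returns exact gradients of outputs with respect to physical parameters, which is what gradient-based inference methods need.

```python
import jax
import jax.numpy as jnp

# Toy stand-in for a differentiable forward model: a damped oscillation
# loosely echoing acoustic peaks. Illustrative only -- not the real solver.
def toy_spectrum(omega_m, ell):
    return jnp.exp(-ell / 1000.0) * jnp.cos(jnp.sqrt(omega_m) * ell / 100.0)

ells = jnp.arange(2, 2000, dtype=jnp.float32)

# A scalar summary of the model output, as a function of one parameter.
def summary(omega_m):
    return jnp.sum(toy_spectrum(omega_m, ells) ** 2)

# Exact gradient via automatic differentiation -- usable directly in
# gradient-based inference (e.g. Hamiltonian Monte Carlo or variational fits).
grad_fn = jax.grad(summary)
print(grad_fn(0.3))
```

The same `jax.grad` call would apply to a full solver pipeline, which is what makes a JAX implementation attractive for inference, at the cost of the reimplementation effort the post describes.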
The post says the project required a different workflow from Anthropic’s earlier C compiler experiment because a Boltzmann solver is a tightly coupled numerical pipeline in which small errors can affect downstream outputs. Rather than relying mainly on parallel agents, Mishra-Sharma writes that this kind of task may be better suited to a single agent working sequentially, while using subagents when needed and comparing results against a reference implementation.
To manage long-running work, the article recommends keeping project instructions in a root-level ‘CLAUDE.md’ file and maintaining a ‘CHANGELOG.md’ file as portable long-term memory. It also highlights the importance of a test oracle, such as a reference implementation or existing test suite, so that AI agents can measure whether they are making progress and avoid repeating failed approaches.
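A minimal sketch of such an oracle check is shown below. The function and the 1% tolerance are illustrative assumptions, not Anthropic's actual test harness; the idea is that scoring agreement with a reference output gives an agent a graded progress signal rather than a bare pass/fail.

```python
import numpy as np

def fraction_within_tolerance(candidate, reference, rtol=1e-2):
    """Fraction of points agreeing with the reference to within rtol.

    A scalar score like this lets a long-running agent measure whether
    a change moved it closer to the reference implementation.
    """
    candidate = np.asarray(candidate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    rel_err = np.abs(candidate - reference) / np.maximum(np.abs(reference), 1e-30)
    return float(np.mean(rel_err <= rtol))

# Example: a candidate matching the reference to ~0.5% everywhere
ref = np.linspace(1.0, 2.0, 100)
cand = ref * 1.005
print(fraction_within_tolerance(cand, ref))  # 1.0 (all points within 1%)
```

In practice `reference` would be output dumped from an oracle such as CLASS, and the score could be logged to the changelog file after each unit of work.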
The Anthropic post also presents Git as a coordination tool, recommending that the agent commit and push after every meaningful unit of work and run tests before each commit. For execution, Mishra-Sharma describes running Claude Code inside a tmux session on an HPC cluster using the SLURM scheduler, allowing the agent to continue working across multiple sessions with periodic human check-ins.
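One plausible arrangement of the pieces the post names (SLURM, tmux, a long-running Claude Code session) is sketched below. The job name, resource flags, and the exact agent invocation are assumptions for illustration, not Anthropic's published configuration.

```shell
#!/bin/bash
#SBATCH --job-name=boltzmann-agent
#SBATCH --time=24:00:00
#SBATCH --gres=gpu:1

# Launch the agent inside a detached tmux session on the compute node,
# so the session survives between periodic human check-ins. The prompt
# encodes the Git discipline described in the post: test, commit, push
# after every meaningful unit of work.
tmux new-session -d -s agent \
  'claude -p "Continue the task described in CLAUDE.md. Run the test suite before each commit, and commit and push after every meaningful unit of work."'
```

With this setup, a human can reattach (`tmux attach -t agent`) during a check-in and review the commit history to see what the agent did while unattended.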
One orchestration method described in the article is the ‘Ralph loop,’ which prompts the agent to continue working until a stated success criterion is met. Mishra-Sharma writes that this kind of scaffolding can still help when models stop early or fail to complete all parts of a complex task, even as they become more capable overall.
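The pattern reduces to a small driver loop. The sketch below is a generic rendering of the idea, not Anthropic's implementation: `agent_cmd` and `check_cmd` are placeholders for, say, a Claude Code invocation and a test-suite run serving as the success criterion.

```python
import subprocess

def ralph_loop(agent_cmd, check_cmd, max_iters=50):
    """Re-invoke agent_cmd until check_cmd exits 0; return iterations used.

    Sketch of the 'Ralph loop' pattern: the loop, not the agent, decides
    when the stated success criterion has actually been met.
    """
    for i in range(1, max_iters + 1):
        subprocess.run(agent_cmd, shell=True)         # one agent pass
        done = subprocess.run(check_cmd, shell=True)  # success criterion
        if done.returncode == 0:
            return i
    return None  # criterion never met within the iteration budget
```

Usage might look like `ralph_loop('claude -p "$(cat PROMPT.md)"', 'pytest -q')`: even a capable model that stops early on one pass is simply re-prompted until the tests pass or the budget runs out.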
According to the post, Anthropic’s Claude worked on the solver project over several days and reached sub-percent agreement with the reference CLASS implementation across several outputs. At the same time, Mishra-Sharma notes that the system had limitations, including gaps in test coverage and mistakes that a domain expert might have identified more quickly. He writes that the resulting solver is ‘not production-grade’ and ‘doesn’t match the reference CLASS implementation to an acceptable accuracy in every regime’.
