AI shows promise in scientific research tasks

FrontierScience, a new benchmark from OpenAI, evaluates AI capabilities for expert-level scientific reasoning across physics, chemistry, and biology.

The benchmark measures Olympiad-style reasoning and real-world research tasks, showing how AI can aid complex scientific workflows. Generative AI models like GPT‑5 are now used for literature searches, complex proofs, and tasks that once took days or weeks.

The benchmark consists of two tracks: FrontierScience-Olympiad, with 100 questions created by international Olympiad medalists to assess constrained scientific reasoning, and FrontierScience-Research, with 60 multi-step research tasks developed by PhD scientists.

Initial evaluations show GPT‑5.2 scoring 77% on the Olympiad set and 25% on the Research set, outperforming other frontier models. The results show AI can support structured scientific reasoning but still struggles with open-ended problem solving and hypothesis generation.

FrontierScience also introduces a grading system tailored to each track. The Olympiad set uses short-answer verification, while the Research set employs a 10-point rubric assessing both final answers and intermediate reasoning steps.

Model-based grading enables scalable evaluation of complex tasks, though expert human oversight remains important. Analyses reveal that models still make logical, computational, and factual errors, particularly on niche scientific concepts.
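As a purely illustrative sketch of the two grading modes described above (the function names, normalization rule, and rubric criteria here are assumptions for illustration, not OpenAI's actual grader), the track-specific grading could be structured like this:

```python
# Hypothetical sketch of the two FrontierScience grading modes.
# All names and rubric criteria are illustrative assumptions,
# not OpenAI's actual implementation.

def grade_olympiad(model_answer: str, reference: str) -> bool:
    """Short-answer verification: normalize both answers and compare."""
    normalize = lambda s: s.strip().lower().replace(" ", "")
    return normalize(model_answer) == normalize(reference)

def grade_research(rubric_scores: dict[str, int], max_points: int = 10) -> float:
    """Rubric grading: sum per-criterion points (criteria cover both
    intermediate reasoning steps and the final answer), capped at the
    10-point maximum, and return a fraction of the total."""
    total = sum(rubric_scores.values())
    return min(total, max_points) / max_points

# Example usage
print(grade_olympiad(" 42 J ", "42 j"))   # exact-match after normalization
print(grade_research({"literature_review": 3, "method": 3, "final_answer": 2}))
```

The split mirrors the benchmark's design: constrained Olympiad problems admit automatic exact-answer checks, while open-ended research tasks need partial credit for intermediate steps, which is what a point-based rubric provides.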

While FrontierScience does not capture every aspect of scientific work, it provides a high-resolution snapshot of AI performance on difficult, expert-level problems. OpenAI plans to refine the benchmark, extend it to new domains, and combine it with real-world tests to track AI’s impact on scientific discovery.

The ultimate measure of success remains the novel insights and discoveries AI can help generate for the scientific community.