AI systems have demonstrated growing capabilities in advanced mathematics, according to benchmark results published by the non-profit organisation First Proof.
The organisation evaluated four frontier AI systems, including ChatGPT 5.5 Pro, against ten unpublished research-level mathematical problems contributed by leading mathematicians.
Across the participating systems, seven of the ten problems received at least one solution that expert reviewers judged to be correct. One notable result involved a problem in stochastic partial differential equations, where an AI system produced a correct solution using an approach different from the human-developed proof, drawing praise from expert referees for its originality.
Despite the progress, significant limitations remain.
Three of the ten problems remained unsolved, including a metric geometry challenge on which none of the systems made meaningful progress. Reviewers also reported that AI systems handled routine mathematical reasoning effectively but continued to struggle with the most challenging conceptual and creative aspects of proof construction.
Why does it matter?
The benchmark offers one of the most demanding independent tests of AI performance in advanced mathematics, a field often viewed as a proxy for higher-level reasoning and scientific problem-solving. The results suggest that frontier AI systems are increasingly capable of contributing to specialised research tasks and, in some cases, generating approaches that differ from those developed by human experts.
At the same time, the findings highlight the limits of current AI systems. While they can assist with complex reasoning and formal problem-solving, they continue to struggle with the deepest conceptual challenges that often drive mathematical breakthroughs. This suggests that AI may increasingly serve as a research assistant and discovery tool, while human expertise remains essential for guiding and validating scientific advances.
