A new round of the ORCA (Omni Research on Calculation in AI) benchmark reveals significant progress in how leading AI chatbots handle real-world mathematical problems, while also highlighting persistent limitations in reliability and consistency.
The latest results show Google’s Gemini 3 Flash moving clearly ahead of competing systems, correctly answering nearly three-quarters of the 500 practical questions used in the benchmark.
Our readers may recall that this platform previously analysed the first edition of the ORCA benchmark, examining how AI chatbots performed on everyday quantitative tasks rather than purely academic problems. That earlier analysis already showed notable gaps between systems and raised questions about the reliability of AI models for the calculations people might encounter in daily life.
The second benchmark compares four widely accessible models: ChatGPT-5.2, Gemini 3 Flash, Grok-4.1 and DeepSeek V3.2. Gemini recorded the largest improvement, decisively outpacing the others. ChatGPT and DeepSeek posted smaller but steady gains, while Grok’s results declined slightly in several subject areas.
Performance improvements were uneven across domains, with Gemini showing particularly strong gains in fields such as biology, chemistry, physics and health-related calculations.
Closer examination of the errors reveals why AI still struggles with mathematical accuracy. Calculation mistakes have increased as a share of total errors, while rounding and formatting problems have decreased: as the superficial presentation issues are engineered away, genuine arithmetic failures remain the dominant source of error.
Researchers explain that large language models do not actually compute numbers in the same way that calculators do. Instead, they predict likely sequences of words and numbers, which can lead to small shortcuts during multi-step reasoning that eventually produce incorrect results.
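The contrast can be sketched in a few lines of Python. The toy model below is not a real language model; it is a hypothetical stand-in that samples from an invented probability distribution over plausible-looking answers, which loosely mirrors how next-token prediction can favour a near miss over the exact result.

```python
import random

def calculator(a: int, b: int) -> int:
    # A calculator computes: identical inputs always yield identical output.
    return a * b

def toy_language_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM: instead of computing, it samples an
    # answer from a probability distribution. The candidates and weights
    # below are invented purely for illustration.
    candidates = {"56088": 0.90, "56098": 0.06, "56188": 0.04}
    answers, weights = zip(*candidates.items())
    return random.choices(answers, weights=weights)[0]

print(calculator(123, 456))                      # always 56088
print(toy_language_model("What is 123 x 456?"))  # usually 56088, sometimes a near miss
```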
The benchmark also highlights another challenge: instability. The same question can produce different answers when asked multiple times, even when the model initially responded correctly. Such variation reflects the probabilistic nature of AI systems.
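The same hypothetical stand-in illustrates that instability. Because the answer is sampled rather than computed, asking the identical question repeatedly can produce a different tally each run; the distribution is, again, invented for illustration only.

```python
import random
from collections import Counter

def toy_language_model(prompt: str) -> str:
    # Same hypothetical stand-in as above: it samples an answer rather than
    # computing one, so repeated runs can disagree.
    candidates = {"56088": 0.90, "56098": 0.06, "56188": 0.04}
    answers, weights = zip(*candidates.items())
    return random.choices(answers, weights=weights)[0]

# Ask the identical question ten times and tally the replies.
tally = Counter(toy_language_model("What is 123 x 456?") for _ in range(10))
print(tally)  # e.g. Counter({'56088': 9, '56188': 1}); the mix varies run to run
```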
As a result, the benchmark concludes that AI chatbots can assist with calculations but cannot yet match the consistency of traditional calculators, which always return the same answer for the same input.
