31 Dec 2025

Best AI chatbot for maths accuracy revealed in new benchmark

Gemini ranks highest for everyday maths accuracy, yet new research shows AI chatbots still give the wrong answer roughly 40 percent of the time across 500 tested problems.

AI tools are increasingly used for simple everyday calculations, yet a new benchmark suggests accuracy remains unreliable.

The ORCA study tested five major chatbots across 500 real-world maths prompts and found that users still face roughly a 40 percent chance of receiving the wrong answer.

Gemini from Google recorded the highest score at 63 percent, with xAI’s Grok almost level at 62.8 percent. DeepSeek followed with 52 percent, while ChatGPT scored 49.4 percent, and Claude placed last at 45.2 percent.

Performance varied sharply across subjects: maths and conversion tasks produced the best results, while physics questions dragged scores down to an average accuracy below 40 percent.

Researchers identified most errors as sloppy calculations or rounding mistakes, rather than deeper failures to understand the problem. Finance and economics questions highlighted the widest gaps between the models, while DeepSeek struggled most in biology and chemistry, with barely one correct answer in ten.

Users are advised to double-check results whenever accuracy is crucial. A calculator or a verified source remains a safer choice than relying entirely on an AI chatbot for numerical certainty.
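As an illustration of what such a double-check might look like, the short sketch below recomputes a figure independently and compares it with the number a chatbot returned. It is not part of the ORCA study: the function name `matches_reference`, the compound-interest prompt, the two sample answers and the tolerance of 0.01 percent are all assumptions chosen for the example.

```python
import math


def matches_reference(chatbot_answer: float, reference: float,
                      rel_tol: float = 1e-4) -> bool:
    """Return True if the chatbot's figure agrees with the reference.

    rel_tol is a hypothetical tolerance (0.01 percent) that accepts
    ordinary rounding but rejects a genuinely wrong result.
    """
    return math.isclose(chatbot_answer, reference, rel_tol=rel_tol)


# Hypothetical prompt: what is 1,000 worth after 10 years
# at 5 percent annual compound interest?
reference = 1000 * (1 + 0.05) ** 10  # computed independently, about 1628.89

# A correctly rounded answer passes the check.
print(matches_reference(1628.89, reference))

# An answer with a slipped digit fails the check.
print(matches_reference(1638.89, reference))
```

A relative tolerance is used so that an answer rounded to two decimal places still counts as correct, while a calculation slip of the kind the researchers describe is flagged.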
