Meta’s AI benchmarking practices under scrutiny
Meta denies allegations it rigged Llama 4 benchmarks, highlighting broader issues with how AI models are evaluated in real-world conditions.

Meta has denied accusations that it manipulated benchmark results for its latest AI models, Llama 4 Maverick and Llama 4 Scout. The controversy began after a social media post alleged the company had trained on benchmark test sets and submitted an unreleased model variant to achieve higher scores.
Ahmad Al-Dahle, Meta’s VP of generative AI, called the claims ‘simply not true’ and attributed the inconsistent model performance to differing cloud implementations. He said the models were released as soon as they were ready and are still being adjusted.
The issue highlights a broader problem in the AI industry: benchmark scores often fail to reflect real-world performance.
Other AI leaders, including Google and OpenAI, have faced similar scrutiny after models with high benchmark results struggled with reasoning tasks and behaved unpredictably outside controlled tests.
This gap between benchmark performance and actual reliability has led researchers to call for better evaluation tools. Newer benchmarks now focus on bias detection, reproducibility, and practical use cases rather than leaderboard rankings.
Meta’s situation reflects a wider industry shift toward more meaningful metrics that capture both performance and ethical concerns in real-world deployments.