AI and LLMs struggle with historical accuracy in advanced tests
Leading AI systems perform poorly on nuanced historical exams, achieving only 46% accuracy at best.

According to a recent study, AI models show clear limits when tackling high-level historical questions. Researchers tested three leading large language models (LLMs) — GPT-4, Llama, and Gemini — against a newly developed benchmark, Hist-LLM, built on the Seshat Global History Databank. The results were disappointing: the best performer, GPT-4 Turbo, achieved only 46% accuracy, not far above random guessing.
Researchers from Austria’s Complexity Science Hub presented the findings at the NeurIPS conference last month. Co-author Maria del Rio-Chanona highlighted that while LLMs excel at basic facts, they struggle with nuanced, PhD-level historical questions. Errors included incorrect claims about ancient Egypt’s military and armour development, often because the models extrapolated from prominent but irrelevant data.
Biases in training data also emerged, with models underperforming on questions related to underrepresented regions like sub-Saharan Africa. Lead researcher Peter Turchin acknowledged these shortcomings but emphasised the potential of LLMs to support historians with future improvements.
Efforts are underway to refine the benchmark by incorporating more diverse data and crafting more complex questions. Despite the current gaps, the researchers remain optimistic about AI’s capacity to assist in historical research.