AI isn’t very good at history, new paper finds

A recent study has found that large language models (LLMs) like OpenAI’s GPT-4, Meta’s Llama, and Google’s Gemini struggle with high-level historical questions, achieving only about 46% accuracy on a new benchmark called Hist-LLM. This benchmark, based on the Seshat Global History Databank, was presented at the NeurIPS conference and highlights the models' limitations in understanding nuanced historical details. Researchers, including Maria del Rio-Chanona from University College London, suggest that LLMs excel at tasks like coding but falter in history due to their tendency to extrapolate from prominent historical data, making it difficult to retrieve obscure information.

The study also found that the models performed notably worse on questions about certain regions, such as sub-Saharan Africa, which suggests potential biases in their training data. Despite these shortcomings, the researchers, led by Peter Turchin from the Complexity Science Hub, remain optimistic about the potential for LLMs to assist historians. They are working on refining the benchmark by incorporating more diverse data and complex questions. The study underscores the need for improvement in LLMs but also highlights their potential utility in historical research.

Key takeaways:

LLMs like GPT-4, Llama, and Gemini struggle with high-level historical questions, achieving only about 46% accuracy on the Hist-LLM benchmark.
LLMs tend to extrapolate from prominent historical data, leading to inaccuracies in less well-known historical contexts.
Performance disparities were noted for certain regions, such as sub-Saharan Africa, indicating potential biases in LLM training data.
Researchers are optimistic about improving LLMs for historical research by refining benchmarks and incorporating more diverse data.

AI isn’t very good at history, new paper finds | TechCrunch
