The study also revealed potential biases in the models' training data, particularly in regions like sub-Saharan Africa, where performance was notably poorer. Despite these shortcomings, the researchers, led by Peter Turchin from the Complexity Science Hub, remain optimistic about the potential for LLMs to assist historians. They are working on refining the benchmark by incorporating more diverse data and complex questions. The study underscores the need for improvement in LLMs but also highlights their potential utility in historical research.
Key takeaways:
- LLMs like GPT-4, Llama, and Gemini struggle with high-level historical questions, achieving low accuracy on the Hist-LLM benchmark.
- LLMs tend to extrapolate from prominent historical data, leading to inaccuracies in less well-known historical contexts.
- Performance disparities were noted for certain regions, such as sub-Saharan Africa, indicating potential biases in LLM training data.
- Researchers are optimistic about improving LLMs for historical research by refining benchmarks and incorporating more diverse data.