Large language models, including GPT-4, struggle with advanced historical questions, reaching only 46% accuracy on a new benchmark, Hist-LLM. The shortfall is attributed to the models' tendency to extrapolate from prominent historical data and to potential biases in their training data, both of which limit them in nuanced historical inquiry. Researchers remain hopeful that refining the benchmark with more diverse data and more complex questions will help improve these models enough to aid historians.
Jan 19, 2025 - techcrunch.com