The article discusses how accurately Large Language Models (LLMs) can recall Bible scripture word for word. The author built a benchmark that evaluates various LLMs across six scenarios testing scripture recall. All tests were run with a temperature of 0 to prioritize accuracy over creativity. Results showed that larger models, such as Llama 405B, GPT-4o, and Claude Sonnet, performed well across all tests, accurately recalling verses and whole chapters. In contrast, smaller models often mixed up translations or hallucinated responses, while medium-sized models preserved the intent of the verses but sometimes paraphrased or combined translations.
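The article does not reproduce the benchmark's scoring code, but a minimal sketch of how word-for-word recall might be checked is below. The normalization rules (case and punctuation folding) are assumptions for illustration, not the author's actual methodology:

```python
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.lower().split())


def exact_recall(model_output: str, reference: str) -> bool:
    """True if the model reproduced the verse word for word,
    ignoring case and punctuation differences."""
    return normalize(model_output) == normalize(reference)


def word_accuracy(model_output: str, reference: str) -> float:
    """Fraction of reference words the model matched in position --
    a crude measure that penalizes paraphrases and blended translations."""
    out_words = normalize(model_output).split()
    ref_words = normalize(reference).split()
    matches = sum(1 for a, b in zip(out_words, ref_words) if a == b)
    return matches / max(len(ref_words), 1)


# Example against a KJV fragment of John 3:16:
kjv = "For God so loved the world, that he gave his only begotten Son"
print(exact_recall("for god so loved the world that he gave his only begotten son", kjv))  # True
print(word_accuracy("For God so loved the world", kjv))  # partial credit, < 1.0
```

A stricter benchmark could drop the normalization step entirely, since translations like the KJV have fixed punctuation; the scoring choice directly affects whether a paraphrase counts as a miss.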
The article concludes that larger models are the more reliable choice for textually accurate Bible verse recall, while smaller models may still be useful for discussions that reference scripture by Book/Chapter/Verse. For precise text, however, the author recommends consulting an actual Bible. Future improvements in smaller models may lift their performance on such benchmarks, but the author acknowledges the inherent limits of encoding this much information into smaller models. The full test results and methodology are available for review, and the author welcomes feedback and suggestions for additional tests.
Key takeaways:
LLMs often struggle with accurately quoting scripture due to their tendency to hallucinate responses.
Larger models like Llama 405B, GPT-4o, and Claude Sonnet perform better at recalling Bible verses accurately.
Smaller models frequently mix up translations or paraphrase verses, making them less reliable for precise scripture recall.
For accurate biblical text, use a larger model or, better, consult an actual Bible.