The study also explored the role of in-context exemplars using counterfactual demonstrations, which significantly reduced factual knowledge recall in large models. The researchers attributed this degradation to two factors: exemplars that contradict a model's known knowledge, and the number of such contradictory exemplars. They also found that fine-tuning LLaMA-7B on a model's known knowledge was beneficial and consistently outperformed fine-tuning on unknown or mixed knowledge. The researchers plan to make their benchmark publicly available.
Key takeaways:
- The study focuses on evaluating the factuality of outputs generated by Large Language Models (LLMs) and the factors that influence their ability to recall factual knowledge.
- A benchmark called FACT-BENCH was created to assess 31 models from 10 model families, revealing that pretraining-only models outperform instruction-tuned models and larger models outperform smaller ones.
- The use of in-context exemplars leads to a significant degradation of factual knowledge recall for large models, especially when the exemplars contradict a model's known knowledge.
- Fine-tuning on a model's known knowledge is beneficial and consistently outperforms fine-tuning on unknown and mixed knowledge.
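To make the counterfactual-exemplar setup concrete, here is a minimal sketch of how such few-shot prompts could be assembled. The function name, prompt format, and example facts are illustrative assumptions for exposition, not the paper's actual FACT-BENCH code.

```python
# Illustrative sketch (not from FACT-BENCH): build a few-shot recall prompt
# where each exemplar's answer can be swapped for a wrong one, contradicting
# the model's presumed known knowledge.

def build_prompt(exemplars, query, counterfactual=False):
    """Format (subject, question template, true answer, wrong answer)
    exemplars into a few-shot prompt ending with the target query.

    If `counterfactual` is True, each exemplar shows the wrong answer,
    mimicking the demonstrations that degraded recall in large models.
    """
    lines = []
    for subject, template, true_ans, wrong_ans in exemplars:
        answer = wrong_ans if counterfactual else true_ans
        lines.append(f"Q: {template.format(subject)}\nA: {answer}")
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

# Hypothetical exemplar facts for illustration.
exemplars = [
    ("France", "What is the capital of {}?", "Paris", "Berlin"),
    ("Japan", "Which currency is used in {}?", "Yen", "Peso"),
]

factual_prompt = build_prompt(exemplars, "What is the capital of Italy?")
counterfactual_prompt = build_prompt(
    exemplars, "What is the capital of Italy?", counterfactual=True
)
```

Per the study's findings, a model's answer to the final query would be compared across the two prompt variants; more counterfactual exemplars corresponded to a larger drop in recall accuracy.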