
Towards a Holistic Evaluation of LLMs on Factual Knowledge Recall

May 06, 2024 - news.bensbites.com
This study evaluates the factuality of outputs generated by large language models (LLMs) and the factors that affect their ability to recall factual knowledge. The researchers created FACT-BENCH, a benchmark spanning multiple domains, property types, answer types, and levels of knowledge popularity, and used it to assess 31 models from 10 model families. They found that instruction tuning hurts knowledge recall: pretraining-only models outperform their instruction-tuned counterparts. Larger models also outperform smaller ones across all model families, but even the best performance, from GPT-4, falls short of the upper bound.

The study also explored the role of in-context exemplars using counterfactual demonstrations, which significantly degraded factual knowledge recall in large models. The researchers attributed this degradation to exemplars that contradict a model's known knowledge, with the effect worsening as the number of such exemplars grows. They also found that fine-tuning LLaMA-7B on knowledge the model already knows is beneficial and consistently outperforms fine-tuning on unknown or mixed knowledge. The researchers plan to make their benchmark publicly available.
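To make the counterfactual-exemplar setup concrete, here is a minimal sketch of how such few-shot prompts might be assembled. The question/answer pairs and the prompt template are illustrative assumptions, not FACT-BENCH's actual data or format.

```python
# Sketch of building a factual-recall prompt with either factual or
# counterfactual in-context exemplars. In the counterfactual condition,
# each demonstration pairs the question with a deliberately wrong answer.

def build_prompt(exemplars, query, counterfactual=False):
    """Assemble a few-shot prompt; if counterfactual, substitute the
    false answer in every demonstration."""
    lines = []
    for ex in exemplars:
        answer = ex["false_answer"] if counterfactual else ex["true_answer"]
        lines.append(f"Q: {ex['question']}\nA: {answer}")
    lines.append(f"Q: {query}\nA:")  # model completes the final answer
    return "\n\n".join(lines)

# Toy exemplars (illustrative, not from the benchmark).
exemplars = [
    {"question": "What is the capital of France?",
     "true_answer": "Paris", "false_answer": "Lyon"},
    {"question": "Who wrote '1984'?",
     "true_answer": "George Orwell", "false_answer": "Aldous Huxley"},
]

factual = build_prompt(exemplars, "What is the capital of Japan?")
counter = build_prompt(exemplars, "What is the capital of Japan?",
                       counterfactual=True)
```

Comparing a model's accuracy on the final query under the `factual` versus `counter` prompts is one way to measure how much contradicting demonstrations disrupt recall.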

Key takeaways:

  • The study focuses on evaluating the factuality of outputs generated by Large Language Models (LLMs) and the factors that influence their ability to recall factual knowledge.
  • A benchmark called FACT-BENCH was created to assess 31 models from 10 model families, revealing that pretraining-only models outperform instruction-tuned models and larger models outperform smaller ones.
  • The use of in-context exemplars leads to a significant degradation of factual knowledge recall for large models, especially when the exemplars contradict a model's known knowledge.
  • Fine-tuning on a model's known knowledge is beneficial and consistently outperforms fine-tuning on unknown and mixed knowledge.
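The known-versus-unknown split behind the last takeaway can be sketched as follows. `model_answer` is a hypothetical stand-in for querying the model; a real pipeline would call the LLM and normalize its answers before comparison.

```python
# Hedged sketch: partition a fact set into "known" and "unknown" for a
# given model, the split on which the study compares fine-tuning.

def partition_facts(facts, model_answer):
    """A fact counts as 'known' if the model already answers it correctly."""
    known, unknown = [], []
    for fact in facts:
        predicted = model_answer(fact["question"])
        (known if predicted == fact["answer"] else unknown).append(fact)
    return known, unknown

# Illustrative facts and a toy model that only knows the first one.
facts = [
    {"question": "Capital of Italy?", "answer": "Rome"},
    {"question": "Capital of Australia?", "answer": "Canberra"},
]
toy_model = {"Capital of Italy?": "Rome",
             "Capital of Australia?": "Sydney"}

known, unknown = partition_facts(facts, lambda q: toy_model[q])
```

Fine-tuning would then proceed on `known` only, per the study's finding that training on already-known facts outperforms training on unknown or mixed sets.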
