The study also explored the role of in-context exemplars using counterfactual demonstrations, which significantly reduced factual knowledge recall in large models. The researchers attributed this degradation to two factors: exemplars that contradict a model's known knowledge, and the number of such contradictory exemplars. They also found that fine-tuning LLaMA-7B on a model's known knowledge was beneficial and consistently outperformed fine-tuning on unknown or mixed knowledge. The researchers plan to make their benchmark publicly available.
Key takeaways:
- The study focuses on evaluating the factuality of outputs generated by Large Language Models (LLMs) and the factors that influence their ability to recall factual knowledge.
- A benchmark called FACT-BENCH was created to assess 31 models from 10 model families, revealing that pretraining-only models outperform instruction-tuned models and larger models outperform smaller ones.
- The use of in-context exemplars leads to a significant degradation of factual knowledge recall for large models, especially when the exemplars contradict a model's known knowledge.
- Fine-tuning on a model's known knowledge is beneficial and consistently outperforms fine-tuning on unknown and mixed knowledge.
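To make the counterfactual-exemplar setup concrete, here is a minimal sketch of how such few-shot prompts could be assembled. The function name, prompt format, and example facts are illustrative assumptions for exposition, not the paper's actual FACT-BENCH code.

```python
# Illustrative sketch (not from FACT-BENCH): build a few-shot recall prompt
# where each exemplar's answer can be swapped for a wrong one, contradicting
# the model's presumed known knowledge.

def build_prompt(exemplars, query, counterfactual=False):
    """Format (subject, question template, true answer, wrong answer)
    exemplars into a few-shot prompt ending with the target query.

    If `counterfactual` is True, each exemplar shows the wrong answer,
    mimicking the demonstrations that degraded recall in large models.
    """
    lines = []
    for subject, template, true_ans, wrong_ans in exemplars:
        answer = wrong_ans if counterfactual else true_ans
        lines.append(f"Q: {template.format(subject)}\nA: {answer}")
    lines.append(f"Q: {query}\nA:")
    return "\n\n".join(lines)

# Hypothetical exemplar facts for illustration.
exemplars = [
    ("France", "What is the capital of {}?", "Paris", "Berlin"),
    ("Japan", "Which currency is used in {}?", "Yen", "Peso"),
]

factual_prompt = build_prompt(exemplars, "What is the capital of Italy?")
counterfactual_prompt = build_prompt(
    exemplars, "What is the capital of Italy?", counterfactual=True
)
```

Per the study's findings, a model's answer to the final query would be compared across the two prompt variants; more counterfactual exemplars corresponded to a larger drop in recall accuracy.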