FABLES: Evaluating faithfulness and content selection in book-length summarization

This paper presents the first large-scale human evaluation of faithfulness and content selection in LLM-generated summaries of book-length fiction. The study focuses on books published in 2023 or 2024 and employs annotators who had read each book in full. The researchers collected annotations on 3,158 claims made in LLM-generated summaries of 26 books, at a cost of $5.2K USD. Ranked by faithfulness, Claude-3-Opus significantly outperformed all other closed-source LLMs, while the open-source Mixtral performed on par with GPT-3.5-Turbo.

The analysis revealed that most unfaithful claims relate to events and character states, and that invalidating them generally requires indirect reasoning over the narrative. The researchers also found that LLM-based auto-raters were not reliable judges of faithfulness, particularly when it came to detecting unfaithful claims. The study therefore identifies the detection of unfaithful claims as an important future direction for summarization evaluation and long-context understanding. The paper also explores content selection errors in book-length summarization, identifying a systematic over-emphasis on events occurring towards the end of the book.

Key takeaways:

The paper conducts a large-scale human evaluation of faithfulness and content selection in LLM-generated summaries of fictional books.
The study uses a dataset of annotations on 3,158 claims made in LLM-generated summaries of 26 books, which helps to rank LLM summarizers based on faithfulness.
The analysis reveals that most unfaithful claims relate to events and character states, and they generally require indirect reasoning over the narrative to invalidate.
The paper also explores content selection errors in book-length summarization, identifying a systematic over-emphasis on events occurring towards the end of the book.
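
The faithfulness ranking mentioned in the takeaways can be illustrated with a minimal sketch. It assumes each annotated claim is reduced to a (model, label) pair and that a model's score is the fraction of its claims labeled faithful. The model names, the two-label scheme, and the rows below are hypothetical placeholders, not the paper's data or its exact scoring method.

```python
from collections import defaultdict

# Hypothetical claim-level annotations as (model, label) pairs.
# These rows are made-up placeholders, NOT data from the paper.
annotations = [
    ("model_a", "faithful"),
    ("model_a", "faithful"),
    ("model_a", "unfaithful"),
    ("model_b", "faithful"),
    ("model_b", "unfaithful"),
    ("model_b", "unfaithful"),
    ("model_b", "faithful"),
]


def faithfulness_by_model(rows):
    """Return {model: fraction of that model's claims labeled 'faithful'}."""
    counts = defaultdict(lambda: [0, 0])  # model -> [faithful, total]
    for model, label in rows:
        counts[model][1] += 1
        if label == "faithful":
            counts[model][0] += 1
    return {model: faithful / total for model, (faithful, total) in counts.items()}


def rank_models(rows):
    """Return (model, score) pairs sorted from most to least faithful."""
    scores = faithfulness_by_model(rows)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for model, score in rank_models(annotations):
        print(f"{model}: {score:.1%} of claims labeled faithful")
```

Scoring at the claim level rather than the summary level means a long summary with one unsupported statement is not penalized as heavily as a short summary that is mostly wrong.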
