The analysis revealed that most unfaithful claims concern events and character states, and that invalidating them typically requires indirect reasoning over the narrative. The researchers also found that LLM-based auto-raters are unreliable judges of faithfulness, particularly at detecting unfaithful claims, and they argue that detecting such claims is an important future direction for summarization evaluation and long-context understanding. Beyond faithfulness, the paper examines content selection errors in book-length summarization, identifying a systematic over-emphasis on events from the end of the book.
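To make the auto-rater setup concrete, here is a minimal sketch of claim-level verification, assuming a rater that receives a claim plus evidence passages and outputs a FAITHFUL/UNFAITHFUL label; all names, the prompt wording, and the recall metric below are illustrative assumptions, not the paper's actual prompts or models.

```python
from typing import Callable

# Hypothetical claim-verification auto-rater; the model call is abstracted
# as a callable so no specific LLM API is assumed.
PROMPT_TEMPLATE = (
    "You are verifying a claim made in a book summary.\n"
    "Evidence from the book:\n{evidence}\n\n"
    "Claim: {claim}\n\n"
    "Answer FAITHFUL if the evidence supports the claim, otherwise UNFAITHFUL."
)

def rate_claim(claim: str, evidence: str, llm: Callable[[str], str]) -> bool:
    """Return True if the auto-rater judges the claim faithful."""
    response = llm(PROMPT_TEMPLATE.format(evidence=evidence, claim=claim))
    # Check for UNFAITHFUL first, since "FAITHFUL" is a substring of it.
    return "UNFAITHFUL" not in response.upper()

def rater_recall_on_unfaithful(labeled_claims, llm) -> float:
    """Fraction of human-labeled unfaithful claims the rater also flags.

    `labeled_claims` is a list of (claim, evidence, human_is_faithful) tuples.
    A low value corresponds to the finding that auto-raters struggle to
    detect unfaithful claims.
    """
    unfaithful = [(c, e) for c, e, ok in labeled_claims if not ok]
    if not unfaithful:
        return float("nan")
    flagged = sum(1 for c, e in unfaithful if not rate_claim(c, e, llm))
    return flagged / len(unfaithful)
```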
Key takeaways:
- The paper conducts a large-scale human evaluation of faithfulness and content selection in summaries of fictional books generated by long-context large language models (LLMs).
- The study collects annotations on 3,158 claims made in LLM-generated summaries of 26 books, enabling a ranking of LLM summarizers by faithfulness (see the sketch after this list).
- The analysis reveals that most unfaithful claims relate to events and character states, and they generally require indirect reasoning over the narrative to invalidate.
- The paper also explores content selection errors in book-length summarization, identifying a systematic over-emphasis on events occurring towards the end of the book.
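As a rough illustration of how claim-level annotations can translate into a summarizer ranking, a faithfulness score can be computed as the fraction of a model's claims that human annotators marked as supported by the book. The record schema and scoring below are assumptions for illustration, not the paper's released data format or exact protocol.

```python
from collections import defaultdict

def rank_summarizers(annotations):
    """Rank summarizers by the fraction of their claims judged faithful.

    `annotations` is a list of dicts like
    {"model": "model-A", "claim": "...", "faithful": True};
    the schema is illustrative, not the paper's released format.
    """
    totals = defaultdict(int)
    faithful = defaultdict(int)
    for ann in annotations:
        totals[ann["model"]] += 1
        faithful[ann["model"]] += int(ann["faithful"])
    scores = {m: faithful[m] / totals[m] for m in totals}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage with made-up data
example = [
    {"model": "model-A", "claim": "…", "faithful": True},
    {"model": "model-A", "claim": "…", "faithful": False},
    {"model": "model-B", "claim": "…", "faithful": True},
]
print(rank_summarizers(example))  # [('model-B', 1.0), ('model-A', 0.5)]
```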