The study also found that the performance of finetuned LLMs on general benchmarks remains almost constant. However, finetuning LLMs on other baseline long-context augmentation data can encourage hallucination and cause performance drops. For instance, on TriviaQA, Mistral 7B finetuned on the synthetic data showed no performance drop, while the other baseline datasets caused drops ranging from 2.33% to 6.19%. The research underscores the potential of finetuning on synthetic data to enhance the performance of LLMs on longer-context tasks.
Key takeaways:
- Large Language Models (LLMs) have been found to struggle with accurately retrieving information and maintaining reasoning capabilities when processing long-context inputs.
- A finetuning approach using a synthetic dataset comprising numerical key-value retrieval tasks can significantly improve LLMs' information retrieval and reasoning capabilities in longer-context settings.
- Finetuning on synthetic data can lead to a transfer of skills from synthetic to real task evaluations, with significant improvements noted in models like GPT-3.5 Turbo and Mistral 7B.
- While finetuned LLMs' performance on general benchmarks remains almost constant, finetuning on other baseline long-context augmentation data can encourage hallucination and potentially lead to performance drops.
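To make the synthetic task concrete, here is a minimal sketch of what a numerical key-value retrieval example might look like. This is an illustration only: the function name, prompt wording, and key/value ranges are assumptions, not the paper's exact data format.

```python
import json
import random


def make_kv_retrieval_example(num_pairs=20, seed=0):
    """Build one synthetic key-value retrieval example (illustrative sketch).

    A dictionary of random numerical keys and values is serialized into the
    prompt, and the model is asked to return the value for one randomly
    chosen key. The paper's actual format may differ.
    """
    rng = random.Random(seed)
    # Sample distinct numerical keys, then assign each a random value.
    keys = rng.sample(range(10**6), num_pairs)
    kv = {str(k): rng.randrange(10**6) for k in keys}
    target_key = rng.choice(list(kv))
    prompt = (
        "Below is a JSON object of key-value pairs.\n"
        f"{json.dumps(kv)}\n"
        f'What is the value associated with key "{target_key}"? '
        "Answer with the number only."
    )
    return {"prompt": prompt, "answer": str(kv[target_key])}


example = make_kv_retrieval_example(num_pairs=5, seed=42)
```

Scaling `num_pairs` (or padding the dictionary with more pairs) lengthens the context, which is how such tasks probe retrieval at longer input lengths.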