
From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data

Jun 29, 2024 - news.bensbites.com
The article discusses a new approach to improve the information retrieval and reasoning capabilities of Large Language Models (LLMs) when processing long-context inputs. The researchers propose a finetuning method that uses a synthetic dataset composed of numerical key-value retrieval tasks. Experiments on models like GPT-3.5 Turbo and Mistral 7B show that this finetuning significantly enhances the LLMs' performance in longer-context settings. The finetuned models also demonstrate a transfer of skills from synthetic to real task evaluations, with an improvement of up to 10.5% on certain tasks.
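To make the idea concrete, here is a minimal sketch of what one synthetic numerical key-value retrieval sample might look like. The exact prompt format, key/value ranges, and number of pairs used in the paper are not given in this summary, so everything below (function name, 8-digit keys, dictionary-style layout) is an illustrative assumption:

```python
import random

def make_kv_retrieval_sample(num_pairs=50, seed=None):
    """Build one illustrative key-value retrieval example: a list of random
    numeric key-value pairs plus a question asking for the value of one key.
    (Hypothetical format; the paper's actual prompts may differ.)"""
    rng = random.Random(seed)
    # Unique 8-digit keys so the target key appears exactly once.
    keys = rng.sample(range(10**7, 10**8), num_pairs)
    kv = {k: rng.randrange(10**7, 10**8) for k in keys}
    target = rng.choice(keys)
    pairs_text = "\n".join(f"{k}: {v}" for k, v in kv.items())
    prompt = (
        "Below is a list of key-value pairs.\n"
        f"{pairs_text}\n"
        f"What is the value for key {target}?"
    )
    return prompt, str(kv[target])

prompt, answer = make_kv_retrieval_sample(num_pairs=50, seed=0)
```

Because the answer is fully determined by the prompt, samples like this can be generated at any context length and verified automatically, which is what makes the data cheap to scale for finetuning.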

The study also found that the performance of the finetuned LLMs on general benchmarks remains almost constant. By contrast, finetuning on other baseline long-context augmentation data can encourage hallucination and cause performance drops. For instance, on TriviaQA, Mistral 7B finetuned on the synthetic data showed no performance drop, while the baseline datasets caused drops ranging from 2.33% to 6.19%. The research underscores the potential of finetuning on synthetic data to enhance the performance of LLMs on longer-context tasks.

Key takeaways:

  • Large Language Models (LLMs) have been found to struggle with accurately retrieving information and maintaining reasoning capabilities when processing long-context inputs.
  • A finetuning approach using a synthetic dataset comprising numerical key-value retrieval tasks can significantly improve LLMs' information retrieval and reasoning capabilities in longer-context settings.
  • Finetuning on synthetic data can lead to a transfer of skills from synthetic to real task evaluations, with significant improvements noted in models like GPT-3.5 Turbo and Mistral 7B.
  • While the finetuned LLMs' performance on general benchmarks remains almost constant, finetuning on other baseline long-context augmentation data can encourage hallucination and lead to performance drops.
