
🕳️ Attention Sinks in LLMs for endless fluency

Oct 11, 2023 - news.bensbites.co
The article discusses the use of attention sinks with window attention to improve the performance of pretrained chat-style Large Language Models (LLMs) such as Llama, Mistral, MPT, Falcon, and GPT-NeoX (Pythia). The author explains that attention sinks allow these models to maintain fluency across hundreds of subsequent prompts, unlike models loaded with the standard `transformers` API. The approach also keeps memory usage constant, addressing the linear space complexity that causes most LLMs loaded with `transformers` to run out of memory on long inputs.

The author further demonstrates the effectiveness of attention sinks through a series of experiments: perplexity measurements, endless generation, and a chat-assistant workload. The results show that LLMs using window attention with attention sinks maintain constant space complexity and stable perplexity. The author concludes that attention sinks should be considered by any organization or user looking to deploy assistant-style LLMs, and provides a Python module, `attention_sinks`, as a drop-in replacement for the `transformers` API.
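Since the summary does not show usage, here is a minimal sketch of what the drop-in replacement could look like. The model name and the `attention_sink_size`/`attention_sink_window_size` keyword arguments are assumptions for illustration, not confirmed by the article:

```python
# Hypothetical usage of the attention_sinks module as a drop-in
# replacement for the `transformers` loading API. The keyword
# arguments below are assumptions, not taken from the article.
from attention_sinks import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    device_map="auto",
    attention_sink_size=4,           # assumed: number of pinned sink tokens
    attention_sink_window_size=1020, # assumed: size of the sliding window
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

# Generation proceeds exactly as with `transformers`; the key-value
# cache stays bounded, so memory usage is constant regardless of length.
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```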

Key takeaways:

  • Attention sinks, a method that combines window attention with dedicated sink tokens, allow pretrained chat-style LLMs to maintain fluency across hundreds of subsequent prompts while keeping memory usage constant (a sketch of the cache policy follows this list).
  • The method is particularly beneficial for chat-assistant LLMs, as it overcomes the VRAM growth and loss of fluency that commonly occur when models are loaded through the standard `transformers` API.
  • The `attention_sinks` Python module has been released as a drop-in replacement for the `transformers` API. It supports all models using the Llama, Mistral, Falcon, MPT, and GPT-NeoX (Pythia) architectures.
  • While attention sinks can handle infinite-length inputs, they do not expand the context window of LLMs: the model still attends only to the sink tokens plus the most recent tokens, which makes the approach ideal for streaming applications like multi-round dialogues.
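
To make the first and last takeaways concrete, below is a minimal sketch of the cache policy that window attention with attention sinks implies: the first few tokens are pinned as sinks, and the rest of the key-value cache is a sliding window over the most recent tokens. The function and parameter names here are hypothetical; a real implementation would operate on per-layer key/value tensors rather than a flat list.

```python
def evict_kv_cache(cache, sink_size=4, window_size=1020):
    """Keep the first `sink_size` entries (the attention sinks) plus the
    most recent `window_size` entries; drop everything in between.

    `cache` is a list of per-token key/value entries, oldest first.
    Illustrative only; names and sizes are assumptions.
    """
    max_len = sink_size + window_size
    if len(cache) <= max_len:
        return cache
    # Sink tokens stay pinned at the front; the window slides over the tail.
    return cache[:sink_size] + cache[-window_size:]


# Example: after 2000 generated tokens, the cache holds only
# 4 sinks + the 1020 most recent tokens = 1024 entries.
cache = list(range(2000))
cache = evict_kv_cache(cache)
assert len(cache) == 1024
assert cache[:4] == [0, 1, 2, 3]  # sinks preserved
assert cache[-1] == 1999          # newest token preserved
```

This is why the space complexity is constant: no matter how long generation runs, the cache never exceeds `sink_size + window_size` entries.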
