The author further demonstrates the effectiveness of attention sinks through a series of experiments: perplexity measurements, endless-generation tests, and a chat-assistant benchmark. The results show that LLMs using window attention with attention sinks maintain constant memory usage and stable perplexity. The author concludes that attention sinks should be considered by any organization or user looking to deploy assistant-style LLMs, and provides a Python module, attention_sinks, as a drop-in replacement for the `transformers` API.
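To make the constant-memory behaviour concrete, below is a minimal, self-contained sketch of the cache-eviction idea behind attention sinks: the first few tokens (the sinks) are kept permanently, while everything else lives in a fixed-size window of the most recent tokens. The `SinkCache` class and the `sink_size`/`window_size` names are illustrative only and are not taken from the attention_sinks library itself.

```python
from collections import deque


class SinkCache:
    """Toy illustration of the attention-sink eviction policy:
    keep the first `sink_size` token positions forever, plus a
    sliding window of the most recent `window_size` positions."""

    def __init__(self, sink_size: int = 4, window_size: int = 1020):
        self.sink_size = sink_size
        self.sinks: list[int] = []                            # kept permanently
        self.window: deque[int] = deque(maxlen=window_size)   # recent tokens only

    def append(self, position: int) -> None:
        # The first few tokens become permanent "attention sinks";
        # all later tokens pass through a bounded sliding window.
        if len(self.sinks) < self.sink_size:
            self.sinks.append(position)
        else:
            self.window.append(position)

    def cached_positions(self) -> list[int]:
        return self.sinks + list(self.window)


cache = SinkCache(sink_size=4, window_size=8)
for pos in range(20):
    cache.append(pos)

# Cache size stays at sink_size + window_size no matter how long generation runs.
print(cache.cached_positions())  # [0, 1, 2, 3, 12, 13, 14, 15, 16, 17, 18, 19]
```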
Key takeaways:
- Attention sinks, a method that combines window attention with a few attention sink tokens, allow pretrained chat-style Large Language Models (LLMs) to remain fluent across hundreds of consecutive prompts while keeping memory usage constant.
- The attention sinks method is particularly beneficial for chat-assistant LLMs, as it overcomes the growing VRAM usage and loss of fluency that commonly occur with the default `transformers` approach.
- The attention_sinks Python module has been released as a drop-in replacement for the `transformers` API (see the usage sketch after this list). It supports all models using the Llama, Mistral, Falcon, MPT, and GPT-NeoX (Pythia) architectures.
- While attention sinks can handle infinite-length inputs, they do not expand the context window of LLMs: the model still only attends to the most recent tokens (plus the sinks). This makes the approach ideal for streaming applications such as multi-round dialogue.
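As a rough illustration of the drop-in usage mentioned above, the sketch below assumes that attention_sinks re-exports the familiar `AutoModelForCausalLM` loading interface from `transformers`; the `attention_sink_size` and `attention_sink_window_size` keyword arguments are shown as assumptions and may differ from the actual API, so consult the project's README for the exact signature.

```python
from transformers import AutoTokenizer
from attention_sinks import AutoModelForCausalLM  # assumed drop-in re-export

model_name = "mistralai/Mistral-7B-v0.1"  # any of the supported architectures

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    attention_sink_size=4,            # assumed: number of sink tokens kept permanently
    attention_sink_window_size=1020,  # assumed: sliding window of recent tokens
)

# Generate as usual; the KV cache stays at a fixed size no matter how many
# prompts are streamed through the model.
inputs = tokenizer("What are attention sinks?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```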