The authors also observe an interesting phenomenon, termed attention sink: keeping the Key and Value states (KV) of the initial tokens largely recovers the performance of window attention. They further find that adding a placeholder token as a dedicated attention sink during pre-training improves streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window with recomputation baseline by up to a 22.2x speedup. The authors plan to release the core StreamingLLM code, perplexity evaluation code, a Streaming Llama Chatbot demo, and the StreamEval dataset with its evaluation code.
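To make the cache policy concrete, here is a minimal sketch (not the authors' released code) of the KV eviction rule: keep the KV of a few initial attention-sink tokens plus a rolling window of the most recent tokens, and drop everything in between. The function name `evict_kv`, the parameters `sink_size` and `window_size`, and the Hugging Face-style cache layout are illustrative assumptions.

```python
import torch

def evict_kv(past_key_values, sink_size=4, window_size=1020):
    """Keep the KV of the first `sink_size` tokens (attention sinks)
    plus the most recent `window_size` tokens; evict the middle.

    `past_key_values` is assumed to be a list of (key, value) tensor
    pairs of shape [batch, num_heads, seq_len, head_dim].
    """
    seq_len = past_key_values[0][0].size(2)
    if seq_len <= sink_size + window_size:
        return past_key_values  # cache still fits; nothing to evict
    kept = []
    for k, v in past_key_values:
        k = torch.cat([k[:, :, :sink_size], k[:, :, -window_size:]], dim=2)
        v = torch.cat([v[:, :, :sink_size], v[:, :, -window_size:]], dim=2)
        kept.append((k, v))
    return kept
```

Note that, per the paper, positional information is assigned relative to positions within the cache rather than in the original text, so the retained entries remain contiguous from the model's perspective.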
Key takeaways:
- The paper introduces StreamingLLM, an efficient framework that enables Large Language Models (LLMs) to generalize to infinite sequence length without any fine-tuning.
- StreamingLLM enables Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling on sequences of up to 4 million tokens and beyond.
- The authors observed an interesting phenomenon, attention sink: keeping the KV of the initial tokens largely recovers the performance of window attention (see the sketch above).
- In streaming settings, StreamingLLM outperforms the sliding window with recomputation baseline by up to a 22.2x speedup.