The effectiveness of LongRoPE is demonstrated through extensive experiments on LLaMA2 and Mistral across a range of tasks. The method addresses the main obstacles to extending the context window of LLMs: high fine-tuning costs, the scarcity of long training texts, and the catastrophic values introduced by new token positions.
Key takeaways:
- The paper introduces LongRoPE, a method that extends the context window of pre-trained large language models (LLMs) to 2048k tokens, a significant increase from the current limit of around 128k tokens.
- LongRoPE achieves this extension with at most 1k fine-tuning steps at training lengths no longer than 256k tokens, while maintaining performance at the original short context window.
- Three key innovations are introduced: exploiting non-uniformities in positional interpolation via an efficient search (see the sketch after this list), a progressive extension strategy that first fine-tunes at 256k and then interpolates again to reach 2048k, and a readjustment of LongRoPE on 8k lengths to recover the original short-context-window performance.
- The models extended via LongRoPE retain the original architecture with minor modifications to the positional embedding, and can reuse most pre-existing optimizations.
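To make the first innovation concrete, here is a minimal sketch of non-uniform positional interpolation for RoPE: instead of dividing every rotary frequency by a single scale, each dimension gets its own rescale factor, and the earliest positions can be left un-interpolated. The function name, the `linspace` placeholder factors, and the `n_keep` cutoff below are illustrative assumptions, not the paper's searched values or official code.

```python
import torch

def longrope_style_angles(positions, head_dim, rescale_factors, n_keep=0, base=10000.0):
    """Hypothetical sketch (not the official implementation) of non-uniform
    RoPE interpolation in the spirit of LongRoPE.

    - Standard RoPE: angle(m, i) = m * base^(-2i/d).
    - Uniform positional interpolation divides every angle by one shared scale.
    - Here each rotary dimension i gets its own rescale factor, and the first
      `n_keep` positions stay on the original frequencies (the second
      non-uniformity the paper exploits).
    `rescale_factors` stands in for values the paper finds by search; the
    placeholders used below are NOT the searched values.
    """
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)  # (d/2,)
    scaled = inv_freq / rescale_factors                                    # per-dimension interpolation
    m = positions.float().unsqueeze(-1)                                    # (L, 1)
    angles = m * scaled                                                    # (L, d/2)
    if n_keep > 0:
        # Keep the earliest positions un-interpolated.
        keep = positions < n_keep
        angles[keep] = m[keep] * inv_freq
    return angles.cos(), angles.sin()  # fed into the usual RoPE rotation

# Illustrative usage: a 128-dim head, placeholder per-dimension factors,
# and the first 64 positions left untouched.
head_dim = 128
rescale_factors = torch.linspace(1.0, 16.0, head_dim // 2)  # placeholder, not searched
cos, sin = longrope_style_angles(torch.arange(4096), head_dim, rescale_factors, n_keep=64)
```

Because only these rotary frequencies change, the rest of the attention stack is untouched, which is why extended models keep the original architecture and can reuse most pre-existing optimizations.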