The researchers also use a multi-level architecture for the N-gram module to increase the accuracy of the initial draft, which in turn reduces inference latency: a more accurate draft means more proposed tokens survive verification per forward pass. ANPD requires no retraining or additional GPU memory, making it an efficient and easy-to-implement enhancement. In tests, models such as LLaMA and its fine-tuned variants showed speed improvements of up to 3.67x, demonstrating the effectiveness of ANPD.
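To make the multi-level idea concrete, here is a minimal Python sketch of a backoff-style N-gram draft module, where longer contexts are tried before shorter ones. The class name `MultiLevelNgram`, the per-order table layout, and the single-continuation-per-context policy are illustrative assumptions for this example, not the paper's implementation.

```python
from typing import List, Optional

class MultiLevelNgram:
    """Backoff-style multi-level N-gram draft module (illustrative sketch).

    Keeps one lookup table per N-gram order; prediction tries the longest
    context first and backs off to shorter ones, so higher-order (usually
    more accurate) matches win whenever they exist.
    """

    def __init__(self, max_n: int = 3):
        self.max_n = max_n
        # tables[n - 1] maps an n-token context tuple to its next token.
        self.tables = [{} for _ in range(max_n)]

    def update(self, tokens: List[int]) -> None:
        """Index every (context, next-token) pair in the sequence so far."""
        for n in range(1, self.max_n + 1):
            table = self.tables[n - 1]
            for i in range(len(tokens) - n):
                # Keep the most recent continuation seen for this context.
                table[tuple(tokens[i:i + n])] = tokens[i + n]

    def predict(self, tokens: List[int]) -> Optional[int]:
        """Propose the next token, backing off from long to short contexts."""
        for n in range(self.max_n, 0, -1):
            if len(tokens) < n:
                continue
            hit = self.tables[n - 1].get(tuple(tokens[-n:]))
            if hit is not None:
                return hit
        return None
```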
Key takeaways:
- The study introduces Adaptive N-gram Parallel Decoding (ANPD), a new method that accelerates inference by allowing the simultaneous generation of multiple tokens in Large Language Models (LLMs).
- ANPD uses a two-stage approach: a rapid drafting phase, in which a lightweight N-gram module proposes tokens, followed by a verification phase, in which the original LLM confirms or rejects them (see the sketch after this list).
- The method enhances processing speed while preserving the integrity of the LLM's original output, and does not require retraining or extra GPU memory.
- In experiments, models like LLaMA and its fine-tuned variants showed speed improvements of up to 3.67x, confirming the effectiveness of ANPD.
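For illustration, below is a hedged sketch of how the draft-then-verify loop described above could be wired together with the `MultiLevelNgram` module from earlier. The `llm` interface (a single parallel forward pass returning the greedy next token at every position), the function name `anpd_style_decode`, and the parameter defaults are assumptions for this example, not the authors' code.

```python
from typing import Callable, List

def anpd_style_decode(
    llm: Callable[[List[int]], List[int]],
    ngram: "MultiLevelNgram",
    prompt_ids: List[int],
    max_new_tokens: int = 128,
    draft_len: int = 4,
) -> List[int]:
    """Illustrative draft-then-verify loop in the spirit of ANPD.

    Assumes llm(tokens) returns, from one parallel forward pass, the greedy
    next token at every position: preds[i] follows tokens[: i + 1].
    """
    tokens = list(prompt_ids)
    while len(tokens) - len(prompt_ids) < max_new_tokens:
        ngram.update(tokens)

        # Stage 1 -- drafting: the cheap N-gram module proposes several
        # tokens at once, feeding each guess back in as new context.
        draft: List[int] = []
        context = list(tokens)
        for _ in range(draft_len):
            nxt = ngram.predict(context)
            if nxt is None:
                break
            draft.append(nxt)
            context.append(nxt)

        # Stage 2 -- verification: one LLM forward pass scores the current
        # sequence plus the draft; keep the longest draft prefix matching
        # the LLM's own greedy choices.
        preds = llm(tokens + draft)
        accepted = 0
        for j, tok in enumerate(draft):
            if preds[len(tokens) - 1 + j] == tok:
                accepted += 1
            else:
                break
        tokens.extend(draft[:accepted])

        # Always take the LLM's own token at the first mismatch (or when
        # the draft is empty), guaranteeing at least one token of progress.
        tokens.append(preds[len(tokens) - 1])
    return tokens
```

Because every accepted draft token matches the LLM's own greedy choice, the final sequence is token-for-token identical to ordinary greedy decoding; the speedup comes solely from evaluating several positions in a single forward pass instead of one at a time.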