The researchers also use a multi-level architecture for the N-gram module to increase the accuracy of the initial draft, which in turn reduces inference latency: a more accurate draft means more proposed tokens survive verification per forward pass. ANPD requires no retraining or additional GPU memory, making it an efficient and easy-to-implement enhancement. In tests, models such as LLaMA and its fine-tuned variants showed speed improvements of up to 3.67x, demonstrating the effectiveness of ANPD.
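To make the multi-level idea concrete, here is a minimal Python sketch of a backoff-style N-gram draft module, where longer contexts are tried before shorter ones. The class name `MultiLevelNgram`, the per-order table layout, and the single-continuation-per-context policy are illustrative assumptions for this example, not the paper's implementation.

```python
from typing import List, Optional

class MultiLevelNgram:
    """Backoff-style multi-level N-gram draft module (illustrative sketch).

    Keeps one lookup table per N-gram order; prediction tries the longest
    context first and backs off to shorter ones, so higher-order (usually
    more accurate) matches win whenever they exist.
    """

    def __init__(self, max_n: int = 3):
        self.max_n = max_n
        # tables[n - 1] maps an n-token context tuple to its next token.
        self.tables = [{} for _ in range(max_n)]

    def update(self, tokens: List[int]) -> None:
        """Index every (context, next-token) pair in the sequence so far."""
        for n in range(1, self.max_n + 1):
            table = self.tables[n - 1]
            for i in range(len(tokens) - n):
                # Keep the most recent continuation seen for this context.
                table[tuple(tokens[i:i + n])] = tokens[i + n]

    def predict(self, tokens: List[int]) -> Optional[int]:
        """Propose the next token, backing off from long to short contexts."""
        for n in range(self.max_n, 0, -1):
            if len(tokens) < n:
                continue
            hit = self.tables[n - 1].get(tuple(tokens[-n:]))
            if hit is not None:
                return hit
        return None
```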
Key takeaways:
- The study introduces Adaptive N-gram Parallel Decoding (ANPD), a new method that accelerates inference by allowing the simultaneous generation of multiple tokens in Large Language Models (LLMs).
- ANPD uses a two-stage approach: a rapid drafting phase, in which a lightweight N-gram module proposes tokens, followed by a verification phase, in which the original LLM confirms or rejects them (see the sketch after this list).
- The method enhances processing speed while preserving the integrity of the LLM's original output, and does not require retraining or extra GPU memory.
- In experiments, models like LLaMA and its fine-tuned variants showed speed improvements of up to 3.67x, confirming the effectiveness of ANPD.
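For illustration, below is a hedged sketch of how the draft-then-verify loop described above could be wired together with the `MultiLevelNgram` module from earlier. The `llm` interface (a single parallel forward pass returning the greedy next token at every position), the function name `anpd_style_decode`, and the parameter defaults are assumptions for this example, not the authors' code.

```python
from typing import Callable, List

def anpd_style_decode(
    llm: Callable[[List[int]], List[int]],
    ngram: "MultiLevelNgram",
    prompt_ids: List[int],
    max_new_tokens: int = 128,
    draft_len: int = 4,
) -> List[int]:
    """Illustrative draft-then-verify loop in the spirit of ANPD.

    Assumes llm(tokens) returns, from one parallel forward pass, the greedy
    next token at every position: preds[i] follows tokens[: i + 1].
    """
    tokens = list(prompt_ids)
    while len(tokens) - len(prompt_ids) < max_new_tokens:
        ngram.update(tokens)

        # Stage 1 -- drafting: the cheap N-gram module proposes several
        # tokens at once, feeding each guess back in as new context.
        draft: List[int] = []
        context = list(tokens)
        for _ in range(draft_len):
            nxt = ngram.predict(context)
            if nxt is None:
                break
            draft.append(nxt)
            context.append(nxt)

        # Stage 2 -- verification: one LLM forward pass scores the current
        # sequence plus the draft; keep the longest draft prefix matching
        # the LLM's own greedy choices.
        preds = llm(tokens + draft)
        accepted = 0
        for j, tok in enumerate(draft):
            if preds[len(tokens) - 1 + j] == tok:
                accepted += 1
            else:
                break
        tokens.extend(draft[:accepted])

        # Always take the LLM's own token at the first mismatch (or when
        # the draft is empty), guaranteeing at least one token of progress.
        tokens.append(preds[len(tokens) - 1])
    return tokens
```

Because every accepted draft token matches the LLM's own greedy choice, the final sequence is token-for-token identical to ordinary greedy decoding; the speedup comes solely from evaluating several positions in a single forward pass instead of one at a time.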