Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding

Apr 21, 2024 - news.bensbites.com
The study introduces Adaptive N-gram Parallel Decoding (ANPD), a novel approach aimed at accelerating the inference process in Large Language Models (LLMs) by enabling the simultaneous generation of multiple tokens. ANPD uses a two-stage process: a quick drafting phase using an N-gram module that adapts to the current interactive context, and a verification phase where the original LLM checks and confirms the proposed tokens. This method maintains the integrity of the LLM's original output while improving processing speed.
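The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `ngram_table`, `llm_next_token`, and the greedy token-by-token verification are all simplifications (in practice the LLM verifies the whole draft in a single parallel forward pass, which is where the speedup comes from).

```python
def ngram_draft(context, ngram_table, n=2, k=3):
    """Draft up to k tokens by greedy (n-1)-token lookup on the current context."""
    draft, ctx = [], list(context)
    for _ in range(k):
        key = tuple(ctx[-(n - 1):])
        if key not in ngram_table:
            break  # no continuation known for this context
        nxt = ngram_table[key]
        draft.append(nxt)
        ctx.append(nxt)
    return draft

def anpd_step(context, llm_next_token, ngram_table):
    """One ANPD iteration: draft cheaply, then verify with the original LLM.
    The accepted prefix of the draft is kept; the first mismatch is replaced
    by the LLM's own token, so the output is identical to plain decoding."""
    draft = ngram_draft(context, ngram_table)
    accepted, ctx = [], list(context)
    for tok in draft:
        verified = llm_next_token(ctx)  # one batched pass in a real system
        accepted.append(verified)
        ctx.append(verified)
        if verified != tok:
            return accepted  # reject the rest of the draft
    accepted.append(llm_next_token(ctx))  # bonus token after a full accept
    return accepted
```

Because every emitted token is either confirmed or produced by the LLM itself, the decoding is lossless: the output matches what the LLM alone would have generated.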

The researchers also use a multi-level architecture for the N-gram module to increase the accuracy of the initial draft, further reducing inference latency. ANPD requires no retraining and no additional GPU memory, making it an efficient, easy-to-implement enhancement. In tests, models such as LLaMA and its fine-tuned variants achieved speedups of up to 3.67x, demonstrating the effectiveness of the proposed ANPD.
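A multi-level N-gram module can be sketched as a set of tables built on the fly from the current context, queried from the highest order down so that longer, more specific matches win. The table layout and function names here are illustrative assumptions, not the paper's exact design:

```python
def build_tables(tokens, orders=(2, 3)):
    """Build adaptive n-gram tables from the tokens seen so far.
    Later occurrences overwrite earlier ones, favoring recent context."""
    tables = {}
    for n in orders:
        t = {}
        for i in range(len(tokens) - n + 1):
            t[tuple(tokens[i:i + n - 1])] = tokens[i + n - 1]
        tables[n] = t
    return tables

def multilevel_lookup(context, tables):
    """Query tables from highest order to lowest; return the first hit.
    tables maps n (>= 2) to {(n-1)-token tuple: next token}."""
    for n in sorted(tables, reverse=True):
        nxt = tables[n].get(tuple(context[-(n - 1):]))
        if nxt is not None:
            return nxt
    return None  # no level has a prediction; fall back to the LLM
```

Higher-order matches are rarer but more accurate, so trying them first raises the draft acceptance rate while the lower orders keep a fallback prediction available.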

Key takeaways:

  • The study introduces Adaptive N-gram Parallel Decoding (ANPD), a new method that accelerates inference by allowing the simultaneous generation of multiple tokens in Large Language Models (LLMs).
  • ANPD uses a two-stage approach: a rapid drafting phase with an N-gram module, and a verification phase where the original LLM confirms the proposed tokens.
  • The method enhances processing speed while preserving the integrity of the LLM's original output, and does not require retraining or extra GPU memory.
  • In experiments, models like LLaMA and its fine-tuned variants showed speed improvements up to 3.67x, proving the effectiveness of ANPD.
