Meta’s new multi-token prediction makes AI models up to 3X faster

May 06, 2024 - venturebeat.com
Researchers at Meta, Ecole des Ponts ParisTech, and Université Paris-Saclay have proposed a method to improve the speed and accuracy of large language models (LLMs) by making them predict multiple tokens simultaneously. This approach departs from the traditional structure of auto-regressive language models, which predict one token at a time, and has shown significant benefits in some areas, including up to three times faster inference and improved performance on generative tasks.
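At a high level, the idea is to attach several output heads to a shared model trunk, with head k trained to predict the token k+1 steps ahead of the current position. The PyTorch sketch below illustrates that general shape; the class name, head layout, and summed cross-entropy loss are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MultiTokenHeads(nn.Module):
    """Illustrative sketch: n independent output heads on a shared trunk.

    Head k is trained to predict the token k+1 steps ahead, so a single
    forward pass of the trunk supervises n future positions at once.
    (Name and layout are assumptions for illustration.)
    """
    def __init__(self, hidden_dim: int, vocab_size: int, n_heads: int = 4):
        super().__init__()
        # One linear head per future offset; the Transformer trunk that
        # produces `trunk_hidden` is shared across all heads.
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, vocab_size) for _ in range(n_heads)
        )

    def forward(self, trunk_hidden: torch.Tensor) -> torch.Tensor:
        # trunk_hidden: (batch, seq_len, hidden_dim) from the shared trunk.
        # Returns logits of shape (n_heads, batch, seq_len, vocab_size).
        return torch.stack([head(trunk_hidden) for head in self.heads])

def multi_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Sum cross-entropy over the n future-token targets.

    logits: (n_heads, batch, seq_len, vocab); tokens: (batch, seq_len).
    """
    n_heads = logits.shape[0]
    loss = torch.zeros((), device=tokens.device)
    for k in range(n_heads):
        # Head k at position t is supervised with token t + k + 1,
        # so trim k+1 positions off each end accordingly.
        pred = logits[k, :, : -(k + 1), :]
        target = tokens[:, k + 1 :]
        loss = loss + nn.functional.cross_entropy(
            pred.reshape(-1, pred.size(-1)), target.reshape(-1)
        )
    return loss
```

One practical appeal of this design is that the extra heads can simply be discarded after training, leaving a standard next-token model, which is consistent with the article's note that most of the LLM architecture is left intact.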

The researchers tested the multi-token prediction scheme on a variety of tasks with models ranging from 300 million to 13 billion parameters. They found that multi-token prediction makes models up to three times faster at inference across a wide range of batch sizes. It also promotes learning of longer-term patterns, especially in experiments where the model is trained on “byte-level tokenization.” The research could benefit enterprise applications by providing faster inference and higher accuracy for generative tasks such as code completion.
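The article does not spell out the mechanism behind the inference speedup, but a natural way to exploit the extra heads at inference is speculative-style decoding: draft several future tokens in one forward pass, then keep only the prefix the standard next-token head would itself have produced. The greedy sketch below is an illustrative assumption about how that could look, not Meta's implementation; `model` is assumed to return stacked per-head logits as in the previous sketch.

```python
import torch

@torch.no_grad()
def speculative_step(model, context: torch.Tensor) -> torch.Tensor:
    """One greedy self-speculative decoding step (illustrative sketch).

    `model(ids)` is assumed to return logits of shape
    (n_heads, batch, seq_len, vocab): head 0 is the ordinary
    next-token head, heads 1..n-1 predict further-ahead tokens.
    """
    logits = model(context)                   # one trunk forward pass
    # Draft n tokens: each head's greedy pick at the last position.
    draft = logits[:, :, -1, :].argmax(-1).T  # (batch, n_heads)

    # Verify: run the trunk once on context + draft and keep the longest
    # draft prefix that head 0 would have generated greedily itself.
    extended = torch.cat([context, draft], dim=1)
    verify = model(extended)[0].argmax(-1)    # head-0 predictions
    accepted = 0
    for k in range(draft.size(1)):
        # Head 0 at position (len(context)-1+k) predicts draft token k.
        # (`.all()` over the batch is a simplification for this sketch.)
        if (verify[:, context.size(1) - 1 + k] == draft[:, k]).all():
            accepted += 1
        else:
            break
    # Under causal attention the first draft token always matches head 0,
    # so at least one token is accepted per step.
    return torch.cat([context, draft[:, :accepted]], dim=1)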

Key takeaways:

  • Researchers propose improving the accuracy and speed of large language models (LLMs) by making them predict multiple tokens simultaneously, a departure from the classic one-token-at-a-time structure of auto-regressive language models.
  • The multi-token prediction technique provides substantial benefits in some areas, tripling generation speed and improving performance on generative tasks, and could become a powerful tool for some LLM applications.
  • Multi-token prediction makes models up to three times faster at inference across a wide range of batch sizes and promotes learning of longer-term patterns, especially in experiments where the model is trained on “byte-level tokenization.”
  • For enterprise applications, multi-token prediction could provide faster inference and higher accuracy at little or no extra cost for generative tasks such as code completion. Because it leaves most of the LLM architecture intact, it remains compatible with other optimization techniques for the Transformer block.