The researchers tested the multi-token prediction scheme on a variety of tasks with models ranging from 300 million to 13 billion parameters. The study found that multi-token prediction makes models up to three times faster at inference time across a wide range of batch sizes. It also promotes learning longer-term patterns, especially in experiments where the model is trained on "byte-level tokenization." The research could be beneficial for enterprise applications, providing faster inference and higher accuracy for generative tasks such as code completion.
Key takeaways:
- Researchers suggest improving the accuracy and speed of large language models (LLMs) by having them predict multiple future tokens simultaneously, a method that departs from the classic next-token structure of auto-regressive language models (see the sketch after this list).
- The multi-token prediction technique provides substantial benefits in some areas, delivering up to threefold inference speedups and better performance on generative tasks, and could become a powerful tool for certain LLM applications.
- Multi-token prediction makes models up to three times faster at inference across a wide range of batch sizes and promotes learning longer-range patterns, especially when models are trained with byte-level tokenization.
- For enterprise applications, multi-token prediction promises faster inference and higher accuracy at little or no extra cost for generative tasks such as code completion. Because it leaves most of the LLM architecture intact, it remains compatible with other optimization techniques for the Transformer block.
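To make the core idea concrete, here is a minimal sketch of multi-token prediction in PyTorch, assuming a shared transformer trunk whose hidden states feed several independent output heads, one per future-token offset. The class and function names (MultiTokenPredictor, multi_token_loss) and all hyperparameters are illustrative, not taken from the researchers' code.

```python
# Minimal sketch of multi-token prediction (illustrative, not the paper's implementation).
# Assumption: a shared transformer "trunk" produces one hidden state per position,
# and n_future lightweight heads each predict a different one of the next n_future tokens.
import torch
import torch.nn as nn

class MultiTokenPredictor(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_layers=6, n_heads=8, n_future=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)   # shared trunk
        # One independent output head per future offset (t+1, t+2, ..., t+n_future).
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_future)]
        )

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer ids
        seq_len = tokens.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.trunk(self.embed(tokens), mask=causal_mask)   # (batch, seq, d_model)
        # Each head maps the same hidden state to logits for a different future offset.
        return [head(h) for head in self.heads]                # list of (batch, seq, vocab)

def multi_token_loss(logits_per_head, tokens):
    # Head k at position t is trained to predict the token at position t + k + 1.
    loss = 0.0
    for k, logits in enumerate(logits_per_head):
        offset = k + 1
        pred = logits[:, :-offset, :]   # positions that still have a target offset steps ahead
        target = tokens[:, offset:]     # ground-truth tokens offset steps in the future
        loss = loss + nn.functional.cross_entropy(
            pred.reshape(-1, pred.size(-1)), target.reshape(-1)
        )
    return loss / len(logits_per_head)
```

The design choice to share one trunk across all heads is what keeps the extra training cost small: only the final projection layers are duplicated, while the bulk of the Transformer computation is reused for every predicted offset.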