The companies have developed a new approach called sparse fine-tuning, which combines one-shot pruning, sparse pretraining, and fine-tuning on specific datasets to create highly sparse LLMs. The resulting sparse models match the accuracy of their dense counterparts while carrying up to 70% fewer parameters. Neural Magic’s DeepSparse engine addresses the challenge of deploying sparse LLMs by delivering up to 3x faster inference on CPUs than dense baselines. Together, these advances pave the way for more efficient training and deployment of LLMs, making them accessible to a broader range of organizations and industries.
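The one-shot pruning step can be pictured as a single magnitude-based sweep over a network's weights. The sketch below is a simplified, hypothetical illustration in plain PyTorch (the function name `one_shot_magnitude_prune` and the toy model are assumptions, not the authors' code); the actual recipe relies on a more sophisticated one-shot pruner combined with sparse pretraining rather than plain magnitude pruning.

```python
# Hypothetical illustration of one-shot magnitude pruning; not the authors' method.
import torch.nn as nn


def one_shot_magnitude_prune(model: nn.Module, sparsity: float = 0.7) -> None:
    """Zero out the smallest-magnitude weights in every Linear layer."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            weight = module.weight.data
            k = int(weight.numel() * sparsity)  # number of weights to zero out
            if k == 0:
                continue
            # The k-th smallest absolute value becomes the pruning threshold.
            threshold = weight.abs().flatten().kthvalue(k).values
            weight.masked_fill_(weight.abs() <= threshold, 0.0)


# Toy usage: prune a small MLP to roughly 70% sparsity and report the result.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
one_shot_magnitude_prune(model, sparsity=0.7)
weights = [m.weight for m in model.modules() if isinstance(m, nn.Linear)]
zeros = sum((w == 0).sum().item() for w in weights)
total = sum(w.numel() for w in weights)
print(f"weight sparsity: {zeros / total:.2%}")
```

In the approach described here, this one-shot step is only the starting point; sparse pretraining and dataset-specific fine-tuning are what recover the accuracy of the dense model.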
Key takeaways:
- Cerebras and Neural Magic have achieved a breakthrough in large language models (LLMs) by unlocking unprecedented levels of sparsity, enabling up to 70% parameter reduction without compromising accuracy.
- The approach combines one-shot pruning, sparse pretraining, and fine-tuning on task-specific datasets to create highly sparse LLMs that retain the accuracy of their dense counterparts.
- Neural Magic’s DeepSparse engine delivers up to 3x faster inference compared to dense models, making sparse LLMs more accessible and cost-effective for real-world applications (a usage sketch follows this list).
- To facilitate the adoption and further development of sparse LLMs, Cerebras and Neural Magic are releasing a comprehensive package containing the training recipe, model weights, code, data, and documentation.
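For deployment, Neural Magic ships DeepSparse as a Python package with a text-generation pipeline for running sparse models on CPUs. The snippet below is a minimal sketch under that assumption; the `TextGeneration` entry point follows Neural Magic's documented API, while the SparseZoo model stub and generation parameters shown here are placeholders rather than the models released with this work, and exact names may differ between releases.

```python
# Minimal sketch of CPU inference with DeepSparse; the model stub below is a
# placeholder, and parameter names may vary between deepsparse releases.
from deepsparse import TextGeneration

# Hypothetical stub for a sparse, fine-tuned model hosted on SparseZoo.
pipeline = TextGeneration(model="zoo:example-sparse-llm-pruned70")

output = pipeline(prompt="Summarize why sparse LLMs can run efficiently on CPUs.",
                  max_new_tokens=128)
print(output.generations[0].text)
```

Because the sparse weights skip most multiply-accumulate work, a pipeline like this is where the reported CPU speedups over dense baselines would show up in practice.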