The companies have developed a new approach called sparse fine-tuning, which combines one-shot pruning, sparse pretraining, and fine-tuning on specific datasets to create highly sparse LLMs. The resulting sparse models match the accuracy of their dense counterparts while carrying up to 70% fewer parameters. Neural Magic’s DeepSparse engine addresses the challenge of deploying sparse LLMs by delivering up to 3x faster inference on CPUs than dense baselines. Together, these advances pave the way for more efficient training and deployment of LLMs, making them accessible to a broader range of organizations and industries.
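The one-shot pruning step can be pictured as a single magnitude-based sweep over a network's weights. The sketch below is a simplified, hypothetical illustration in plain PyTorch (the function name `one_shot_magnitude_prune` and the toy model are assumptions, not the authors' code); the actual recipe relies on a more sophisticated one-shot pruner combined with sparse pretraining rather than plain magnitude pruning.

```python
# Hypothetical illustration of one-shot magnitude pruning; not the authors' method.
import torch.nn as nn


def one_shot_magnitude_prune(model: nn.Module, sparsity: float = 0.7) -> None:
    """Zero out the smallest-magnitude weights in every Linear layer."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            weight = module.weight.data
            k = int(weight.numel() * sparsity)  # number of weights to zero out
            if k == 0:
                continue
            # The k-th smallest absolute value becomes the pruning threshold.
            threshold = weight.abs().flatten().kthvalue(k).values
            weight.masked_fill_(weight.abs() <= threshold, 0.0)


# Toy usage: prune a small MLP to roughly 70% sparsity and report the result.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
one_shot_magnitude_prune(model, sparsity=0.7)
weights = [m.weight for m in model.modules() if isinstance(m, nn.Linear)]
zeros = sum((w == 0).sum().item() for w in weights)
total = sum(w.numel() for w in weights)
print(f"weight sparsity: {zeros / total:.2%}")
```

In the approach described here, this one-shot step is only the starting point; sparse pretraining and dataset-specific fine-tuning are what recover the accuracy of the dense model.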
Key takeaways:
- Cerebras and Neural Magic have achieved a breakthrough in large language models (LLMs) by unlocking unprecedented levels of sparsity, enabling up to 70% parameter reduction without compromising accuracy.
- The approach combines one-shot pruning, sparse pretraining, and fine-tuning on task-specific datasets to create highly sparse LLMs that retain the accuracy of their dense counterparts.
- Neural Magic’s DeepSparse engine delivers up to 3x faster inference compared to dense models, making sparse LLMs more accessible and cost-effective for real-world applications (a usage sketch follows this list).
- To facilitate the adoption and further development of sparse LLMs, Cerebras and Neural Magic are releasing a comprehensive package containing the training recipe, model weights, code, data, and documentation.
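For deployment, Neural Magic ships DeepSparse as a Python package with a text-generation pipeline for running sparse models on CPUs. The snippet below is a minimal sketch under that assumption; the `TextGeneration` entry point follows Neural Magic's documented API, while the SparseZoo model stub and generation parameters shown here are placeholders rather than the models released with this work, and exact names may differ between releases.

```python
# Minimal sketch of CPU inference with DeepSparse; the model stub below is a
# placeholder, and parameter names may vary between deepsparse releases.
from deepsparse import TextGeneration

# Hypothetical stub for a sparse, fine-tuned model hosted on SparseZoo.
pipeline = TextGeneration(model="zoo:example-sparse-llm-pruned70")

output = pipeline(prompt="Summarize why sparse LLMs can run efficiently on CPUs.",
                  max_new_tokens=128)
print(output.generations[0].text)
```

Because the sparse weights skip most multiply-accumulate work, a pipeline like this is where the reported CPU speedups over dense baselines would show up in practice.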