The model demonstrates strong performance across few-shot benchmarks, fine-tuning tasks, and inference scenarios. It achieved 98.4% accuracy recovery on the Open LLM Leaderboard V1 and 97.3% recovery on the more challenging Mosaic Eval Gauntlet. During fine-tuning on math, code, and conversational AI tasks, it achieved full accuracy recovery and, in some cases, outperformed its dense counterparts. Combining 4-bit post-training quantization with 2:4 sparsity delivered substantial inference speedups on a nightly build of vLLM, with minimal accuracy impact in most cases.
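As a rough illustration of that deployment path, the sketch below shows how a sparse, quantized checkpoint could be served with vLLM's offline generation API. The model identifier is an assumption for illustration only; substitute the actual checkpoint published on Hugging Face.

```python
# Minimal sketch: serving a 2:4-sparse, 4-bit quantized Llama checkpoint with vLLM.
# The model ID below is hypothetical; use the checkpoint actually published on
# the Neural Magic Hugging Face organization.
from vllm import LLM, SamplingParams

llm = LLM(model="neuralmagic/Sparse-Llama-3.1-8B-2of4-quantized")  # hypothetical ID
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(
    ["Explain 2:4 structured sparsity in one paragraph."],
    params,
)
print(outputs[0].outputs[0].text)
```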
Key takeaways:
- Neural Magic has introduced Sparse Llama 3.1 8B, a sparse, highly accurate foundation model built on Meta's Llama 3.1 8B that delivers up to 30% higher throughput and 1.8x lower latency.
- The Sparse Llama model is designed to shrink large language models by removing unneeded connections while retaining accuracy, making it less resource-intensive and less costly to scale AI (a minimal sketch of the 2:4 pattern follows this list).
- The model has shown exceptional performance across few-shot benchmarks, fine-tuning tasks, and inference scenarios, demonstrating the efficiency and versatility of 2:4 sparsity.
- Neural Magic has made the Sparse Llama base model and fine-tuned versions available on their Hugging Face organization, with open-sourced weights, evaluations, and benchmarks to encourage community experimentation and development.
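To make the 2:4 pattern concrete, here is a minimal PyTorch sketch that enforces and verifies the constraint that at most two of every four consecutive weights along a row are nonzero. It is an illustration only, not Neural Magic's pruning pipeline, and the function and variable names are invented for the example.

```python
# Illustrative sketch of 2:4 structured sparsity: within every contiguous group
# of four weights along a row, at most two entries may be nonzero.
import torch

def is_two_of_four_sparse(weight: torch.Tensor) -> bool:
    """Return True if every group of 4 consecutive weights has <= 2 nonzeros."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "column count must be divisible by 4"
    groups = weight.reshape(rows, cols // 4, 4)
    nonzeros_per_group = (groups != 0).sum(dim=-1)
    return bool((nonzeros_per_group <= 2).all())

# Example: prune each group of 4 down to its 2 largest-magnitude entries
# (a simple magnitude-based stand-in for a real pruning recipe).
w = torch.randn(8, 16)
groups = w.reshape(8, -1, 4)
keep_idx = groups.abs().topk(2, dim=-1).indices
mask = torch.zeros_like(groups).scatter_(-1, keep_idx, 1.0)
w_sparse = (groups * mask).reshape(8, 16)
assert is_two_of_four_sparse(w_sparse)
```

This structured pattern is what allows sparse hardware kernels to skip the pruned weights and translate the reduced model size into real throughput and latency gains.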