
2:4 Sparse Llama: Smaller Models for Efficient GPU Inference

Nov 29, 2024 - news.bensbites.com
Neural Magic has introduced Sparse Llama 3.1 8B, a sparse, highly accurate foundation model built on top of Meta's Llama 3.1 8B. The model shrinks its dense counterpart by removing unnecessary connections while retaining accuracy, opening a promising dimension in model compression and efficiency. It uses a 2:4 sparsity pattern, in which two of every four consecutive weights are zero, a structure that NVIDIA Ampere GPUs and newer accelerate in hardware; sparsity alone delivers up to 30% higher throughput and 1.8x lower latency with vLLM. It also composes with advanced 4-bit quantization methods like GPTQ and the efficient Sparse-Marlin inference kernels, yielding total inference speedups of 1.4x to 4.9x depending on the hardware and scenario.
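To make the pattern concrete, here is a minimal PyTorch sketch that applies 2:4 magnitude pruning to a weight tensor: in every contiguous group of four weights, the two smallest-magnitude entries are zeroed. This only illustrates the pattern itself; Neural Magic's actual recipe uses more careful one-shot pruning plus retraining to recover accuracy, not plain magnitude pruning.

```python
import torch

def prune_2_of_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude weights in every contiguous group
    of 4 along the last dimension -- the 2:4 structured-sparsity pattern
    that Ampere-and-newer sparse tensor cores accelerate.

    Illustrative only: production pipelines use smarter pruning criteria.
    """
    assert weight.shape[-1] % 4 == 0, "last dim must be divisible by 4"
    groups = weight.reshape(-1, 4)
    # Indices of the 2 largest-magnitude entries in each group of 4.
    keep = groups.abs().topk(k=2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep, True)
    return (groups * mask).reshape(weight.shape)

w = torch.randn(8, 16)
w_sparse = prune_2_of_4(w)
# Each group of 4 now contains exactly 2 zeros -> 50% sparsity overall.
print((w_sparse.reshape(-1, 4) == 0).sum(dim=-1))
```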

The model demonstrates strong performance across few-shot benchmarks, fine-tuning tasks, and inference scenarios. It achieved 98.4% accuracy recovery on the Open LLM Leaderboard V1 and 97.3% recovery on the more challenging Mosaic Eval Gauntlet. When fine-tuned on math, code, and conversational AI tasks, it achieved full accuracy recovery and in some cases outperformed its dense counterpart. Combining 4-bit post-training quantization with 2:4 sparsity delivered impressive inference speedups with vLLM nightly, with minimal accuracy impact in most cases.
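For readers who want to try the model, a minimal vLLM inference sketch follows. The checkpoint id shown is an assumption for illustration; check Neural Magic's Hugging Face organization for the exact repo names of the sparse and sparse-quantized releases.

```python
from vllm import LLM, SamplingParams

# Hypothetical repo id -- substitute the actual checkpoint name from
# Neural Magic's Hugging Face organization.
llm = LLM(model="neuralmagic/Sparse-Llama-3.1-8B-2of4")
params = SamplingParams(temperature=0.0, max_tokens=128)

outputs = llm.generate(
    ["Explain 2:4 structured sparsity in one paragraph."], params
)
print(outputs[0].outputs[0].text)
```

On supported hardware, vLLM dispatches the sparse (and, for quantized checkpoints, Sparse-Marlin) kernels automatically based on the checkpoint's configuration, so the calling code is the same as for a dense model.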

Key takeaways:

  • Neural Magic has introduced Sparse Llama 3.1 8B, a sparse, highly accurate foundation model built on top of Meta’s Llama 3.1 8B, which offers up to 30% higher throughput and 1.8x lower latency.
  • The Sparse Llama model is designed to reduce the size of large language models by removing unneeded connections while retaining accuracy, making AI scaling less resource-intensive and less costly.
  • The model has shown exceptional performance across few-shot benchmarks, fine-tuning tasks, and inference scenarios, demonstrating the efficiency and versatility of 2:4 sparsity.
  • Neural Magic has made the Sparse Llama base model and fine-tuned versions available on their Hugging Face organization, with open-sourced weights, evaluations, and benchmarks to encourage community experimentation and development.
