When applied to the Mistral and Mixtral models, the neuron sparsification method activates only 2.5 billion and 4.3 billion parameters per inference iteration, respectively, while improving model performance. This sparsity yields a 2-5x decoding speedup, with the TurboSparse-Mixtral-47B model reaching an inference speed of 11 tokens per second on mobile phones.
Key takeaways:
- The paper proposes a novel dReLU activation function to improve activation sparsity in large language models (LLMs), which can significantly accelerate inference without compromising performance (a minimal sketch contrasting SwiGLU and dReLU follows this list).
- Commonly used activation functions like SwiGLU and GeGLU exhibit limited activation sparsity, and simply swapping them for ReLU still does not yield sufficient sparsity.
- The authors also propose a high-quality training data mixture ratio to facilitate effective sparsification, and they exploit the sparse activation patterns within the Feed-Forward Network (FFN) experts of Mixture-of-Experts (MoE) models to boost efficiency (see the sparsity-aware decoding sketch after this list).
- By applying their neuron sparsification method to the Mistral and Mixtral models, they achieved a 2-5x decoding speedup, with TurboSparse-Mixtral-47B reaching 11 tokens per second on mobile phones.
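
To make the dReLU idea concrete, here is a minimal PyTorch sketch contrasting a standard SwiGLU gated FFN (as used in Mistral/Mixtral) with a dReLU variant that applies ReLU to both the gate and up projections. This is an illustration under stated assumptions, not the authors' implementation: module names, dimensions, and initialization are hypothetical, and the released TurboSparse models should be consulted for the exact formulation.

```python
# Hedged sketch: SwiGLU FFN vs. a dReLU-style FFN.
# Names (gate_proj, up_proj, down_proj) and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN: SiLU(x W_gate) * (x W_up), then W_down. Activations are rarely exactly zero."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class DReLUFFN(nn.Module):
    """dReLU-style variant: ReLU on both the gate and up branches, so a neuron's
    output is exactly zero unless both branches are positive, which pushes
    activation sparsity much higher than SwiGLU."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.relu(self.gate_proj(x)) * F.relu(self.up_proj(x)))

if __name__ == "__main__":
    # Quick check of how sparse the intermediate activations are.
    x = torch.randn(4, 128, 1024)                      # (batch, seq, d_model)
    ffn = DReLUFFN(d_model=1024, d_ff=4096)
    hidden = F.relu(ffn.gate_proj(x)) * F.relu(ffn.up_proj(x))
    print("fraction of zero activations:", (hidden == 0).float().mean().item())
```

The key design point is that zeroing both branches makes the element-wise product exactly zero for most neurons, which is what downstream sparse kernels can exploit.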
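
To see why this sparsity translates into a decoding speedup, the sketch below runs a single decode step and touches only the down-projection columns of neurons whose dReLU activation is non-zero. This is a simplified illustration of the general principle, not the paper's predictors or custom kernels; the function name and dense indexing are assumptions made for clarity.

```python
# Hedged sketch: skipping inactive neurons during single-token decoding.
# sparse_ffn_decode_step is a hypothetical helper, not an API from the paper.
import torch
import torch.nn.functional as F

@torch.no_grad()
def sparse_ffn_decode_step(x, gate_w, up_w, down_w):
    """x: (d_model,); gate_w, up_w: (d_ff, d_model); down_w: (d_model, d_ff)."""
    gate = F.relu(gate_w @ x)                   # dReLU gate branch
    up = F.relu(up_w @ x)                       # dReLU up branch
    hidden = gate * up                          # zero wherever either branch is zero
    active = hidden.nonzero(as_tuple=True)[0]   # indices of activated neurons
    # Only the down-projection columns of active neurons contribute to the output.
    out = down_w[:, active] @ hidden[active]
    return out, active.numel()

if __name__ == "__main__":
    d_model, d_ff = 1024, 4096
    x = torch.randn(d_model)
    gate_w, up_w = torch.randn(d_ff, d_model), torch.randn(d_ff, d_model)
    down_w = torch.randn(d_model, d_ff)
    out, n_active = sparse_ffn_decode_step(x, gate_w, up_w, down_w)
    print(f"active neurons: {n_active}/{d_ff}")
```

In a real deployment the active set is predicted before the FFN runs and the skipped weights are never loaded from memory, which is where the memory-bandwidth savings, and hence the on-device speedup, come from.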