When applied to the Mistral and Mixtral models, the neuron sparsification method activates only 2.5 billion and 4.3 billion parameters per inference iteration, respectively, while improving model performance. This sparsity yields a 2-5x decoding speedup, with the TurboSparse-Mixtral-47B model reaching an inference speed of 11 tokens per second on mobile phones.
Key takeaways:
- The paper proposes a novel dReLU activation function to improve activation sparsity in large language models (LLMs), which can significantly accelerate inference without compromising performance (a minimal sketch contrasting SwiGLU and dReLU follows this list).
- Commonly used activation functions like SwiGLU and GeGLU exhibit limited activation sparsity, and simply swapping them for ReLU still does not yield sufficient sparsity.
- The authors also propose a high-quality training data mixture ratio to facilitate effective sparsification, and they exploit the sparse activation patterns within the Feed-Forward Network (FFN) experts of Mixture-of-Experts (MoE) models to boost efficiency (see the sparsity-aware decoding sketch after this list).
- By applying their neuron sparsification method to the Mistral and Mixtral models, they achieved a 2-5x decoding speedup, with TurboSparse-Mixtral-47B reaching 11 tokens per second on mobile phones.
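
To make the dReLU idea concrete, here is a minimal PyTorch sketch contrasting a standard SwiGLU gated FFN (as used in Mistral/Mixtral) with a dReLU variant that applies ReLU to both the gate and up projections. This is an illustration under stated assumptions, not the authors' implementation: module names, dimensions, and initialization are hypothetical, and the released TurboSparse models should be consulted for the exact formulation.

```python
# Hedged sketch: SwiGLU FFN vs. a dReLU-style FFN.
# Names (gate_proj, up_proj, down_proj) and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated FFN: SiLU(x W_gate) * (x W_up), then W_down. Activations are rarely exactly zero."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class DReLUFFN(nn.Module):
    """dReLU-style variant: ReLU on both the gate and up branches, so a neuron's
    output is exactly zero unless both branches are positive, which pushes
    activation sparsity much higher than SwiGLU."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)
        self.up_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.relu(self.gate_proj(x)) * F.relu(self.up_proj(x)))

if __name__ == "__main__":
    # Quick check of how sparse the intermediate activations are.
    x = torch.randn(4, 128, 1024)                      # (batch, seq, d_model)
    ffn = DReLUFFN(d_model=1024, d_ff=4096)
    hidden = F.relu(ffn.gate_proj(x)) * F.relu(ffn.up_proj(x))
    print("fraction of zero activations:", (hidden == 0).float().mean().item())
```

The key design point is that zeroing both branches makes the element-wise product exactly zero for most neurons, which is what downstream sparse kernels can exploit.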
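
To see why this sparsity translates into a decoding speedup, the sketch below runs a single decode step and touches only the down-projection columns of neurons whose dReLU activation is non-zero. This is a simplified illustration of the general principle, not the paper's predictors or custom kernels; the function name and dense indexing are assumptions made for clarity.

```python
# Hedged sketch: skipping inactive neurons during single-token decoding.
# sparse_ffn_decode_step is a hypothetical helper, not an API from the paper.
import torch
import torch.nn.functional as F

@torch.no_grad()
def sparse_ffn_decode_step(x, gate_w, up_w, down_w):
    """x: (d_model,); gate_w, up_w: (d_ff, d_model); down_w: (d_model, d_ff)."""
    gate = F.relu(gate_w @ x)                   # dReLU gate branch
    up = F.relu(up_w @ x)                       # dReLU up branch
    hidden = gate * up                          # zero wherever either branch is zero
    active = hidden.nonzero(as_tuple=True)[0]   # indices of activated neurons
    # Only the down-projection columns of active neurons contribute to the output.
    out = down_w[:, active] @ hidden[active]
    return out, active.numel()

if __name__ == "__main__":
    d_model, d_ff = 1024, 4096
    x = torch.randn(d_model)
    gate_w, up_w = torch.randn(d_ff, d_model), torch.randn(d_ff, d_model)
    down_w = torch.randn(d_model, d_ff)
    out, n_active = sparse_ffn_decode_step(x, gate_w, up_w, down_w)
    print(f"active neurons: {n_active}/{d_ff}")
```

In a real deployment the active set is predicted before the FFN runs and the skipped weights are never loaded from memory, which is where the memory-bandwidth savings, and hence the on-device speedup, come from.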