PowerInfer-2: Fast Large Language Model Inference on a Smartphone

Jun 12, 2024 - news.bensbites.com
The article introduces PowerInfer-2, a framework for high-speed inference of large language models (LLMs) on smartphones, particularly effective for models whose sizes exceed the device's memory capacity. PowerInfer-2 exploits the heterogeneous compute, memory, and I/O resources of smartphones by decomposing traditional matrix computations into fine-grained neuron-cluster computations. It features a polymorphic neuron engine that adapts its computational strategy to the different stages of LLM inference, and it introduces segmented neuron caching and fine-grained neuron-cluster-level pipelining to minimize and hide the overhead of I/O operations.
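To make the neuron-cluster idea concrete, the following is a minimal sketch (not PowerInfer-2's actual code) of a feed-forward layer that computes only the clusters of neuron rows a predictor marks as active. The sizes, the cluster granularity, and the threshold-based predictor are all illustrative assumptions; the paper's predictor and kernels are far more elaborate.

import numpy as np

# Toy sizes for illustration only; real models are much larger.
HIDDEN = 1024      # model hidden dimension (assumed)
NEURONS = 4096     # FFN intermediate dimension (assumed)
CLUSTER = 64       # neurons per cluster (granularity is an assumption)

rng = np.random.default_rng(0)
W_up = rng.standard_normal((NEURONS, HIDDEN)).astype(np.float32)
W_down = rng.standard_normal((HIDDEN, NEURONS)).astype(np.float32)

def predict_active_clusters(x):
    """Stand-in for the paper's activation predictor: here we just pick
    clusters whose mean |pre-activation| clears a threshold. (A real
    predictor avoids this dense pass; this is only for the demo.)"""
    scores = np.abs(W_up @ x).reshape(-1, CLUSTER).mean(axis=1)
    return np.flatnonzero(scores > scores.mean())

def ffn_sparse(x):
    """Run the FFN over predicted-active neuron clusters only."""
    out = np.zeros(HIDDEN, dtype=np.float32)
    for c in predict_active_clusters(x):
        rows = slice(c * CLUSTER, (c + 1) * CLUSTER)
        h = np.maximum(W_up[rows] @ x, 0.0)   # ReLU over one cluster
        out += W_down[:, rows] @ h            # accumulate its contribution
    return out

y = ffn_sparse(rng.standard_normal(HIDDEN).astype(np.float32))

Because inactive clusters are skipped entirely, their weights never need to be resident in memory, which is what allows weights to be streamed from storage on demand.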

The implementation and evaluation of PowerInfer-2 show that it supports a wide array of LLMs on two smartphones, achieving up to a 29.2x speedup over state-of-the-art frameworks. It is the first system to serve the TurboSparse-Mixtral-47B model on a smartphone, at a generation rate of 11.68 tokens per second. For models that fit entirely in memory, PowerInfer-2 reduces memory usage by approximately 40% while maintaining inference speeds comparable to llama.cpp and MLC-LLM.

Key takeaways:

  • The paper introduces PowerInfer-2, a framework for high-speed LLM inference on smartphones, designed especially for models that exceed the device's memory capacity.
  • PowerInfer-2 exploits the heterogeneous compute, memory, and I/O resources of smartphones by decomposing traditional matrix computations into fine-grained neuron-cluster computations.
  • A polymorphic neuron engine adapts the computational strategy to each stage of LLM inference, while segmented neuron caching and neuron-cluster-level pipelining minimize and hide I/O overhead (see the pipeline sketch after this list).
  • Evaluated on two smartphones across a wide array of LLMs, PowerInfer-2 achieves up to a 29.2x speedup over state-of-the-art frameworks and, for models that fit in memory, reduces memory usage by roughly 40% while matching the inference speed of llama.cpp and MLC-LLM.
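The caching-and-pipelining takeaway above is easiest to picture as a producer-consumer pipeline: an I/O thread streams neuron-cluster weights from flash into a small bounded buffer while the compute thread consumes clusters that have already arrived, so storage reads are hidden behind matrix work. Below is a toy sketch under assumed names and sizes; read_from_flash simulates storage latency with a sleep, whereas PowerInfer-2's actual pipeline is built around UFS I/O characteristics and its segmented cache.

import queue
import threading
import time

import numpy as np

CLUSTERS, CLUSTER_ROWS, HIDDEN = 16, 64, 1024   # toy sizes (assumed)

rng = np.random.default_rng(1)
# Simulated flash store: one weight block per neuron cluster.
flash = {c: rng.standard_normal((CLUSTER_ROWS, HIDDEN)).astype(np.float32)
         for c in range(CLUSTERS)}

def read_from_flash(cluster_id):
    time.sleep(0.002)                 # stand-in for a storage read
    return flash[cluster_id]

def prefetcher(order, q):
    """I/O thread: push cluster weights into a bounded queue so reads
    overlap with the compute thread's matrix work."""
    for c in order:
        q.put((c, read_from_flash(c)))
    q.put(None)                       # end-of-stream sentinel

def run_pipeline(x):
    q = queue.Queue(maxsize=4)        # small buffer bounds memory use
    order = list(range(CLUSTERS))     # in the demo, every cluster is "active"
    threading.Thread(target=prefetcher, args=(order, q), daemon=True).start()
    out = np.zeros(CLUSTERS * CLUSTER_ROWS, dtype=np.float32)
    while (item := q.get()) is not None:
        c, w = item                   # compute while the next read is in flight
        out[c * CLUSTER_ROWS:(c + 1) * CLUSTER_ROWS] = w @ x
    return out

y = run_pipeline(rng.standard_normal(HIDDEN).astype(np.float32))

The bounded queue is the key design choice here: it caps how many cluster weights are in flight at once, trading a little parallelism for a hard memory ceiling, which matters on a phone.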