The authors have implemented their LLM inference solution on Intel GPUs and made it publicly available. Compared with the standard HuggingFace implementation, it achieves up to 7x lower token latency and 27x higher throughput for some popular LLMs.
Key takeaways:
- The paper proposes an efficient Large Language Model (LLM) inference solution with low latency and high throughput.
- The LLM decoder layer is simplified by fusing data movement and element-wise operations, which reduces memory access frequency and lowers system latency (see the fusion sketch after this list).
- A segment KV cache policy is proposed for effective device memory management; it increases the runtime batch size and improves system throughput (see the cache sketch after this list).
- Implemented on an Intel GPU, the proposed solution achieves up to 7x lower token latency and 27x higher throughput than the standard HuggingFace implementation for some popular LLMs.
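
To make the fusion idea concrete, here is a minimal PyTorch sketch of fusing a chain of element-wise operations (a residual add followed by RMSNorm) so that intermediates stay on-chip instead of round-tripping through device memory. This is not the authors' kernel code; it only illustrates the general effect that fusing data movement and element-wise ops has on memory traffic, here via `torch.compile` rather than hand-written Intel GPU kernels.

```python
import torch

def residual_rmsnorm(x, residual, weight, eps=1e-6):
    # Run eagerly, this chain of element-wise ops makes several passes over
    # the tensor, writing intermediates (h, variance, ...) to device memory.
    h = x + residual
    variance = h.pow(2).mean(-1, keepdim=True)
    return weight * (h * torch.rsqrt(variance + eps))

# A compiler such as torch.compile can fuse the chain into far fewer kernels,
# keeping intermediates in registers; hand-fused kernels push this further.
residual_rmsnorm_fused = torch.compile(residual_rmsnorm)

x, residual = torch.randn(8, 4096), torch.randn(8, 4096)
weight = torch.ones(4096)
out = residual_rmsnorm_fused(x, residual, weight)
```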
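
Below is a toy sketch of the segment idea for the KV cache: instead of reserving a full max-length K/V buffer per request, device memory is carved into fixed-size segments that are handed to sequences on demand and returned as soon as a sequence finishes, so more requests fit in the running batch. The class name `SegmentKVCache`, the fixed segment size, and the bookkeeping details are assumptions for illustration, not the paper's actual policy.

```python
import torch

class SegmentKVCache:
    """Toy segment-based KV cache (illustrative only).

    Device memory is split into fixed-size segments that are assigned to
    sequences on demand, rather than pre-allocating a max-length buffer
    per request.
    """

    def __init__(self, num_segments, segment_len, num_heads, head_dim, device="cpu"):
        self.segment_len = segment_len
        # One shared pool: [segment, K-or-V, position-in-segment, head, head_dim]
        self.pool = torch.zeros(num_segments, 2, segment_len, num_heads, head_dim,
                                device=device)
        self.free = list(range(num_segments))   # unassigned segment ids
        self.seq_segments = {}                  # seq_id -> [segment ids]
        self.seq_len = {}                       # seq_id -> cached token count

    def append(self, seq_id, k, v):
        """Store one new token's K/V (each of shape [num_heads, head_dim])."""
        pos = self.seq_len.get(seq_id, 0)
        if pos % self.segment_len == 0:
            # Sequence is new or its last segment is full: claim a free segment.
            self.seq_segments.setdefault(seq_id, []).append(self.free.pop())
        seg = self.seq_segments[seq_id][-1]
        off = pos % self.segment_len
        self.pool[seg, 0, off] = k
        self.pool[seg, 1, off] = v
        self.seq_len[seq_id] = pos + 1

    def release(self, seq_id):
        """Return a finished sequence's segments to the pool so that new
        requests can join the running batch immediately."""
        self.free.extend(self.seq_segments.pop(seq_id, []))
        self.seq_len.pop(seq_id, None)


# Example: two sequences share the same pool of segments.
cache = SegmentKVCache(num_segments=16, segment_len=8, num_heads=4, head_dim=64)
for t in range(10):
    cache.append("req-0", torch.randn(4, 64), torch.randn(4, 64))
cache.append("req-1", torch.randn(4, 64), torch.randn(4, 64))
cache.release("req-0")   # its two segments become free for other requests
```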