
GitHub - kvcache-ai/Mooncake

Jun 29, 2024 - github.com
Mooncake is the serving platform for Kimi, a leading Large Language Model (LLM) service provided by Moonshot AI. The platform features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters and leverages underutilized CPU, DRAM, and SSD resources of the GPU cluster to implement a disaggregated KVCache. The core of Mooncake is its KVCache-centric scheduler, which balances maximizing overall effective throughput against meeting latency-related Service Level Objectives (SLOs).
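To make the scheduling trade-off concrete, here is a minimal sketch of cache-aware placement. It is not Mooncake's actual scheduler: the `PrefillInstance` fields, the linear prefill cost model, and the function names are all assumptions chosen for illustration. The idea shown is that an instance already holding part of a request's KVCache has less to compute, so it can beat a less loaded instance on estimated time to first token.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PrefillInstance:
    # All fields are hypothetical, chosen only for this illustration.
    name: str
    queued_seconds: float      # estimated wait before new work can start
    cached_prefix_tokens: int  # prompt tokens whose KVCache is already local


def estimate_ttft(inst: PrefillInstance, prompt_tokens: int,
                  seconds_per_token: float) -> float:
    """Estimated time to first token: queueing delay plus prefill of uncached tokens.

    Assumes prefill cost is linear in the number of uncached tokens,
    which is a simplification.
    """
    uncached = max(prompt_tokens - inst.cached_prefix_tokens, 0)
    return inst.queued_seconds + uncached * seconds_per_token


def pick_instance(instances: List[PrefillInstance], prompt_tokens: int,
                  seconds_per_token: float,
                  ttft_slo: float) -> Optional[PrefillInstance]:
    """Pick the instance with the lowest estimated TTFT, or None if no
    instance is expected to meet the latency SLO."""
    best = min(instances,
               key=lambda i: estimate_ttft(i, prompt_tokens, seconds_per_token))
    if estimate_ttft(best, prompt_tokens, seconds_per_token) > ttft_slo:
        return None
    return best


instances = [
    PrefillInstance("idle-no-cache", queued_seconds=0.5, cached_prefix_tokens=0),
    PrefillInstance("busy-warm-cache", queued_seconds=1.0, cached_prefix_tokens=8000),
]
choice = pick_instance(instances, prompt_tokens=10000,
                       seconds_per_token=0.001, ttft_slo=5.0)
print(choice.name if choice else "no instance meets the SLO")
```

In this example the busier instance wins because reusing 8,000 cached tokens saves more time than its longer queue costs.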

Mooncake faces challenges in highly overloaded scenarios; to mitigate them, a prediction-based early rejection policy has been developed. Experiments show that Mooncake excels in long-context scenarios, achieving up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs. Under real workloads, Mooncake’s innovative architecture enables Kimi to handle 75% more requests.
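One way to picture a prediction-based early rejection policy is the sketch below. It is an illustration under stated assumptions, not Mooncake's implementation: the `ClusterState` fields, the fixed decode capacity, and the load model are hypothetical. The point it shows is that the admission decision uses the decode load predicted for the moment the new request would finish prefill, rather than the load at arrival time, so a request is rejected before prefill resources are spent on it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterState:
    # Hypothetical snapshot of the cluster, for illustration only.
    decode_capacity: int            # max requests decoding concurrently
    decode_remaining: List[float]   # seconds left for each decoding request
    prefill_remaining: List[float]  # seconds left in prefill for admitted requests


def predict_decode_load(state: ClusterState, horizon: float) -> int:
    """Predict how many requests will be decoding `horizon` seconds from now.

    Simplification: requests that move from prefill to decode within the
    horizon are assumed to still be decoding at the horizon.
    """
    still_decoding = sum(1 for t in state.decode_remaining if t > horizon)
    arrived_from_prefill = sum(1 for t in state.prefill_remaining if t <= horizon)
    return still_decoding + arrived_from_prefill


def should_admit(state: ClusterState, prefill_seconds: float) -> bool:
    """Admit only if the decode stage is predicted to have room for this
    request when its own prefill completes."""
    predicted = predict_decode_load(state, prefill_seconds)
    return predicted + 1 <= state.decode_capacity


state = ClusterState(decode_capacity=3,
                     decode_remaining=[1.0, 10.0],
                     prefill_remaining=[2.0])
print(should_admit(state, prefill_seconds=5.0))
```

A check against the current decode load alone would ignore both the request that finishes in one second and the one about to arrive from prefill; the prediction accounts for both.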

Key takeaways:

  • Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
  • Mooncake features a KVCache-centric disaggregated architecture that separates the prefill and decoding clusters, leveraging underutilized CPU, DRAM, and SSD resources of the GPU cluster.
  • The core of Mooncake is its KVCache-centric scheduler, which balances maximizing overall effective throughput against meeting latency-related Service Level Objectives (SLOs).
  • Experiments show that Mooncake excels in long-context scenarios, achieving up to a 525% increase in throughput in certain simulated scenarios while adhering to SLOs, and enabling Kimi to handle 75% more requests under real workloads.