The authors argue that continuous batching can make LLM serving more cost-effective and accessible for real-world applications. They provide an example of how to implement continuous batching with vLLM and Ray Serve, and announce plans to integrate continuous batching systems into Aviary, a web application for comparing the outputs of different LLMs. The article concludes by inviting readers to join the Ray community and attend the upcoming Ray Summit to learn more about using Ray for LLM applications.
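For orientation, here is a minimal sketch of what submitting a batch of prompts to vLLM can look like. The model name, prompts, and sampling settings are illustrative placeholders, and the exact API may differ between vLLM releases; the point is that the engine applies continuous batching to the submitted requests internally, so no batching code is needed on the caller's side.

```python
# Minimal sketch of batched generation with vLLM (assumes `pip install vllm`).
# Model name and sampling settings below are placeholders, not recommendations.
from vllm import LLM, SamplingParams

prompts = [
    "Explain continuous batching in one sentence.",
    "Why is KV-cache memory a bottleneck for LLM serving?",
    "List two benefits of iteration-level scheduling.",
]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

# Any causal LM supported by vLLM works here; the engine schedules the
# prompts with continuous batching under the hood.
llm = LLM(model="facebook/opt-125m")

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```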
Key takeaways:
- Large Language Models (LLMs) have a large GPU memory footprint and a high compute cost, making serving a dominant expense for most real-world applications.
- Continuous batching, also known as dynamic batching or iteration-level scheduling, is a recently proposed optimization that makes a significant difference in real-world workloads, delivering up to 23x higher throughput.
- Continuous batching improves throughput by wasting fewer opportunities to schedule new requests, and improves latency by being able to inject new requests into the compute stream immediately (a simplified scheduling loop illustrating this appears after this list).
- vLLM, an open-source project, builds upon Orca's continuous batching design and additionally takes full control of dynamic memory allocation, allowing it to significantly reduce several forms of GPU memory fragmentation.
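To make the scheduling idea in these takeaways concrete, below is a deliberately simplified, framework-free sketch of an iteration-level scheduling loop. All names here (`decode_step`, `continuous_batching_loop`, `MAX_BATCH`) are hypothetical, and memory management is omitted entirely; the sketch only illustrates how finished sequences free their slots and waiting requests join the batch between decoding iterations, instead of the whole batch draining before new work is admitted.

```python
import random
from collections import deque

MAX_BATCH = 4  # hypothetical cap on how many sequences decode together


def make_request(req_id, max_tokens):
    """A toy request: an id, a token budget, and the tokens produced so far."""
    return {"id": req_id, "max_tokens": max_tokens, "tokens": []}


def decode_step(batch):
    """Stand-in for one model forward pass: appends one 'token' per sequence.
    A real engine would run the transformer over the whole batch here."""
    for seq in batch:
        seq["tokens"].append(random.randint(0, 31999))


def finished(seq):
    """A toy stopping rule: the sequence reaches its token budget (a real
    system would also stop on an end-of-sequence token)."""
    return len(seq["tokens"]) >= seq["max_tokens"]


def continuous_batching_loop(waiting: deque):
    running = []
    while running or waiting:
        # Iteration-level scheduling: before every decode step, retire finished
        # sequences and immediately backfill the freed slots from the queue,
        # rather than waiting for the entire batch to complete (static batching).
        running = [seq for seq in running if not finished(seq)]
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.popleft())
        if running:
            decode_step(running)


# Usage: ten requests with different generation lengths share the batch slots.
requests = deque(make_request(i, max_tokens=random.randint(2, 8)) for i in range(10))
continuous_batching_loop(requests)
```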