The article also presents a comparison of Punica's text generation throughput against other systems: HuggingFace Transformers, DeepSpeed, FasterTransformer, and vLLM, across different settings of LoRA model popularity. Punica outperforms all of them, achieving 12x the throughput. The authors encourage readers to read their paper, "Punica: Multi-Tenant LoRA Serving," for a more in-depth treatment.
Key takeaways:
- Punica is a system that enables running multiple Low-Rank Adaptation (LoRA) finetuned models at roughly the cost of running one.
- LoRA is a parameter-efficient way to add new knowledge to a pretrained Large Language Model (LLM): each finetuned model adds only about 1% storage and memory overhead (see the sketch after this list).
- Punica uses a custom CUDA kernel, Segmented Gather Matrix-Vector multiplication (SGMV), to efficiently compute the LoRA addon (the right-hand-side term of the LoRA equation) for a batch of requests that may each use a different LoRA model, preserving the strong batching effect (see the SGMV sketch below).
- Compared to other systems such as HuggingFace Transformers, DeepSpeed, FasterTransformer, and vLLM, Punica achieves 12x the throughput, making it significantly more efficient for multi-tenant LoRA serving.
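To make the LoRA overhead claim concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer. The shapes (`d = 4096`, `r = 16`) are illustrative assumptions, not values taken from the article:

```python
import numpy as np

d, r = 4096, 16  # hidden size d and LoRA rank r (r << d); illustrative values

W = np.random.randn(d, d).astype(np.float32)  # frozen pretrained weight
A = np.random.randn(d, r).astype(np.float32)  # per-finetune LoRA factor
B = np.random.randn(r, d).astype(np.float32)  # per-finetune LoRA factor

x = np.random.randn(1, d).astype(np.float32)  # one input activation
y = x @ W + x @ A @ B  # base output plus the low-rank LoRA addon

# Only A and B are stored per finetune: 2*d*r vs d*d parameters,
# i.e. 2*r/d = 32/4096 ~ 0.8%, consistent with the ~1% overhead claim.
```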
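And here is a reference-semantics sketch of what SGMV computes. The actual Punica kernel is a fused CUDA implementation; the function name, argument layout, and batch sizes below are assumptions for illustration only:

```python
import numpy as np

def sgmv(x, weights, seg_starts):
    """Reference semantics of Segmented Gather Matrix-Vector multiplication.

    Contiguous row segments of `x` each use their own weight matrix, so
    requests for different LoRA models can still be served in one batch.
    (Hypothetical helper, not Punica's API.)
    """
    y = np.zeros((x.shape[0], weights[0].shape[1]), dtype=x.dtype)
    for i, w in enumerate(weights):
        lo, hi = seg_starts[i], seg_starts[i + 1]
        y[lo:hi] = x[lo:hi] @ w  # one matmul per LoRA segment
    return y

# Batch of 5 requests: the first 2 use LoRA model 0, the last 3 model 1.
d, r = 4096, 16
x = np.random.randn(5, d).astype(np.float32)
A = [np.random.randn(d, r).astype(np.float32) for _ in range(2)]
B = [np.random.randn(r, d).astype(np.float32) for _ in range(2)]
seg = [0, 2, 5]
addon = sgmv(sgmv(x, A, seg), B, seg)  # per-segment x @ A_i @ B_i
```

The LoRA addon for a mixed batch thus reduces to two SGMV calls (one for the `A` factors, one for the `B` factors), which is what lets Punica keep batching requests even when they target different LoRA models.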