GitHub - punica-ai/punica: Serving multiple LoRA finetuned LLM as one

Nov 08, 2023 - github.com
The article discusses Punica, a system for serving multiple Low-Rank Adaptation (LoRA) finetuned models at roughly the cost of running one. LoRA is a parameter-efficient method of adding new knowledge to a pretrained Large Language Model (LLM): while a pretrained LLM requires hundreds of GB of storage, a LoRA finetuned model adds only about 1% storage and memory overhead by attaching two small low-rank matrices to the weights of the pretrained model. Punica serves many such models efficiently by computing the LoRA addon with a custom CUDA kernel called Segmented Gather Matrix-Vector multiplication (SGMV).

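As a rough illustration of that setup, the sketch below applies a LoRA addon on top of a frozen pretrained weight in plain PyTorch. The variable names and toy shapes are illustrative assumptions for this summary, not Punica's actual API or code.

```python
import torch

# d: hidden size of the pretrained layer, r: LoRA rank (r << d)
d, r = 4096, 16

W = torch.randn(d, d)          # frozen pretrained weight (shared by all finetunes)
A = torch.randn(r, d) * 0.01   # LoRA "down" projection (trained per finetune)
B = torch.zeros(d, r)          # LoRA "up" projection, zero-init so the addon starts as a no-op

def lora_forward(x: torch.Tensor) -> torch.Tensor:
    """y = x W^T + (x A^T) B^T: pretrained path plus low-rank addon."""
    return x @ W.T + (x @ A.T) @ B.T

y = lora_forward(torch.randn(1, d))   # shape (1, 4096)
```

Only A and B need to be stored per finetuned model, which is where the roughly 1% overhead figure comes from.
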
The article also compares Punica's text generation throughput against HuggingFace Transformers, DeepSpeed, FasterTransformer, and vLLM, benchmarked across different settings of LoRA model popularity. Punica outperforms these systems, achieving up to 12x the throughput. The authors point readers to their paper, "Punica: Multi-Tenant LoRA Serving," for a more in-depth treatment.

Key takeaways:

  • Punica is a system that serves multiple Low-Rank Adaptation (LoRA) finetuned models at roughly the cost of running one.
  • LoRA is a parameter-efficient way to add new knowledge to a pretrained Large Language Model (LLM); a finetuned model adds only about 1% storage and memory overhead.
  • Punica uses a custom CUDA kernel, Segmented Gather Matrix-Vector multiplication (SGMV), to efficiently compute the LoRA addon for a batch of requests that target different LoRA models, preserving the strong batching effect (see the sketch after this list).
  • Compared to systems such as HuggingFace Transformers, DeepSpeed, FasterTransformer, and vLLM, Punica achieves up to 12x the text generation throughput.
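
For intuition about what SGMV computes, below is a minimal, plain-PyTorch reference of its semantics: requests in a batch are grouped into contiguous segments by LoRA model, and each segment is multiplied by that model's LoRA matrix, so the whole batch is handled in one pass. This is only a sketch of the operation's meaning; the function name, signature, and Python loop are assumptions, and the real SGMV is a fused CUDA kernel.

```python
import torch

def sgmv_reference(x, lora_weights, segment_starts):
    """Reference (non-CUDA) semantics of a segmented gather matrix-vector multiply.

    x:              (batch, d_in) stacked inputs from all requests in the batch
    lora_weights:   list of (d_in, d_out) LoRA matrices, one per finetuned model
    segment_starts: segment i (requests of model i) covers rows
                    segment_starts[i] : segment_starts[i + 1]
    """
    out = x.new_empty(x.shape[0], lora_weights[0].shape[1])
    for i, W_i in enumerate(lora_weights):
        s, e = segment_starts[i], segment_starts[i + 1]
        out[s:e] = x[s:e] @ W_i   # each segment uses its own LoRA weight
    return out

# A batch of 5 requests: the first 3 use LoRA model 0, the last 2 use LoRA model 1.
d_in, rank = 4096, 16
x = torch.randn(5, d_in)
weights = [torch.randn(d_in, rank), torch.randn(d_in, rank)]
y = sgmv_reference(x, weights, [0, 3, 5])   # shape (5, 16)
```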