DeciLM's architecture features a unique implementation of variable Grouped-Query Attention (GQA), which optimizes the tradeoff between inference efficiency and model quality. The architecture was generated by AutoNAC, Deci's proprietary Neural Architecture Search (NAS) engine. DeciLM was trained on a subset of the SlimPajama dataset and then fine-tuned with LoRA. Benchmarked against leading open-source models, it demonstrated superior memory efficiency and higher throughput. When combined with Infery-LLM, Deci's specialized inference SDK, DeciLM's throughput reaches 15 times that of Llama 2 7B on an NVIDIA A10G GPU, significantly reducing inference costs and carbon footprint. DeciLM is available for free download under the Llama 2 Community License Agreement.
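To make "variable GQA" concrete: in standard GQA, every layer shares one fixed ratio of query heads to key-value heads, whereas a variable scheme picks the number of KV heads per layer. The PyTorch sketch below illustrates the idea only; it is not DeciLM's actual implementation, and the dimensions, layer count, and per-layer KV-head schedule are made-up values for illustration.

```python
# Minimal sketch of variable Grouped-Query Attention (GQA): each layer
# chooses its own number of key-value heads. Hypothetical, not DeciLM's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, num_heads * self.head_dim, bias=False)
        # K/V projections shrink with fewer KV heads -> smaller KV cache
        # and less memory traffic per layer.
        self.k_proj = nn.Linear(dim, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(num_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Each KV head serves a group of query heads.
        groups = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(groups, dim=1)
        v = v.repeat_interleave(groups, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

# "Variable" GQA: a per-layer KV-head schedule (values here are invented).
kv_heads_per_layer = [4, 4, 2, 2, 1, 1]
layers = nn.ModuleList(
    GQAttention(dim=2048, num_heads=32, num_kv_heads=g) for g in kv_heads_per_layer
)
```

The intuition is that layers differ in how much they suffer from sharing KV heads, so letting a search engine like AutoNAC assign more KV heads where quality demands it and fewer where it doesn't can cut KV-cache size and memory bandwidth (the main throughput bottlenecks at inference) with minimal quality loss.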
Key takeaways:
- DeciLM 6B is a new Large Language Model (LLM) with 5.7 billion parameters that delivers 15 times the throughput of Llama 2 7B, setting a new benchmark for inference efficiency and speed.
- DeciLM's unique architecture, generated by AutoNAC, Deci's Neural Architecture Search engine, uses variable Grouped-Query Attention to strike an optimal balance between inference speed and the quality of the model's outputs.
- When combined with Infery-LLM, Deci's inference SDK, DeciLM 6B's throughput is 15 times that of Llama 2 7B on an NVIDIA A10G GPU, leading to a significant reduction in inference costs and carbon footprint.
- DeciLM is being released to the open-source community under the Llama 2 Community License Agreement, making it freely accessible to researchers, developers, and enthusiasts.