DeciLM's architecture features a unique implementation of variable Grouped-Query Attention (GQA), which optimizes the tradeoff between inference efficiency and model quality. The architecture was generated by AutoNAC, Deci's proprietary Neural Architecture Search (NAS) engine. DeciLM was trained on a subset of the SlimPajama dataset and then fine-tuned with LoRA. Benchmarked against leading open-source models, it demonstrated superior memory efficiency and higher throughput. When combined with Infery-LLM, Deci's specialized inference SDK, DeciLM's throughput reaches 15 times that of Llama 2 7B on an NVIDIA A10G GPU, significantly reducing inference costs and carbon footprint. DeciLM is available for free download under the Llama 2 Community License Agreement.
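To make "variable GQA" concrete: in standard GQA, every layer shares one fixed ratio of query heads to key-value heads, whereas a variable scheme picks the number of KV heads per layer. The PyTorch sketch below illustrates the idea only; it is not DeciLM's actual implementation, and the dimensions, layer count, and per-layer KV-head schedule are made-up values for illustration.

```python
# Minimal sketch of variable Grouped-Query Attention (GQA): each layer
# chooses its own number of key-value heads. Hypothetical, not DeciLM's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, num_heads * self.head_dim, bias=False)
        # K/V projections shrink with fewer KV heads -> smaller KV cache
        # and less memory traffic per layer.
        self.k_proj = nn.Linear(dim, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(num_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Each KV head serves a group of query heads.
        groups = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(groups, dim=1)
        v = v.repeat_interleave(groups, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

# "Variable" GQA: a per-layer KV-head schedule (values here are invented).
kv_heads_per_layer = [4, 4, 2, 2, 1, 1]
layers = nn.ModuleList(
    GQAttention(dim=2048, num_heads=32, num_kv_heads=g) for g in kv_heads_per_layer
)
```

The intuition is that layers differ in how much they suffer from sharing KV heads, so letting a search engine like AutoNAC assign more KV heads where quality demands it and fewer where it doesn't can cut KV-cache size and memory bandwidth (the main throughput bottlenecks at inference) with minimal quality loss.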
Key takeaways:
- DeciLM 6B is a new Large Language Model (LLM) with 5.7 billion parameters that delivers 15 times the throughput of Llama 2 7B, setting a new benchmark for inference efficiency and speed.
- DeciLM's unique architecture, generated by AutoNAC, Deci's Neural Architecture Search engine, uses variable Grouped-Query Attention to strike an optimal balance between inference speed and the quality of the model's outputs.
- When combined with Infery-LLM, Deci's inference SDK, DeciLM 6B's throughput is 15 times that of Llama 2 7B on an NVIDIA A10G GPU, leading to a significant reduction in inference costs and carbon footprint.
- DeciLM is being released to the open-source community under the Llama 2 Community License Agreement, making it freely accessible to researchers, developers, and enthusiasts.