The authors also explore various architectural decisions, such as how Transformer and Mamba layers are combined and how mixture-of-experts is applied, demonstrating their importance in large-scale modeling. They highlight several interesting properties of these architectures revealed through the training and evaluation of Jamba. The authors plan to release checkpoints from various ablation runs to encourage further exploration of this novel architecture, and have made the weights of their implementation of Jamba publicly available under a permissive license.
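As a rough illustration of this hybrid layout, the sketch below interleaves one attention layer with several Mamba-style layers and applies a toy top-k mixture-of-experts MLP on alternate layers. This is a minimal sketch under stated assumptions, not the released implementation: the `PlaceholderMamba` module, the layer ratios (`attn_every`, `moe_every`), the expert counts, and the model dimensions are all illustrative stand-ins.

```python
import torch
import torch.nn as nn


class PlaceholderMamba(nn.Module):
    """Stand-in for a Mamba (selective state-space) layer; a real model
    would use an actual SSM implementation here."""

    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = nn.Linear(d_model, d_model)  # placeholder token mixer

    def forward(self, x):
        return self.mixer(self.norm(x))


class AttentionLayer(nn.Module):
    """Pre-norm multi-head self-attention (causal masking omitted for brevity)."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return out


class TopKMoE(nn.Module):
    """Toy mixture-of-experts MLP: each token uses only its top-k experts,
    so active parameters per token stay far below total parameters."""

    def __init__(self, d_model, d_ff, n_experts=8, k=2):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):
        h = self.norm(x)
        weights, idx = self.router(h).topk(self.k, dim=-1)  # (B, T, k)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(h)
        # Dense loop over experts for clarity; real MoE kernels dispatch sparsely.
        for e, expert in enumerate(self.experts):
            gate = ((idx == e).float() * weights).sum(-1, keepdim=True)
            out = out + gate * expert(h)
        return out


class HybridBlock(nn.Module):
    """A block of interleaved token-mixing layers (mostly Mamba-style, one
    attention), each followed by an MLP; every other MLP is a sparse MoE."""

    def __init__(self, d_model=512, d_ff=2048, n_heads=8,
                 n_layers=8, attn_every=8, moe_every=2):
        super().__init__()
        self.mixers = nn.ModuleList([
            AttentionLayer(d_model, n_heads) if i % attn_every == 0
            else PlaceholderMamba(d_model)
            for i in range(n_layers)
        ])
        self.mlps = nn.ModuleList([
            TopKMoE(d_model, d_ff) if i % moe_every == 1
            else nn.Sequential(nn.LayerNorm(d_model),
                               nn.Linear(d_model, d_ff), nn.GELU(),
                               nn.Linear(d_ff, d_model))
            for i in range(n_layers)
        ])

    def forward(self, x):
        for mixer, mlp in zip(self.mixers, self.mlps):
            x = x + mixer(x)  # token mixing (attention or Mamba-style)
            x = x + mlp(x)    # channel mixing (dense MLP or sparse MoE)
        return x


x = torch.randn(2, 16, 512)    # (batch, sequence, d_model)
print(HybridBlock()(x).shape)  # torch.Size([2, 16, 512])
```

The point of the sparse MoE MLPs in a layout like this is that total parameters (capacity) grow with the number of experts while the compute and memory for a forward pass depend only on the few experts each token is routed to.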
Key takeaways:
- The authors introduce Jamba, a new base large language model built on a hybrid Transformer-Mamba mixture-of-experts (MoE) architecture.
- Jamba interleaves blocks of Transformer and Mamba layers and adds MoE in some of these layers, increasing model capacity while keeping active parameter usage manageable.
- The model provides high throughput and a small memory footprint compared to vanilla Transformers, while offering state-of-the-art performance on standard language-model benchmarks and long-context evaluations.
- The authors plan to release checkpoints from various ablation runs and have made the weights of their implementation of Jamba publicly available under a permissive license.
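Since the weights are publicly released, loading them with Hugging Face `transformers` might look like the sketch below. This is a hedged example, not official usage: the repository id `ai21labs/Jamba-v0.1` and the requirement of a recent `transformers` release with Jamba support are assumptions to verify against the official model card.

```python
# Minimal sketch of loading the released weights, assuming the repository id
# "ai21labs/Jamba-v0.1" and a transformers version that supports Jamba.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"  # assumed Hugging Face repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to reduce memory footprint
    device_map="auto",           # requires the `accelerate` package
)

prompt = "Hybrid Transformer-Mamba architectures are interesting because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```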