The system goes beyond classic language model APIs, allowing users to apply custom fine-tuning and sampling methods, execute custom paths through the model, and inspect its hidden states. It combines the comforts of an API with the flexibility of PyTorch. The project is part of the BigScience research workshop and has been featured on TechCrunch. Users can join the Discord or subscribe via email to follow the development of the project.
Key takeaways:
- Large language models like Llama 2, Stable Beluga 2, Guanaco-65B, or BLOOM-176B can be run collaboratively, with each participant loading a small part of the model.
- Single-batch inference runs at up to 6 steps/sec for Llama 2 and around 1 step/sec for BLOOM, which is up to 10x faster than offloading, enabling the creation of chatbots and interactive apps.
- The system goes beyond classic language model APIs, letting users apply custom fine-tuning and sampling methods, execute custom paths through the model, and inspect its hidden states.
- The project is part of the BigScience research workshop and offers the comforts of an API with the flexibility of PyTorch.
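The collaborative setup described above, where each participant loads only a small slice of the model and activations are routed through them in sequence, can be sketched in plain Python. This is a conceptual illustration only: the class and function names below are made up for the example and are not the project's actual API, and the "layers" are toy functions standing in for transformer blocks.

```python
class Participant:
    """Holds a contiguous slice of the model's layers (illustrative name)."""

    def __init__(self, layers):
        self.layers = layers  # list of callables standing in for transformer blocks

    def forward(self, x):
        # Run the activations through this participant's local slice.
        for layer in self.layers:
            x = layer(x)
        return x


def run_inference(participants, x):
    """Pass activations through each participant in order,
    like a pipeline over the network."""
    for p in participants:
        x = p.forward(x)
    return x


# Toy "model": six layers that each add 1, split across three participants.
layers = [lambda v: v + 1 for _ in range(6)]
swarm = [
    Participant(layers[0:2]),
    Participant(layers[2:4]),
    Participant(layers[4:6]),
]
print(run_inference(swarm, 0))  # input 0 passes through six +1 layers -> 6
```

In the real system the participants are remote servers and the per-step latency comes from network hops between them, which is why single-batch inference speed is reported in steps per second.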