Outerport is a caching system for model weights that keeps read-only models in pinned RAM for fast loading onto the GPU, and maintains a cache hierarchy spanning S3, local SSD, RAM, and GPU memory to reduce data transfer costs and balance load. This lets a single GPU machine be 'multi-tenant': multiple services with different models can run on the same hardware. Initial simulation results show that Outerport can achieve a 40% reduction in GPU running-time costs by smoothing out traffic peaks and enabling more effective horizontal scaling. The developers plan to release much of the system under an open-core model and are exploring further developments such as more sophisticated compression algorithms and a central platform for model management and governance.
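To make the pinned-RAM point concrete, here is a minimal sketch (in PyTorch, not Outerport's actual code) of why holding read-only weights in page-locked host memory speeds up loading onto the GPU: pinned memory permits asynchronous DMA copies, so weights can stream to the device without an extra staging copy. The function names and the `model.pt` path are illustrative assumptions.

```python
import torch

def cache_in_pinned_ram(state_dict: dict) -> dict:
    """Copy each weight tensor into page-locked (pinned) host memory once, up front."""
    return {name: t.pin_memory() for name, t in state_dict.items()}

def load_to_gpu(pinned_state: dict, device: str = "cuda") -> dict:
    """Move pinned tensors onto the GPU with non-blocking (asynchronous) copies."""
    return {name: t.to(device, non_blocking=True) for name, t in pinned_state.items()}

# Hypothetical usage: keep `pinned` resident across requests, push to GPU on demand.
# pinned = cache_in_pinned_ram(torch.load("model.pt", map_location="cpu"))
# gpu_weights = load_to_gpu(pinned)
# torch.cuda.synchronize()  # wait for the async copies to finish
```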
Key takeaways:
- Outerport is a distribution network for AI model weights that enables 'hot-swapping' of models to save on GPU costs, with swap times 150x faster than baseline.
- Outerport is a caching system for model weights, maintaining a cache hierarchy spanning S3, local SSD, RAM, and GPU memory to reduce data transfer costs and balance load (see the sketch after this list).
- 'Hot-swapping' allows a single GPU machine to be 'multi-tenant', meaning multiple services with different models can run on the same machine, which facilitates A/B testing or serving different endpoints from shared hardware.
- Initial simulation results show that Outerport can achieve a 40% reduction in GPU running-time costs by smoothing out traffic peaks and enabling more effective horizontal scaling.
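As referenced in the caching takeaway above, the sketch below illustrates one plausible shape of the S3 → local SSD → RAM lookup, promoting weights to the hotter tier on each miss so later swaps avoid repeated transfers. Everything here (class and method names, the cache directory) is a hypothetical illustration, not Outerport's API.

```python
from pathlib import Path

SSD_DIR = Path("/var/cache/model-weights")  # assumed local SSD cache directory


class TieredWeightCache:
    """Hypothetical tiered lookup: RAM, then local SSD, then S3, promoting on miss."""

    def __init__(self) -> None:
        self.ram: dict[str, bytes] = {}  # hottest tier: weight blobs held in RAM

    def get(self, model_id: str) -> bytes:
        if model_id in self.ram:                      # 1. RAM hit: no I/O at all
            return self.ram[model_id]
        ssd_path = SSD_DIR / f"{model_id}.bin"
        if ssd_path.exists():                         # 2. SSD hit: local read only
            blob = ssd_path.read_bytes()
        else:                                         # 3. Miss: pay the S3 transfer once
            blob = self.fetch_from_s3(model_id)
            ssd_path.parent.mkdir(parents=True, exist_ok=True)
            ssd_path.write_bytes(blob)                # persist to SSD for next time
        self.ram[model_id] = blob                     # promote so the next swap is hot
        return blob

    def fetch_from_s3(self, model_id: str) -> bytes:
        raise NotImplementedError("placeholder for an S3 download (e.g. via boto3)")
```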