
Launch HN: Outerport (YC S24) – Instant hot-swapping for AI model weights

Aug 23, 2024 - news.bensbites.com
Outerport, a distribution network for AI model weights, was developed by Towaki and Allen to enable 'hot-swapping' of AI models and reduce GPU costs. The system lets different models be served on the same GPU machine, with swap times of approximately 2 seconds, roughly 150 times faster than the baseline of loading from storage. Outerport was created in response to the high cost of running AI models in the cloud, where GPUs are billed by usage time. The system is designed specifically for model weights running on GPUs, addressing the inefficient use of expensive hardware and the long start-up times caused by the large size of modern AI models.
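Outerport's internals are not public, so the following is only a minimal sketch of the hot-swapping idea: every model's weights stay resident in host RAM, and the one model a request needs is copied into a single GPU slot on demand. The `HotSwapServer` class, its method names, and the model names are all hypothetical.

```python
# Illustrative mock of hot-swapping, not Outerport's actual implementation.
# The "GPU" is modeled as a single slot; a real system would copy weights
# from pinned host RAM over PCIe instead of re-downloading from storage.

class HotSwapServer:
    def __init__(self):
        self.ram_cache = {}    # model name -> weights kept warm in host RAM
        self.gpu_model = None  # the one model currently resident on the GPU

    def register(self, name, weights):
        """Stage a model's weights in RAM so later swaps avoid disk/S3."""
        self.ram_cache[name] = weights

    def serve(self, name):
        """Swap the requested model onto the GPU if it isn't there already."""
        if self.gpu_model != name:
            if name not in self.ram_cache:
                raise KeyError(f"{name} is not staged in RAM")
            # Fast path: RAM -> GPU copy (~seconds) instead of a cold start
            # (download + deserialize, often minutes for large models).
            self.gpu_model = name
        return f"served by {name}"

server = HotSwapServer()
server.register("model-a", object())
server.register("model-b", object())
server.serve("model-a")  # cold GPU slot: copy model-a from RAM
server.serve("model-b")  # hot swap: model-b replaces model-a on the GPU
```

The point of the sketch is the asymmetry it encodes: the expensive path (fetching and deserializing weights) happens once at registration time, while request-time swaps are cheap RAM-to-GPU copies.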

Outerport is a caching system for model weights: read-only models are cached in pinned RAM for fast loading into the GPU, and a cache hierarchy spanning S3, local SSD, RAM, and GPU memory is maintained to reduce data transfer costs and support load balancing. The system makes a single GPU machine 'multi-tenant', meaning multiple services with different models can run on the same machine. Initial simulation results show that Outerport can achieve a 40% reduction in GPU running-time costs by smoothing out traffic peaks and enabling more effective horizontal scaling. The developers plan to release much of the system under an open-core model and are exploring further developments such as more sophisticated compression algorithms and a central platform for model management and governance.
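The cache hierarchy described above can be sketched as a lookup that walks tiers from fastest to slowest and promotes weights into every faster tier on a hit. This is an assumption-laden toy (plain dicts standing in for S3, SSD, RAM, and GPU memory; the `TieredWeightCache` name and `fetch` method are invented for illustration), not Outerport's API.

```python
# Toy model of a tiered weight cache. Each tier is a dict; a real system
# would manage object storage, files, pinned host buffers, and GPU memory.

class TieredWeightCache:
    TIERS = ["gpu", "ram", "ssd", "s3"]  # ordered fastest to slowest

    def __init__(self, s3_store):
        self.stores = {"gpu": {}, "ram": {}, "ssd": {}, "s3": dict(s3_store)}

    def fetch(self, name):
        """Return (weights, tier hit), promoting weights into faster tiers."""
        for i, tier in enumerate(self.TIERS):
            if name in self.stores[tier]:
                weights = self.stores[tier][name]
                for faster in self.TIERS[:i]:  # warm every faster tier
                    self.stores[faster][name] = weights
                return weights, tier
        raise KeyError(name)

cache = TieredWeightCache({"sdxl": b"weights"})
w1, hit1 = cache.fetch("sdxl")  # first access: found only in the S3 tier
w2, hit2 = cache.fetch("sdxl")  # promoted: now served from the GPU tier
```

The promotion step is what turns a one-time slow S3 fetch into subsequent near-instant loads, which is the behavior the 2-second swap times depend on.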

Key takeaways:

  • Outerport is a distribution network for AI model weights that enables 'hot-swapping' of AI models to save on GPU costs, with swap times roughly 150x faster than the baseline.
  • Outerport is a caching system for model weights, maintaining a cache hierarchy spanning S3, local SSD, RAM, and GPU memory to reduce data transfer costs and support load balancing.
  • 'Hot-swapping' makes a single GPU machine 'multi-tenant', so multiple services with different models can run on the same machine, for example for A/B testing or serving different endpoints.
  • Initial simulation results show that Outerport can achieve a 40% reduction in GPU running-time costs by smoothing out traffic peaks and enabling more effective horizontal scaling.