The response suggests that the model will likely need to be served on GPUs. A small model might be hosted on a regular server with CPU inference; a larger model could also run on a CPU, but it would be significantly slower. The ideal setup is GPU inference, either on always-on GPUs to avoid cold starts or on serverless GPUs, which incur startup time when a request comes in.
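As a rough illustration of the CPU-inference option, here is a minimal sketch assuming the fine-tuned model is a Hugging Face Transformers causal LM saved locally; the checkpoint path and prompt are hypothetical placeholders.

```python
# Minimal CPU-inference sketch; the model path below is a hypothetical example.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "./my-finetuned-model"  # hypothetical local checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)  # loads onto CPU by default
model.eval()

def generate(prompt: str, max_new_tokens: int = 128) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate("Summarize why GPU inference is faster than CPU inference."))
```

This works for a small model, but generation latency grows quickly with model size, which is why the larger-model cases below point toward GPUs.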
Key takeaways:
- The user has built a fine-tuned model and is seeking advice on how to deploy it and use it in an app.
- The model can potentially be hosted on a web server (a minimal serving sketch follows this list), but it may need to run on GPUs continuously depending on its size and complexity.
- If the model is small, it might be possible to host it on a regular server with CPU inference.
- For larger models, GPU inference is recommended, either on always-on GPUs or on serverless GPUs; the latter can involve cold-start times of around 10 seconds.
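The sketch below shows one way the model could be exposed from a web server so an app can call it over HTTP; it assumes the same hypothetical Transformers checkpoint as above, uses FastAPI purely as an example framework, and picks a GPU when one is available, falling back to CPU otherwise.

```python
# Serving sketch; model path, endpoint name, and request shape are illustrative,
# not a prescribed API.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "./my-finetuned-model"                       # hypothetical local checkpoint
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"   # GPU if present, else CPU

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH).to(DEVICE).eval()

app = FastAPI()

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 128

@app.post("/generate")
def generate(req: Prompt) -> dict:
    inputs = tokenizer(req.text, return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    return {"completion": tokenizer.decode(outputs[0], skip_special_tokens=True)}

# Run with: uvicorn server:app --host 0.0.0.0 --port 8000
```

Keeping this process on an always-on GPU instance avoids cold starts; the same code could run on a serverless GPU platform at the cost of the startup delay mentioned above.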