The author questions whether they might be missing something in their approach or whether their initial expectations were simply too high, and asks others to share experiences, tips, and tricks that could improve the performance of their self-hosted models.
Key takeaways:
- The author has been testing large models on various platforms and is disappointed by the slower-than-expected inference speeds.
- For reference, the author reports response times of ~1200 ms for gpt-3.5-turbo, ~1600 ms for gpt-4o, and ~5000 ms for llama-70b-instruct; see the latency-measurement sketch after this list.
- The author has been running these deployments on a standard 4x Nvidia A100 setup (320 GB of GPU memory in total, i.e. 4x 80 GB); see the tensor-parallel serving sketch below.
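The post does not say how these latencies were measured. Below is a minimal sketch of one plausible approach, assuming an OpenAI-compatible chat endpoint; the URL, model name, and prompt are placeholders, not details from the post:

```python
import time
import requests

# Hypothetical OpenAI-compatible endpoint; replace with your own server URL.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "llama-70b-instruct"  # placeholder model name

def measure_latency(prompt: str, runs: int = 5) -> float:
    """Return the mean end-to-end response time in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        resp = requests.post(
            BASE_URL,
            json={
                "model": MODEL,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 128,  # fix output length for fair comparison
            },
            timeout=60,
        )
        resp.raise_for_status()
        timings.append((time.perf_counter() - start) * 1000)
    return sum(timings) / len(timings)

if __name__ == "__main__":
    print(f"mean latency: {measure_latency('Say hello.'):.0f} ms")
```

Note that end-to-end response time scales with the number of generated tokens, so comparisons like those above are only meaningful when `max_tokens` and prompt length are held roughly constant across models.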
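The post also does not name a serving stack for the 4x A100 machine. One common setup for a 70B model on that hardware is vLLM with tensor parallelism across the four GPUs; the sketch below assumes that stack, and the model ID is a placeholder:

```python
# Sketch of serving a 70B model across 4 GPUs with vLLM (an assumption --
# the post does not name the serving stack). Model ID is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder 70B checkpoint
    tensor_parallel_size=4,       # shard weights across the 4 A100s
    gpu_memory_utilization=0.90,  # leave headroom for activations / KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Say hello."], params)
print(outputs[0].outputs[0].text)
```

Tensor parallelism splits each weight matrix across the four GPUs, which fits the 70B weights in memory but adds inter-GPU communication at every layer, one reason self-hosted latency can lag a heavily optimized hosted API.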