The author questions whether they might be missing something in their approach or whether their initial expectations were simply too high, and asks others to share experiences, tips, and tricks that could improve the performance of their self-hosted models.
Key takeaways:
- The author has been testing large models on various platforms and is disappointed by the slower-than-expected inference speeds.
- For reference, the author reports response times of ~1200 ms for gpt-3.5-turbo, ~1600 ms for gpt-4o, and ~5000 ms for llama-70b-instruct; see the latency-measurement sketch after this list.
- The author has been running these deployments on a standard 4x Nvidia A100 setup (320 GB of GPU memory in total, i.e. 4x 80 GB); see the tensor-parallel serving sketch below.
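The post does not say how these latencies were measured. Below is a minimal sketch of one plausible approach, assuming an OpenAI-compatible chat endpoint; the URL, model name, and prompt are placeholders, not details from the post:

```python
import time
import requests

# Hypothetical OpenAI-compatible endpoint; replace with your own server URL.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "llama-70b-instruct"  # placeholder model name

def measure_latency(prompt: str, runs: int = 5) -> float:
    """Return the mean end-to-end response time in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        resp = requests.post(
            BASE_URL,
            json={
                "model": MODEL,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 128,  # fix output length for fair comparison
            },
            timeout=60,
        )
        resp.raise_for_status()
        timings.append((time.perf_counter() - start) * 1000)
    return sum(timings) / len(timings)

if __name__ == "__main__":
    print(f"mean latency: {measure_latency('Say hello.'):.0f} ms")
```

Note that end-to-end response time scales with the number of generated tokens, so comparisons like those above are only meaningful when `max_tokens` and prompt length are held roughly constant across models.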
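The post also does not name a serving stack for the 4x A100 machine. One common setup for a 70B model on that hardware is vLLM with tensor parallelism across the four GPUs; the sketch below assumes that stack, and the model ID is a placeholder:

```python
# Sketch of serving a 70B model across 4 GPUs with vLLM (an assumption --
# the post does not name the serving stack). Model ID is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder 70B checkpoint
    tensor_parallel_size=4,       # shard weights across the 4 A100s
    gpu_memory_utilization=0.90,  # leave headroom for activations / KV cache
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Say hello."], params)
print(outputs[0].outputs[0].text)
```

Tensor parallelism splits each weight matrix across the four GPUs, which fits the 70B weights in memory but adds inter-GPU communication at every layer, one reason self-hosted latency can lag a heavily optimized hosted API.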