The author also details the experimental setup: a linear regression was fitted to estimate the latency per generated token. The measurements were taken in May 2023 over a 500 Mbit network in Estonia, using a paid OpenAI account and Azure's US-East endpoints. The post concludes with the best linear fits and the raw data points for each model.
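The post does not include the fitting code itself, but a minimal sketch of such a per-token fit might look like the following. The measurement values below are invented purely for illustration; the real raw data points are in the original post.

```python
# Hypothetical reconstruction of the latency fit (not the author's code).
# Model: latency = slope * n_tokens + intercept, where slope is the
# per-token latency and intercept absorbs fixed overhead (network, queuing).
import numpy as np

# Illustrative measurements only: (output token count, total latency in seconds).
n_tokens = np.array([16, 64, 128, 256, 512])
latency_s = np.array([1.4, 4.9, 9.6, 19.1, 37.8])

# polyfit with deg=1 returns [slope, intercept].
slope, intercept = np.polyfit(n_tokens, latency_s, deg=1)
print(f"per-token latency: {slope * 1000:.0f} ms, fixed overhead: {intercept:.2f} s")
```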
Key takeaways:
- The response time of GPT APIs largely depends on the number of output tokens generated by the model.
- Azure's GPT-3.5 model is more than twice as fast as OpenAI's GPT-3.5 model, with a latency of 34 ms per generated token compared to OpenAI's 73 ms.
- OpenAI's GPT-4 model is almost three times slower than its GPT-3.5 model, with a latency of 196 ms per generated token.
- To make GPT API responses faster, generate as few output tokens as possible (a minimal sketch of one way to do this follows below).
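This advice is not tied to any particular client library; as an illustration, here is a minimal sketch using the OpenAI Python client (the v1.x API, which is newer than the May 2023 experiments), showing the two usual levers: prompting for brevity and capping `max_tokens`.

```python
# A sketch of the "generate fewer tokens" advice, not the author's setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        # Asking for a terse answer reduces output tokens and thus latency.
        {"role": "system", "content": "Answer in one short sentence."},
        {"role": "user", "content": "Why is the sky blue?"},
    ],
    max_tokens=50,  # hard cap on output tokens; the response is cut off if exceeded
)
print(response.choices[0].message.content)
```

Note that `max_tokens` truncates rather than summarizes, so prompting the model for a concise answer is usually the better lever for output quality, with the cap as a safety net.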