The article also explains the methodology for measuring these metrics. Tools are run daily in multiple data centers; a warmup connection is made first to eliminate HTTP connection-setup latency; and the TTFT clock starts when the HTTP request is sent and stops when the first token arrives. The number of output tokens is capped at 20, and for each provider three separate inferences are performed, with the best result kept. The raw data, benchmarking tools, and website source code are all publicly available.
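The timing described above can be sketched as a small helper that clocks a token stream. This is a minimal illustration, not the site's actual benchmarking code: `measure_stream` and `fake_provider` are hypothetical names, and the fake provider stands in for a real streaming HTTP response.

```python
import time

def measure_stream(token_iter, max_tokens=20):
    """Clock a token stream: TTFT is the delay until the first token;
    TPS is tokens generated per second after the first token arrives."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_iter:
        count += 1
        if first is None:
            first = time.perf_counter()  # TTFT clock stops here
        if count >= max_tokens:          # cap output, as in the methodology
            break
    end = time.perf_counter()
    tps = (count - 1) / (end - first) if count > 1 and end > first else float("nan")
    return {"ttft": first - start, "tps": tps, "total": end - start, "tokens": count}

def fake_provider(delay_first=0.05, delay_rest=0.01, n=30):
    """Simulated provider: a slow first token, then a steady stream."""
    time.sleep(delay_first)
    yield "tok"
    for _ in range(n - 1):
        time.sleep(delay_rest)
        yield "tok"

stats = measure_stream(fake_provider())
print(f"TTFT={stats['ttft']*1000:.0f} ms  TPS={stats['tps']:.0f}  "
      f"total={stats['total']*1000:.0f} ms  tokens={stats['tokens']}")
```

In a real run, the warmup connection would be a prior request on the same session, so the measured TTFT reflects model latency rather than TCP/TLS setup.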
Key takeaways:
- The site provides reliable performance measurements for popular large language models (LLMs), with stats updated daily.
- Three key performance metrics are used: Time To First Token (TTFT), Tokens Per Second (TPS), and Total time, measured from the start of the request until the response is complete.
- The methodology includes running tools daily in multiple data centers, making a warmup connection to remove any HTTP connection setup latency, and performing three separate inferences for each provider, keeping the best result.
- All data and benchmarking tools are publicly available, and suggestions for additional models to benchmark can be submitted via GitHub.
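The best-of-three rule from the methodology can be sketched as a small selection step; `best_of` is a hypothetical helper, not code from the project, and the sample numbers are made up.

```python
def best_of(results, key="ttft"):
    """Keep the best (lowest-latency) of several runs to reduce noise
    from transient network or provider-side variance."""
    return min(results, key=lambda r: r[key])

# Three hypothetical inference runs against one provider.
runs = [
    {"ttft": 0.21, "tps": 55.0},
    {"ttft": 0.17, "tps": 60.0},
    {"ttft": 0.30, "tps": 48.0},
]
best = best_of(runs)
print(best)  # the run with the lowest TTFT is kept
```

Keeping the minimum rather than the mean biases the leaderboard toward each provider's best-case latency, which is a deliberate choice when transient slowdowns would otherwise dominate.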