The article also explains the methodology for measuring these metrics. Tools are run daily in multiple data centers; a warmup connection is made first to eliminate HTTP connection-setup latency; and the TTFT clock starts when the HTTP request is sent and stops when the first token arrives. The number of output tokens is capped at 20, and for each provider three separate inferences are performed, with the best result kept. The raw data, benchmarking tools, and website source code are all publicly available.
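The timing described above can be sketched as a small helper that clocks a token stream. This is a minimal illustration, not the site's actual benchmarking code: `measure_stream` and `fake_provider` are hypothetical names, and the fake provider stands in for a real streaming HTTP response.

```python
import time

def measure_stream(token_iter, max_tokens=20):
    """Clock a token stream: TTFT is the delay until the first token;
    TPS is tokens generated per second after the first token arrives."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_iter:
        count += 1
        if first is None:
            first = time.perf_counter()  # TTFT clock stops here
        if count >= max_tokens:          # cap output, as in the methodology
            break
    end = time.perf_counter()
    tps = (count - 1) / (end - first) if count > 1 and end > first else float("nan")
    return {"ttft": first - start, "tps": tps, "total": end - start, "tokens": count}

def fake_provider(delay_first=0.05, delay_rest=0.01, n=30):
    """Simulated provider: a slow first token, then a steady stream."""
    time.sleep(delay_first)
    yield "tok"
    for _ in range(n - 1):
        time.sleep(delay_rest)
        yield "tok"

stats = measure_stream(fake_provider())
print(f"TTFT={stats['ttft']*1000:.0f} ms  TPS={stats['tps']:.0f}  "
      f"total={stats['total']*1000:.0f} ms  tokens={stats['tokens']}")
```

In a real run, the warmup connection would be a prior request on the same session, so the measured TTFT reflects model latency rather than TCP/TLS setup.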
Key takeaways:
- The site provides reliable performance measurements for popular large language models (LLMs), with stats updated daily.
- Three key performance metrics are used: Time To First Token (TTFT), Tokens Per Second (TPS), and Total time, measured from the start of the request until the response is complete.
- The methodology includes running tools daily in multiple data centers, making a warmup connection to remove any HTTP connection setup latency, and performing three separate inferences for each provider, keeping the best result.
- All data and benchmarking tools are publicly available, and suggestions for additional models to benchmark can be submitted via GitHub.
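The best-of-three rule from the methodology can be sketched as a small selection step; `best_of` is a hypothetical helper, not code from the project, and the sample numbers are made up.

```python
def best_of(results, key="ttft"):
    """Keep the best (lowest-latency) of several runs to reduce noise
    from transient network or provider-side variance."""
    return min(results, key=lambda r: r[key])

# Three hypothetical inference runs against one provider.
runs = [
    {"ttft": 0.21, "tps": 55.0},
    {"ttft": 0.17, "tps": 60.0},
    {"ttft": 0.30, "tps": 48.0},
]
best = best_of(runs)
print(best)  # the run with the lowest TTFT is kept
```

Keeping the minimum rather than the mean biases the leaderboard toward each provider's best-case latency, which is a deliberate choice when transient slowdowns would otherwise dominate.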