The benchmark is designed to be challenging so that it can meaningfully evaluate the current capabilities of LLMs. As models advance, the benchmark is updated with harder questions to maintain a reasonable distribution of model scores. The project also compares the cost of using contemporary LLMs, with a table detailing each model's context length and its price per input and output token. The Kagi LLM Benchmarking Project is inspired by the Wolfram LLM Benchmarking Project and the Aider LLM coding leaderboard.
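To illustrate how per-input and per-output token pricing translates into the cost of an individual request, here is a minimal sketch. The function name, prices, and token counts are placeholders for illustration, not figures from the project's table:

```python
# Hypothetical example: estimating the cost of a single LLM request
# from per-million-token prices, as listed in a pricing table like Kagi's.
# The prices and token counts below are placeholders, not actual figures.

def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Return the dollar cost of one request given per-million-token prices."""
    return (input_tokens / 1_000_000) * price_in_per_m \
         + (output_tokens / 1_000_000) * price_out_per_m

# Example: 2,000 input tokens and 500 output tokens against a model priced
# at $3 per million input tokens and $15 per million output tokens
# (illustrative values only).
cost = request_cost(2_000, 500, price_in_per_m=3.0, price_out_per_m=15.0)
print(f"${cost:.4f}")  # -> $0.0135
```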
Key takeaways:
- The Kagi LLM Benchmarking Project evaluates major large language models (LLMs) on their reasoning, coding, and instruction following capabilities using an unpolluted benchmark.
- The benchmark tests are frequently updated and are mostly novel to avoid overfitting and to provide a rigorous evaluation of the models' capabilities.
- The benchmarking project also provides a comparison of the cost of using contemporary LLMs, including the price per input and output token.
- Kagi Assistant provides access to all of the models shown in bold, with usage included in the Kagi subscription.