The benchmark is designed to be challenging so that it can meaningfully evaluate the current capabilities of LLMs. As models advance, the benchmark is updated with harder questions to maintain a reasonable distribution of model scores. The project also compares the cost of using contemporary LLMs, with a table detailing each model's context length and its price per input and output token. The Kagi LLM Benchmarking Project is inspired by the Wolfram LLM Benchmarking Project and the Aider LLM coding leaderboard.
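To illustrate how per-input and per-output token pricing translates into the cost of an individual request, here is a minimal sketch. The function name, prices, and token counts are placeholders for illustration, not figures from the project's table:

```python
# Hypothetical example: estimating the cost of a single LLM request
# from per-million-token prices, as listed in a pricing table like Kagi's.
# The prices and token counts below are placeholders, not actual figures.

def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_m: float, price_out_per_m: float) -> float:
    """Return the dollar cost of one request given per-million-token prices."""
    return (input_tokens / 1_000_000) * price_in_per_m \
         + (output_tokens / 1_000_000) * price_out_per_m

# Example: 2,000 input tokens and 500 output tokens against a model priced
# at $3 per million input tokens and $15 per million output tokens
# (illustrative values only).
cost = request_cost(2_000, 500, price_in_per_m=3.0, price_out_per_m=15.0)
print(f"${cost:.4f}")  # -> $0.0135
```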
Key takeaways:
- The Kagi LLM Benchmarking Project evaluates major large language models (LLMs) on their reasoning, coding, and instruction following capabilities using an unpolluted benchmark.
- The benchmark tests are frequently updated and are mostly novel to avoid overfitting and to provide a rigorous evaluation of the models' capabilities.
- The benchmarking project also provides a comparison of the cost of using contemporary LLMs, including the price per input and output token.
- Kagi Assistant provides access to all of the models shown in bold, with usage included in the Kagi subscription.